Our Mindfly Blog

Website Design and Development

Url rewriting and SEO duplicate content issues - Canonicalization, HTTP Headers and other topics that can keep you from making friends

by Rusty Swayne 14.May 2008 09:35

At one time I was a Civil Engineer. I seriously have the degree. Amongst my friends, it was generally acknowledged that I was the only one who was not allowed to talk about my job when we got together. And yep, nothing has changed. I would need to be equipped with a defibrillator to keep most of them in a discussion that revolved around a word like canonicalization. However, when it comes to your website and SEO, not understanding this process can really affect the way your site is indexed and perhaps even lead to more serious problems.

Canonicalization is the process of resolving multiple names to a single standard name known as the canonical name (yes, I had to look that up). As the term relates to website urls, many examples you will find will refer to the default page:

  • www.example.com
  • www.example.com/default.aspx
  • example.com/
  • example.com/default.aspx

It may seem intuitive that each of these urls references the same page. In actuality they are different urls, as the web server could return completely different content for any or all of them.
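To make the idea concrete, here is a minimal sketch of collapsing the four variants above into one canonical url. It is written in Python purely for illustration (the site discussed here runs on ASP.NET), and the choice of www.example.com as the preferred host, default.aspx as the default document, and the function name canonicalize are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumed preferred host name
BARE_HOST = "example.com"           # assumed non-www alias of the same site
DEFAULT_DOCUMENT = "default.aspx"   # assumed default page served for "/"


def canonicalize(url: str) -> str:
    """Map the default-page variants onto a single canonical url."""
    # Entries such as "example.com/" carry no scheme, so add one before
    # parsing; otherwise urlsplit would treat the host as part of the path.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # Host names are case-insensitive; fold the bare host into the www host.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    # An empty path and an explicit default document both mean "/".
    path = parts.path or "/"
    if path.lower() == "/" + DEFAULT_DOCUMENT:
        path = "/"

    return urlunsplit((parts.scheme, host, path, parts.query, ""))


if __name__ == "__main__":
    for variant in ("www.example.com", "www.example.com/default.aspx",
                    "example.com/", "example.com/default.aspx"):
        print(variant, "->", canonicalize(variant))
```

A function like this only tells you what the canonical name should be. The server still has to answer the non-canonical variants with a 301 pointing at it.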

Why should you care?

If a search engine fails to identify that each of these urls refers to the same page and treats them as separate locations, duplicate content issues can really impact the way a site is indexed.

The GoogleBot seems to be particularly finicky compared to other search engine bots with respect to duplicate content, and from personal experience I can tell you that having your indexed page count start to drop can happen as quickly as flipping a switch. I recently ran into the following sitemap warning in Google’s Webmaster Tools:

When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot because they contained too many redirects. Please change the URLs in your Sitemap that redirect and replace them with the destination URL (the redirect target). All valid URLs will still be submitted. HTTP Error:

Found: 302 (Moved temporarily)

This is Google speak roughly equivalent to the Soup Nazi way of saying "No soup for you!" as you are banned from their indexing queue and you may begin to see the number of pages on your site start to atrophy.
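The warning asks for every redirecting Sitemap entry to be replaced with its destination url. A minimal sketch of that clean-up follows, in Python for illustration only. The redirect table and both function names are hypothetical; in practice the table would have to come from crawling your own site and recording which urls answer with a 301 or 302.

```python
def final_destination(url, redirects, max_hops=5):
    """Follow a table of known redirects until reaching a url that does not redirect."""
    seen = [url]
    while url in redirects:
        url = redirects[url]
        # Stop on loops and on chains longer than max_hops.
        if url in seen or len(seen) > max_hops:
            raise ValueError("redirect loop or too many hops: "
                             + " -> ".join(seen + [url]))
        seen.append(url)
    return url


def clean_sitemap(urls, redirects):
    """Replace each redirecting entry with its target and drop duplicates."""
    cleaned = []
    for url in urls:
        target = final_destination(url, redirects)
        if target not in cleaned:
            cleaned.append(target)
    return cleaned


if __name__ == "__main__":
    # Hypothetical redirect observed on the site.
    redirects = {
        "http://www.example.com/default.aspx?node=2": "http://www.example.com/2.aspx",
    }
    sitemap = [
        "http://www.example.com/default.aspx?node=2",
        "http://www.example.com/2.aspx",
    ]
    print(clean_sitemap(sitemap, redirects))
```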

It did take me quite a bit of time to sort through all of the possibilities and finally come to the conclusion that I was probably dealing with a site that had been flagged for duplicate content due to a lax canonical naming convention. Matt Cutts has a great post that goes into this in much more detail, but the skinny is that you need to be meticulous about asserting that every link to a page uses the same canonical name and, where possible, limit the ambiguity by adding 301 (Moved Permanently) statuses to the more problematic points in your sitemap. Basically you are saying to the search engine, "I am not trying to trick you, I am just trying to get you to where the content is actually located."

Duplicate content issues and indexing problems can quickly become the bane of your existence as they are often very hard to find.

I spent hours trying to figure out the where and why.

Ironically the issue stemmed from the url rewriting implementation (I say ironically because one generally rewrites the urls of dynamically generated content specifically so that they can be indexed).

Raw urls are not equivalent to their rewritten form

The most obvious canonical naming problem was the fact that I viewed raw urls as being equivalent to their rewritten form (because the code would eventually redirect there anyway).

Example

http://www.example.com/default.aspx?node=2

Is not the same as

http://www.example.com/2.aspx

even though the latter will eventually wind up being rewritten into the first.

I resolved these duplications by testing for the presence of the node querystring parameter in the Application_BeginRequest routine of the Global.asax file. If it is found, I add the following to the response:

' Tell the client (and the search engine) that the raw url has moved for good
Context.Response.Status = "301 Moved Permanently"
Context.Response.AddHeader("Location", "http://www.example.com/2.aspx")
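The decision behind that redirect can be sketched as a small pure function. This is an illustration in Python, not the ASP.NET code from the site: the function name redirect_for_raw_url, the SITE_ROOT constant, the assumption that slugs are numeric, and the choice to carry any other query string parameters across are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode

SITE_ROOT = "http://www.example.com"  # assumed canonical scheme and host


def redirect_for_raw_url(path, query_string):
    """Return (status, location) for a raw url, or None if no redirect applies."""
    params = parse_qsl(query_string, keep_blank_values=True)
    nodes = [value for key, value in params if key == "node"]

    # Only raw requests for default.aspx with exactly one numeric node
    # parameter are redirected; everything else is left alone.
    if path.lower() != "/default.aspx" or len(nodes) != 1:
        return None
    slug = nodes[0]
    if not (slug.isascii() and slug.isdigit()):
        return None

    # Parameters other than the slug stay in the query string.
    rest = [(key, value) for key, value in params if key != "node"]
    location = f"{SITE_ROOT}/{slug}.aspx"
    if rest:
        location += "?" + urlencode(rest)
    return ("301 Moved Permanently", location)


if __name__ == "__main__":
    print(redirect_for_raw_url("/default.aspx", "node=2"))
    print(redirect_for_raw_url("/2.aspx", ""))
```

Note that a check like this has to look at the url as the client requested it. If it ran against the already rewritten url, every rewritten request would be redirected back to itself.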

Only rewrite the “slug parameter(s),” do not rewrite every parameter in your query string.

I had been using an old style DNN method of url rewriting that put query string parameters into the rewritten url rather than leaving them alone.

Example

A page without a secondary QueryString Parameter would be written as
http://www.example.com/2.aspx (2 being the slug in this case)
and would be rewritten to
http://www.example.com/default.aspx?node=2.

A page with a secondary QueryString Parameter may be written as
http://www.example.com/node/2/secondary/22/default.aspx
which would be rewritten to
http://www.example.com/default.aspx?node=2&secondary=22.

Resolution:

All of the previously mentioned references were changed to one of the following formats:

http://www.example.com/2.aspx and http://www.example.com/2.aspx?secondary=22

These two pages have the same canonical name and the secondary query string parameter is still being passed.
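Both halves of that change can be sketched in a few lines: converting an old-style path into the new format, and rewriting the new format to the internal url while leaving the secondary parameter in the query string. The sketch is in Python for illustration only; the function names, the numeric-slug pattern, and the assumption that old-style paths are name/value pairs ending in default.aspx are hypothetical, inferred from the examples above.

```python
import re
from urllib.parse import parse_qsl, urlencode

SLUG_PATH = re.compile(r"^/([0-9]+)\.aspx$", re.IGNORECASE)  # assumes numeric slugs


def rewrite(path, query_string=""):
    """Map a public url such as /2.aspx?secondary=22 to the internal url."""
    match = SLUG_PATH.match(path)
    if match is None:
        return None
    # The slug becomes the node parameter; other parameters pass through.
    params = [("node", match.group(1))]
    params += [(k, v) for k, v in parse_qsl(query_string) if k != "node"]
    return "/default.aspx?" + urlencode(params)


def upgrade_old_style(path):
    """Convert /node/2/secondary/22/default.aspx to /2.aspx?secondary=22."""
    segments = path.strip("/").split("/")
    # Expect name/value pairs followed by default.aspx (an odd segment count).
    if (len(segments) < 3 or len(segments) % 2 == 0
            or segments[-1].lower() != "default.aspx"):
        return None
    pairs = segments[:-1]
    params = list(zip(pairs[0::2], pairs[1::2]))
    slugs = [v for k, v in params if k == "node"]
    if len(slugs) != 1:
        return None
    rest = [(k, v) for k, v in params if k != "node"]
    url = f"/{slugs[0]}.aspx"
    if rest:
        url += "?" + urlencode(rest)
    return url


if __name__ == "__main__":
    print(upgrade_old_style("/node/2/secondary/22/default.aspx"))
    print(rewrite("/2.aspx", "secondary=22"))
```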

As a general rule, if the parameter does not contribute to the unique page reference, I would say leave it as a query string parameter (just don’t use the name "id").

While this experience was very frustrating, the fact that the fix wound up being manageable was a relief. It has forced me to change the way I do things and has given us (Mindfly) an additional set of quality control guidelines to follow.

A couple of related things you should check out

UrlRewritingNet.UrlRewrite (HttpModule)
If you have not checked it out it is worth looking at. It gets great reviews and is very versatile.

ColinCochrane.HttpErrorModule (HttpModule) 
and it's another BlogEngine.net site ... nice!

Comments

Kyle said on May 14, 2008 (12:26)...

Zzzzz. Huh? Wha? *wipes drool off face*

No, really, that was fascinating!

Ok, that may not be the right word for it. But seriously, that is good stuff to know. Duplicate urls = bad.


Powered by BlogEngine.NET 1.2.0.0. Original Design by Heather Alvis.
Bellingham, Washington
Copyright © 2007 Mindfly Inc. All Rights Reserved.