Avoid WordPress Duplicate Content Problems With Google

The best way to ensure a web page ranks well in Google keyword searches is to make sure it is the only page on the web that contains its content. That way you avoid having several web pages with a roughly equal chance of being judged relevant for a given keyword search, and you increase the odds that this unique page will outrank the independent web pages covering the same topic. That’s the theory, and it seems to work out well in practice.

WordPress is great software for producing blogs, but out of the box the WordPress content management system produces a series of pages that all contain the same content. Just see the concerns expressed in the WebmasterWorld thread WordPress And Google: Avoiding Duplicate Content Issues, where several coding suggestions were offered to avoid the problems. More recently, David Bradley has suggested that the canonical link element may be the solution, in Avoiding Duplicate Content Penalties.
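
For reference, the canonical link element is a single tag placed in the <head> of each duplicate page, pointing search engines at the one version you want indexed. A minimal example (the URL here is only a placeholder):

<link rel="canonical" href="http://www.example.com/original-post/" />

Google treats this as a strong hint that the duplicates are alternate views of the canonical URL, so the ranking signals can consolidate on a single page.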

We should quickly add that this is not an inherent weakness of WordPress alone, since many other CMSs suffer from similar problems. It is a well-known problem, and an excellent article on how to Avoid Duplicate Content on WordPress Websites sets out the appropriate steps to take. The most important step of all is to have the right robots.txt file.
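
To make the problem concrete, an out-of-the-box WordPress install typically serves essentially the same post content at several different URLs, all of which compete for the same keywords (the paths here are illustrative):

http://example.com/my-post/           (the post's permalink)
http://example.com/?p=123             (the default query-string address)
http://example.com/category/news/     (a category archive)
http://example.com/2008/05/           (a date archive)
http://example.com/my-post/feed/      (the post's comment feed)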

I wondered how well people were grappling with this duplicate content problem and decided to check some of Technorati’s Blogger Central top 100 blogs. In particular, I thought a check of their robots.txt files, which always sit at the site root (e.g. example.com/robots.txt), would give an indication of whether they had tried to solve the problem. Here is what I found for the robots.txt files of the eight most popular blogs.

  1. The Huffington Post
    # All robots will spider the domain
    User-agent: *
    Disallow:
    # Disallow directory /backstage/
    User-agent: *
    Disallow: /backstage/
  2. TechCrunch
    User-agent: *
    Disallow: /*/feed/
    Disallow: /*/trackback/
  3. Engadget
    (empty)
  4. Boing Boing
    User-agent: *
    Disallow: /cgi-bin
  5. Mashable!
    User-agent: *
    Disallow: /feed
    Disallow: /*/feed/
    Disallow: /*/trackback/

    Disallow: /adcentric
    Disallow: /adinterax
    Disallow: /atlas
    Disallow: /doubleclick
    Disallow: /eyereturn
    Disallow: /eyewonder
    Disallow: /klipmart
    Disallow: /pointroll
    Disallow: /smartadserver
    Disallow: /unicast
    Disallow: /viewpoint

    Disallow: /LiveSearchSiteAuth.xml
    Disallow: /mashableadvertising2.xml
    Disallow: /rpc_relay.html

    Disallow: /browser.html
    Disallow: /canvas.html

    User-agent: Fasterfox
    Disallow: /
  6. Lifehacker
    User-Agent: Googlebot
    Disallow: /index.xml$
    Disallow: /excerpts.xml$
    Allow: /sitemap.xml$
    Disallow: /*view=rss$
    Disallow: /*?view=rss$
    Disallow: /*format=rss$
    Disallow: /*?format=rss$
    Disallow: /*?mailto=true
  7. Ars Technica
    User-agent: *
    Disallow: /kurt/
    Disallow: /errors/
  8. Stuff White People Like
    User-agent: IRLbot
    Crawl-delay: 3600

    User-agent: *
    Disallow: /next/

    # har har
    User-agent: *
    Disallow: /activate/

    User-agent: *
    Disallow: /signup/

    User-agent: *
    Disallow:
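
A note on syntax before commenting on these files: the * wildcards and $ end-of-URL anchors in the TechCrunch and Lifehacker entries are not part of the original robots.txt standard. They are extensions understood by Googlebot and some other major crawlers, as in this illustrative pattern:

User-agent: Googlebot
# * matches any run of characters; $ anchors the match to the end of the URL
Disallow: /*format=rss$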

As you may notice, the most popular blogs at the top of the list seem to show a singular disregard for this issue, with minimal robots.txt files. Moving down the list, however, the somewhat less trafficked of these top blogs appear to take more care in limiting what the search engine robots crawl and index.

The impetus for exploring this issue came after noticing an additional complication that results if you put An Elegant Face On Your WordPress Blog by using Multiple WordPress Loops.
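
For readers unfamiliar with the technique, a second loop is simply an extra query run alongside the main one. A minimal sketch, assuming a hypothetical ‘featured’ category:

<?php
// Secondary loop: fetch five posts from an assumed 'featured' category.
$featured = new WP_Query( array(
    'category_name'  => 'featured',
    'posts_per_page' => 5,
) );

while ( $featured->have_posts() ) {
    $featured->the_post();
    the_title( '<h3>', '</h3>' ); // print the post title wrapped in <h3> tags
    the_excerpt();                // print the post excerpt
}

wp_reset_postdata(); // hand the global post data back to the main loop
?>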

This could have resulted in many extra web pages that humans would likely never see but search engine spiders would certainly crawl. Changes were made to the site architecture to avoid this. To head off other potential duplicate content problems, the current robots.txt file for this blog is as follows:

User-agent: *
# keep WordPress administration and login pages out of the index
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-register.php
Disallow: /wp-login.php?action=lostpassword
# paged and date-based archives duplicate the permalink pages
Disallow: /index.php?paged
Disallow: /?m
# test directory
Disallow: /test/
# feeds repeat post content
Disallow: /feed/
Disallow: /?feed=comments-rss2
Disallow: /?feed=atom
# search result pages duplicate post excerpts
Disallow: /?s=
Disallow: /index.php?s
# trackback and XML-RPC endpoints are not for browsers
Disallow: /wp-trackback
Disallow: /xmlrpc
# per-post RSS feeds
Disallow: /?feed=rss2&p

Conclusion

Getting the robots.txt file right is one of the easiest ways of increasing the visibility of your blog pages in search engine keyword searches. Leaving two essentially similar web pages crawlable means that the two divide up the ‘relevance’ that a single web page would have, which can approach a 50% reduction in potential keyword ranking for each. Perhaps the top blogs can ignore such improvements, but most of us should not. Check what the spiders may crawl by evaluating your website with Xenu Link Sleuth. We should all consider our robots.txt files carefully and make sure they are doing an effective job. Is yours?

Update

Andy Beard added a comment saying that he has concerns about using the robots.txt file as a solution to the WordPress duplicate content problem. He explained these some time ago in a post called SEO Linking Gotchas Even The Pros Make. There is much food for thought there, and we will follow up in a subsequent post.

Foolish Footers

 
Footer – the foundation of a building

The footers we are talking about here are those defined by Google as follows: text printed in the bottom margin of each page in a word processing document. Although, as you will find later, that other definition, the footer as the foundation of a building, is also worth thinking about. Here we are concerned with the online version, on which, by coincidence, the knowledgeable Ann Smarty has recently offered the following advice: handle your site footers wisely. In summary she concludes:

  • make your website footer relevant and useful;
  • don’t add too many elements to the footer – it should be clean and concise;
  • focus on people (SEO value of the footer is too insignificant anyway);
  • follow the common fashion: people want to see common elements at these common places.


As a general rule, that seems eminently sensible advice. However, I noticed that two very successful bloggers, Darren Rowse and John Chow, have adopted a somewhat different approach. Go to either of their blogs and scroll down to the bottom of the web page. What do you find? In both cases there is a full screen of footer information. That got me thinking.

So often our approach to web pages is conditioned by our much longer association with the printed page. That is where the word footer comes from, and it suggests minimal content. Consider, however, the way in which many people arrive at a web page. Either someone gave them a link and they are going there for the content, or they did a keyword search and ended up on that page, again looking for content. Most of them are not interested in information about the blog owner or the rest of the blog when they arrive.

Of course, the blog owner may wish them to look at advertisements, which help to monetize the blog and ensure its survival. If those advertisements are from Google, then Google is working very hard to provide advertisements that will be of interest to visitors to the web page. If so, there is every incentive to ensure that both content and advertisements appear ‘above the fold’, in other words on the initial screen that is viewed.

If visitors want more information on other items in the blog or on the blog author, they are certainly motivated to wander around a little and find what they are looking for. This suggests such information can go ‘below the fold’, since visitors will naturally scroll down to find it. In consequence, this blog now has an extended footer giving even more information than those of Darren Rowse and John Chow. Clicking the Full Blog Info link will bring the footer onto your screen; it takes up about a screenful on a 1024 x 768 resolution monitor. I believe it is a very logical approach, even though it seems to go against standard practice.

It may not appeal to everyone, since it is somewhat unusual. However, I don’t believe it is foolish, and I am most interested in visitors’ reactions. Why not add your thoughts on how this different approach works for you?