Avoid WordPress Duplicate Content Problems With Google
The best way to ensure a web page ranks well in Google keyword searches is to make sure it is the only page on the web that carries its content. That way you avoid several web pages all having a roughly equal chance of being judged relevant for a particular keyword search, which increases the chance that the unique page will outrank independent web pages covering the same topic. That’s the theory, and it seems to work out well in practice.
WordPress is great software for producing blogs, but out of the box the WordPress content management system produces a series of pages that all contain the same content. Just see the concerns expressed in this WebmasterWorld thread about WordPress And Google: Avoiding Duplicate Content Issues, where several coding suggestions were offered to avoid the problems. More recently, David Bradley has suggested that the canonical link element can be the solution to Avoiding Duplicate Content Penalties.
We should quickly add that this is not an inherent weakness of WordPress alone, since many other CMSs suffer from similar problems. It is a well-known problem and you can find an excellent article on how to Avoid Duplicate Content on Wordpress Websites, which gives the appropriate steps to take. The most important step of all is to have the right robots.txt file.
I wondered how well people were grappling with this duplicate content problem and decided to check out some of Technorati’s Blogger Central / top 100 blogs. In particular, I thought a check of their robots.txt files would give an indication of whether they had tried to solve the problem. Here is what I found in the robots.txt files of the eight most popular blogs.
- The Huffington Post
- TechCrunch
- Engadget
- Boing Boing
- Mashable!
- Lifehacker
- Ars Technica
- Stuff White People Like
The Huffington Post:

# All robots will spider the domain
User-agent: *
Disallow:

# Disallow directory /backstage/
User-agent: *
Disallow: /backstage/

TechCrunch:

User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/

Engadget:

(empty)

Boing Boing:

User-agent: *
Disallow: /cgi-bin

Mashable!:

User-agent: *
Disallow: /feed
Disallow: /*/feed/
Disallow: /*/trackback/
Disallow: /adcentric
Disallow: /adinterax
Disallow: /atlas
Disallow: /doubleclick
Disallow: /eyereturn
Disallow: /eyewonder
Disallow: /klipmart
Disallow: /pointroll
Disallow: /smartadserver
Disallow: /unicast
Disallow: /viewpoint
Disallow: /LiveSearchSiteAuth.xml
Disallow: /mashableadvertising2.xml
Disallow: /rpc_relay.html
Disallow: /browser.html
Disallow: /canvas.html

User-agent: Fasterfox
Disallow: /

Lifehacker:

User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Disallow: /*?mailto=true

Ars Technica:

User-agent: *
Disallow: /kurt/
Disallow: /errors/

Stuff White People Like:

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

# har har
User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /signup/

User-agent: *
Disallow:
As you may notice, the most popular blogs seem to have a singular disregard for this issue: their robots.txt files are minimal. Further down the list, it would seem that even these top blogs realize the importance of limiting what the search engine robots crawl and index.
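To compare files like these more systematically, here is a minimal Python sketch that groups the Disallow paths in a robots.txt file by user-agent. The function name `parse_robots` and the two sample constants are illustrative assumptions, not part of any library; the sample text is copied from the first two files listed above. It only reads the User-agent and Disallow lines and makes no attempt to interpret wildcards.

```python
from typing import Dict, List

# Sample text copied from the first two robots.txt files in the list above.
FIRST_SAMPLE = """# All robots will spider the domain
User-agent: *
Disallow:

# Disallow directory /backstage/
User-agent: *
Disallow: /backstage/
"""

SECOND_SAMPLE = """User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
"""


def parse_robots(text: str) -> Dict[str, List[str]]:
    """Group non-empty Disallow paths by the User-agent lines they follow.

    Hypothetical helper for illustration only.
    """
    rules: Dict[str, List[str]] = {}
    agents: List[str] = []   # user-agents of the group being read
    in_rules = False         # True once the current group has a directive
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if in_rules:     # a User-agent after directives starts a new group
                agents, in_rules = [], False
            agents.append(value)
            rules.setdefault(value, [])
        else:
            in_rules = True
            # An empty Disallow blocks nothing, so it is not recorded.
            if field.lower() == "disallow" and value:
                for agent in agents:
                    rules[agent].append(value)
    return rules


if __name__ == "__main__":
    for name, text in (("first", FIRST_SAMPLE), ("second", SECOND_SAMPLE)):
        print(name, parse_robots(text))
```

Counting the paths returned for each user-agent gives a rough measure of how much effort a site has put into steering the spiders.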
The impetus for exploring this issue came after noticing an additional complication that results if you put An Elegant Face On Your WordPress Blog by using Multiple WordPress Loops.
This could have resulted in many extra web pages that humans would likely not see but search engine spiders would certainly crawl. Changes were made in the site architecture to avoid this. To avoid other potential duplicate content problems, the current robots.txt file for this blog appears as follows:
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /wp-register.php
Disallow: /wp-login.php?action=lostpassword
Disallow: /index.php?paged
Disallow: /?m
Disallow: /test/
Disallow: /feed/
Disallow: /?feed=comments-rss2
Disallow: /?feed=atom
Disallow: /?s=
Disallow: /index.php?s
Disallow: /wp-trackback
Disallow: /xmlrpc
Disallow: /?feed=rss2&p
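Rules like these are matched as plain prefixes of the URL path and query string. The sketch below checks a few sample URLs against them; the function name `blocking_rule` and the sample URLs are hypothetical, and the matching is deliberately simplified, since real crawlers also handle Allow lines and wildcards such as `*` and `$`.

```python
from typing import Optional, Sequence

# Disallow rules copied from the robots.txt above (User-agent: *).
DISALLOW = [
    "/wp-login.php", "/wp-admin/", "/wp-register.php",
    "/wp-login.php?action=lostpassword", "/index.php?paged", "/?m",
    "/test/", "/feed/", "/?feed=comments-rss2", "/?feed=atom", "/?s=",
    "/index.php?s", "/wp-trackback", "/xmlrpc", "/?feed=rss2&p",
]


def blocking_rule(url_path: str,
                  rules: Sequence[str] = DISALLOW) -> Optional[str]:
    """Return the longest Disallow prefix matching url_path, or None.

    Simplified assumption: pure prefix matching, no wildcards, no Allow.
    """
    matches = [rule for rule in rules if url_path.startswith(rule)]
    return max(matches, key=len) if matches else None


if __name__ == "__main__":
    # Hypothetical URLs, chosen only to exercise the rules.
    for path in ("/?s=duplicate+content", "/?m=200906", "/feed/",
                 "/2009/06/sample-post/", "/2009/06/sample-post/feed/"):
        rule = blocking_rule(path)
        print(path, "blocked by " + rule if rule else "crawlable")
```

Under this prefix matching, a per-post feed URL such as the hypothetical /2009/06/sample-post/feed/ is not caught by the /feed/ rule, which may be why some of the files above use patterns like /*/feed/ instead.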
Conclusion
Getting the robots.txt file correct is one of the easiest ways of increasing the visibility of your blog pages in search engine keyword searches. Leaving two essentially similar web pages means that the two divide up the ‘relevance’ that a single web page would have. That means approaching a 50% reduction in potential keyword ranking. Perhaps the top blogs can ignore such improvements but most of us should not. Check out what the spiders may crawl by doing an evaluation of your website with Xenu Link Sleuth. We should carefully consider our robots.txt files and make sure they are doing an effective job. Is yours?
Update
Andy Beard added a comment that he has concerns about using the robots.txt file as a solution to the WordPress Duplicate Content problem. He explained these in a post some time ago called SEO Linking Gotchas Even The Pros Make. There is much food for thought there and we will follow up in a subsequent post.
















June 4th, 2009 at 11:19 am
http://www.xml-sitemaps.com/ – Another good way of checking broken links while generating a site map for your content website. There is a wordpress plugin of this same script, but it doesn’t provide the broken link checking.
June 4th, 2009 at 11:47 am
Great and timely post! I know about the duplicate content issues in WP and wanted to brush up on the details because I’ve just launched another blog. This is exactly what I was looking for.
Here’s one other potential duplicate content issue that I didn’t notice mentioned here:
True, you want Google to see your individual blog post page as the only page with the post’s content. However, by default WordPress also publishes the full blog post on the home page of your blog. One common way to avoid this is to place the “more” comment tag under the first paragraph of your blog post; then only that first paragraph will appear on the home page.
I don’t want any of the tag pages, archive pages, or category pages indexed. I do however want these pages followed because they eventually lead to the posts which I have at the bottom of the category silos.
June 4th, 2009 at 12:54 pm
Ugh I HATE duplicate content. I had to get rid of my entire Blog API for Drupal because it kept duplicating my content. I wish I would have read your post. Great post!
June 4th, 2009 at 7:42 pm
This is something I need to check on my site… I’ve heard a lot about the duplicate content issue with Wordpress, but I haven’t ever followed up on it. Thanks for the informative post.
~ Kristi
June 5th, 2009 at 10:35 am
You are creating hanging/dangling pages
June 5th, 2009 at 3:52 pm
You raise an interesting point, Andy. I appreciate the information you provided in our e-mail exchange. As a result I re-examined the blog architecture and made some small changes to it. I slightly revised the post in consequence and added an Update comment. Thanks for your inputs.
June 5th, 2009 at 9:22 pm
[...] Cre8asite teaches us how to avoid Wordpress duplicate content problems with Google. [...]
June 6th, 2009 at 4:00 am
[...] Cre8asite teaches us how to avoid Wordpress replicate problems. [...]
June 8th, 2009 at 10:39 am
Very well-written post. However, I would recommend that you turn the No Follow off in your comment section.
Keep up the good work.
June 8th, 2009 at 10:59 am
Thanks for the kind words. In fact the nofollow is turned off using a WordPress plugin.
June 8th, 2009 at 12:03 pm
Finally, someone who can write a good blog! This is the kind of information that is useful to those who want to improve their SERPs. I loved your post and will be telling others about it. Subscribing to your RSS feed now. Thanks
June 8th, 2009 at 3:30 pm
Good post – had to deal with this with a new client recently.
June 15th, 2009 at 6:32 am
Good points even if this post is a little bit old. I’ve been getting used to using Wordpress for a few months only but my question is similar to the commenter above.
What about tags? I see Google indexing not only my main pages but if I have 10 tags per post then that’s another 10 pages, theoretically, that Google is indexing but, it’s obviously duplicate content barring 1 word, the tag itself.
Any tips?
June 15th, 2009 at 6:41 am
Both for Tag Archive pages and Category Archive pages, I suggest only showing a list of the post Titles. That avoids all duplicate content issues.
June 15th, 2009 at 9:34 am
Good post, have been looking for a good solution for duplicate content for a while.
June 16th, 2009 at 3:54 pm
I’m sure by now there will be a plug-in to solve the problem. However, if not then it does seem to get around the problem fine. Good stuff.
June 17th, 2009 at 6:26 am
I think you are absolutely right – robots.txt is a very simple solution for the duplicate content problem. By the way, it is very helpful to check the robots.txt files of other sites, looking for any new items in the file.
June 22nd, 2009 at 7:43 am
Many bloggers are still confused about duplicate content. Is there a useful WP plugin to avoid the duplicate content problem in WordPress? Thanks for sharing.
June 22nd, 2009 at 7:54 am
Unfortunately this problem cannot be handled just by a single plugin. There are several steps to get it right.
June 22nd, 2009 at 10:29 am
Good article. This is an excellent way to boost the optimization of a wordpress website. Unfortunately for those who aren’t technically inclined it is also a bit annoying. I also agree with the first commenter, an xml map is an excellent way to make sure that google won’t crawl any pages that you didn’t mean for it to crawl.
June 23rd, 2009 at 2:43 pm
A website that isn’t coded correctly can make an SEO’s life a living nightmare, and it doesn’t matter how many great links you get for the site: it won’t move in the rankings until you get this problem corrected. An XML sitemap is a great first step but, like the article says, all of the coding should be fixed before attempting off-page SEO techniques.
July 2nd, 2009 at 9:49 pm
Thanks for sharing the useful information. Another way of avoiding this problem within a site is to use a duplicate content checker. There are plenty of these available online, and all they require is that you enter the URL of the site you wish to have analyzed. Within a matter of minutes, you can see if there are any problems and, if so, make the necessary changes.
July 3rd, 2009 at 11:15 pm
WordPress is excellent blogging software but can produce many web pages with duplicate content.