Google Duplicate Content And WordPress – An Unresolved Problem

Of all the topics that come up frequently in SEO discussions, duplicate content is at the head of the list.  It comes up in two contexts.  The first concerns scraper sites, which spammers set up to generate backlinks by stealing copy from the original, legitimate authors.

The second context, which is the topic of this article, is the duplicate content that is created by WordPress.  There may be many arguments why duplicate content is good for human readers, but it certainly creates problems with the search engines.

The reasons for these problems are really at the heart of a post that Matt Cutts wrote some years back, describing his enthusiasm for blog posts and how well Google was handling them.  Minty Fresh Indexing, he called it.  Google was handling them so well because of the relatively new BlogSearch and its ability to deal with news feeds.  As he said, a post could be visible in a keyword search within an hour or two, as opposed to the much slower entry of static web pages through spider crawling.

What was not so clear at the time was the associated problem: news feeds provide a parallel content stream that is almost exactly the same as the blog itself.  In this post we will discuss the ways that have been suggested to solve this duplicate content problem and describe the gap they still leave.  We will then offer a solution to fill that gap.

The Classical Way to Reduce WordPress Blog Duplicate Content

Blogs come with major pluses and major minuses.  They naturally have a large number of internal hyperlinks, which helps the search engine spiders to identify the full blog structure rapidly.  On the other hand, they also produce a large number of similar pages, which may cause confusion in keyword searches when different web pages come up with the same content.  The blog Home Page presents particular problems.

Blogs started off as journals, in other words a chronological listing of items the author found worth recording.  To avoid the Home Page becoming inordinately long, WordPress offers some code that allows an article to be split, showing only a short introductory excerpt followed by a Read More link to the remainder of the article.  A blog Home Page set up using this More link looks a little like a newspaper front page.  It has a series of teaser items which require you to go to another page in the blog to read the full article.
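In WordPress the split point is marked by placing the <!--more--> tag in the post content.  WordPress does the splitting itself in PHP, so the following is only a minimal Python sketch of the idea, not the WordPress implementation.  The function name split_post and the sample post are hypothetical.

```python
# Illustration only: WordPress performs this split internally in PHP.
MORE_TAG = "<!--more-->"


def split_post(content):
    """Return (teaser, remainder) for a post body.

    The teaser is what a Home Page using the More link would show;
    the remainder is only visible on the post's own page.
    If the tag is absent, the whole post is the teaser.
    """
    teaser, _, remainder = content.partition(MORE_TAG)
    return teaser.strip(), remainder.strip()


# Hypothetical post body with a More tag after the first paragraph.
post = "Intro paragraph.\n<!--more-->\nFull article body."
teaser, rest = split_post(post)
print(teaser)
print(rest)
```

With the tag in place, only the teaser text appears on the Home Page, so the Home Page no longer repeats the full article.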

With physical newspapers, this approach of ‘Continued on Page Twelve’ seems the right compromise.  The newspaper reader can then rapidly scan all the items on the front page and move to whichever article takes their fancy.  Physically it would be difficult to maneuver a huge newspaper front page that had all the articles printed out at length.

Applying the same teaser approach to a blog Home Page is not necessarily something human readers will find satisfactory.  A reader can navigate rapidly through a very long scrolling web page when each article is to be found in its entirety farther down the page.  However, showing full posts without the More link presents its own problem.

Many recommend having full post entries in the RSS newsfeed since readers of the newsfeed may prefer to stay with the newsfeed rather than switching across to the blog to read each individual article.

In this case, the news feed and the Home Page will have remarkably similar content.  Several different solutions have been proposed to deal with these kinds of problems and we will list them here.  However, they leave a gap, which we will describe.

Solutions To WordPress Duplicate Content Problems

Some of these cover other issues in addition to the duplicate content problem, but largely they cover the same ground.

How to Make a WordPress Blog Duplicate Content Safe
November 30th, 2006 – Oleg Ishenko – A good discussion of the Read More approach and related issues.

Make WordPress Search Engine Friendly
March 19, 2007 – a video from Michael Gray

SEO for WordPress – The Complete Guide
March 2007 – Jim Westergren

Fighting Duplicate Content On WordPress
July 25, 2007 – Chris Walker

Avoid WordPress Duplicate Content Problems With Google
an earlier post on this blog on June 3rd, 2009 – using the robots.txt file for the duplicate content problem.

A Search Engine Visibility Gap Sometimes Left By These Solutions

The following logical steps will show you the gap in search engine visibility that can develop where both the Home Page and the RSS Newsfeed contain full posts.

  1. To avoid duplicate content being indexed, the robots.txt file disallows the feed from being crawled by the robots.
  2. When a new post is added, the search engines are pinged.
  3. When the Google robot checks the source of the ping, the news feed is excluded from the crawl.
  4. Accordingly the post content in the feed cannot be included in the Google BlogSearch results, which are based entirely on news feed content.
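
The steps above can be sketched with Python's standard urllib.robotparser module.  This is only an illustration of how a crawler that obeys robots.txt would behave, not a description of how Google's own crawlers work.  The domain, the post URL and the ExampleBot user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Step 1: a hypothetical robots.txt that blocks the feed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs a robots.txt-respecting crawler may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Steps 2 and 3: after a ping, the crawler considers these URLs.
candidates = [
    "https://example.com/",
    "https://example.com/2009/06/new-post/",
    "https://example.com/feed/",
]
crawlable = allowed_urls(ROBOTS_TXT, "ExampleBot", candidates)

# Step 4: the feed URL is missing from the list, so a service that
# relies only on feed content would have nothing to index.
print(crawlable)
```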

I base my statements on personal experience but others may find the logical steps controversial.  The following references show the debate that has gone before on this.

Search Engine Optimization for WordPress

The WordPress Codex gives the following as an example WordPress robots.txt file:

User-agent: *
Disallow: /feed

Adding A Robots.txt File Has Increased My Google Traffic By 16% In 4 Days
12 April 2007 – Everton notes that adding a robots.txt file increased his Google traffic by 16% in 4 days.  His robots.txt file includes the following:

User-agent: *
Disallow: /feed/
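
The two robots.txt examples above differ only in the trailing slash, and that difference matters because Disallow rules are matched as path prefixes.  The following sketch uses Python's standard urllib.robotparser to compare the two spellings.  Python's parser does simple prefix matching and individual search engines may differ in detail, so treat the result as indicative only.  The sample paths, including /feedback/ and /my-post/feed/, are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(disallow_rule, paths, user_agent="*"):
    """Return the paths that a single Disallow rule blocks."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_rule}"])
    return [
        path
        for path in paths
        if not parser.can_fetch(user_agent, "https://example.com" + path)
    ]


# Hypothetical paths on a blog, chosen to show the prefix behaviour.
PATHS = ["/feed", "/feed/", "/feed/atom/", "/feedback/", "/my-post/feed/"]

no_slash = blocked_paths("/feed", PATHS)     # rule without trailing slash
with_slash = blocked_paths("/feed/", PATHS)  # rule with trailing slash

print(no_slash)
print(with_slash)
```

In this sketch the rule without the slash also blocks an unrelated page such as /feedback/, while the rule with the slash leaves the bare /feed path open.  Neither rule blocks a feed path nested under a post, because the match is anchored at the start of the path.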

Feeds in the search results?
03 October 2007 – Joost de Valk offers the following assurance:

If you’re afraid blocking indexation of your feed might cause you to loose traffic from Google BlogSearch and/or Technorati, it won’t. Google BlogSearch uses FeedFetcher, which doesn’t observe robots.txt, and neither does Technorati. They both seem to be under the impression that pinging a blog search engine is enough consent to get it indexed, while others have suggested that pinging Technorati on behalf of others might be a nice way of improving your Technorati authority.

Webmaster Central Discussion
28 August 2009 – In a discussion on robots.txt disallowing certain URLs, a celebrated Google Employee, JohnMu, offered the following:

Generally speaking, I would also recommend not disallowing access to your feed since this is used by Google BlogSearch.

Overall I would rely on this more recent information from a Google employee rather than on an outsider’s judgment, however much I respect that individual.

Unfortunately JohnMu’s advice would leave us with both the Home Page and the RSS News Feed visible to the search engines and representing duplicate content.  How do we avoid that and still make sure that blog posts are found in Google BlogSearch?

Does Feedburner Provide A Solution To This Google BlogSearch Gap?

One way that might come to mind is to register (burn a feed) with what is now Google Feedburner.  The feeds created on Feedburner are then equivalent to the feeds that are produced by the blog itself.  However, they would still represent duplicate content just like the original feeds, so that is not a solution.

A Duplicate Content Solution With The LMNHP Approach

The LMNHP Approach, in addition to the other advantages it gives for search engine visibility of the original blog posts, provides a solution for the duplicate content BlogSearch problem.  With LMNHP, there is no regular web page on the website that is similar in content to the RSS News Feed with its string of full post entries.  The ‘front page’ of the blog contains only a single blog post entry.  In consequence the RSS News Feed does not need to be blocked from robots and is available to Google BlogSearch, as JohnMu recommended.

As far as I am aware, there is no other solution to this duplicate content problem if the blog Home Page contains a series of full posts and the RSS news feed also contains the full posts.  This points once more to the somewhat illogical structure of the blog Home Page, whether it contains full posts or a series of teaser items.  The simplest answer, again, is to adopt the No Home Page approach.

10 thoughts on “Google Duplicate Content And WordPress – An Unresolved Problem”

  1. Yes, oempak. Adding the source is a nice courtesy to the original author, but it’s still duplication. Duplication is really concerned with whether the search engines think the content is essentially similar. In which case, depending on the exact keyword query either one or the other might be shown. This obviously reduces the chance that the original source article is displayed in a SERP where really it is the most relevant item.

  2. Very good article! Duplicate content is something I work a lot to avoid. Most important, I believe, is to put “no follow” on the kinds of functions that can create duplicate content, such as “tags”, “categories”, “date” and so on. And have “do follow” on links that do not give duplicate content. For I believe Google dislikes sites that do not give out links, either. For their entire system is based precisely on it. 🙂

  3. I use Feedburner as well for my feeds instead of the default WP. The problem is it seems to automatically create it no matter what. The WP feed is not linked anywhere on my website but you can still access it directly. Is there a way to turn that off?

  4. A great tool that I used for my San Diego real estate site is DupeFree Pro. However, they recently announced that their program was no longer synchronizing with Google. A fix is apparently in the works. The biggest challenge for me was switching house listings to a new site after they’d been cached on one of my other sites. It was my content, but I didn’t want the homes descriptions to be considered duplicates.

  5. I always thought that Google loved WordPress, so it would understand that that’s what WordPress does and not give penalties for dupe content. Just a point: if I use tags in my post, does that duplicate each post?

  6. There is no Google penalty as such. It’s just that if either of two pages could be relevant, then the relevance of each compared to other web pages will be reduced. Whereas if there’s only one page that’s relevant, then that will rank higher.

    As for your tags question, it depends what you show on your tags pages. I show only titles so there’s little duplicated content in that.
