Of all the topics that come up frequently in SEO discussions, duplicate content is at the head of the list. It comes up in two contexts. The first concerns the scraper sites created by spammers to generate backlinks, which they do by stealing copy from the original, legitimate authors.
The second context, which is the topic of this article, is the duplicate content created by WordPress itself. There may be arguments that duplicate content is useful for human readers, but it certainly creates problems with the search engines.
The reasons for these problems lie at the heart of a post Matt Cutts wrote some years back about his enthusiasm for blog posts and how well Google was handling them. Minty Fresh Indexing, he called it. The reason Google was handling them so well was the then relatively new Google BlogSearch and its ability to deal with news feeds. As he said, a post could be visible in a keyword search within an hour or two, as opposed to the much slower spider-crawl indexing of static web pages.
What was not so clear at the time was an associated problem: news feeds provided a parallel content stream that was almost exactly the same as the blog itself. In this post we will review the ways that have been suggested to solve this duplicate content problem, describe the gap that they still leave, and then offer a solution to fill that gap.
The Classical Way to Reduce WordPress Blog Duplicate Content
Blogs came with major pluses and major minuses. They naturally have a large number of internal hyperlinks, which helps the search engine spiders identify the full blog structure rapidly. On the other hand, they also produce a large number of similar pages, which may cause confusion in keyword searches when different web pages come up with the same content. The blog Home Page presents particular problems.
Blogs started off as journals, in other words a chronological listing of items the author found worth recording. To avoid the Home Page becoming inordinately long, WordPress offers a quicktag that splits an article, showing only a short introductory excerpt followed by a Read More link to the remainder. A blog Home Page set up using this More link looks a little like a newspaper front page: a series of teaser items that require you to go to another page in the blog to read the full article.
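For illustration, the split is made by placing WordPress's More quicktag inside the post body (the surrounding text here is placeholder copy):

```
The opening paragraph, which appears as the teaser on the Home Page.
<!--more-->
The rest of the article, shown only on the single post page.
```

Everything above the tag becomes the excerpt on the Home Page; the Read More link leads to the full post.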
With physical newspapers, this approach of ‘Continued on Page Twelve’ seems the right compromise. The newspaper reader can then rapidly scan all the items on the front page and move to whichever article takes their fancy. Physically it would be difficult to maneuver a huge newspaper front page that had all the articles printed out at length.
Applying the same teaser-item approach to a blog Home Page is not necessarily satisfactory for human readers. You can navigate rapidly through a very long scrolling web page if you know that each article is to be found in its entirety farther down the front page. However, adopting that approach, which does not use the More link, presents its own problem.
Many recommend having full post entries in the RSS news feed, since readers of the feed may prefer to stay there rather than switching across to the blog to read each individual article.
In this case, the news feed and the Home Page will have remarkably similar content. Several solutions have been proposed to deal with these kinds of problems, and we will list them here. However, they leave a gap, which we will describe.
Solutions To WordPress Duplicate Content Problems
Some of these cover other issues in addition to the duplicate content problem, but largely they cover the same ground.
How to Make a WordPress Blog Duplicate Content Safe
November 30th, 2006 – Oleg Ishenko – a good discussion of the Read More approach and related issues.
Make WordPress Search Engine Friendly
March 19, 2007 – a video from Michael Gray
SEO for WordPress – The Complete Guide
March 2007 – Jim Westergren
Fighting Duplicate Content On WordPress
July 25, 2007 – Chris Walker
Avoid WordPress Duplicate Content Problems With Google
June 3rd, 2009 – an earlier post on this blog, on using the robots.txt file for the duplicate content problem.
A Search Engine Visibility Gap Sometimes Left By These Solutions
The following logical steps will show you the gap in search engine visibility that can develop where both the Home Page and the RSS Newsfeed contain full posts.
- To avoid duplicate content being indexed, the robots.txt file disallows the feed from being crawled by the robots.
- When a new post is added, the search engines are pinged.
- When the Google robot checks the source of the ping, the news feed is excluded from the crawl.
- Accordingly the post content in the feed cannot be included in the Google BlogSearch results, which are based entirely on news feed content.
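As a sketch of the first step, a robots.txt along the following lines would exclude the feed URLs under a default WordPress permalink setup (the exact paths depend on your configuration, and wildcard support in Disallow rules varies between crawlers, though Googlebot honours it):

```
User-agent: *
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /*/feed/
```

It is exactly this kind of rule that, per the steps above, keeps the feed content out of Google BlogSearch.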
I base these statements on personal experience, but others may find the logical steps controversial. The following references show the debate that has gone before on this point.
The WordPress Codex gives an example WordPress robots.txt file:
Adding A Robots.txt File Has Increased My Google Traffic By 16% In 4 Days
12 April 2007 – Everton notes that a robots.txt file increased his Google traffic by 16% in 4 days. His robots.txt file includes the following:
Feeds in the search results?
03 October 2007 Joost de Valk offers the following assurance:
If you’re afraid blocking indexation of your feed might cause you to lose traffic from Google BlogSearch and/or Technorati, it won’t. Google BlogSearch uses FeedFetcher, which doesn’t observe robots.txt, and neither does Technorati. They both seem to be under the impression that pinging a blog search engine is enough consent to get it indexed, while others have suggested that pinging Technorati on behalf of others might be a nice way of improving your Technorati authority.
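As an aside on how that pinging works: blog software typically notifies ping services with an XML-RPC call named weblogUpdates.ping. The sketch below, using placeholder blog details, just builds the request body that would be POSTed to a ping service such as Ping-O-Matic; it does not actually send anything.

```python
# Sketch: constructing a weblogUpdates.ping request body with Python's
# standard xmlrpc library. The blog name and URL are placeholders.
import xmlrpc.client

blog_name = "Example Blog"          # placeholder
blog_url = "https://example.com/"   # placeholder

# Serialize the XML-RPC call; a real client would POST this XML to the
# ping service's endpoint.
payload = xmlrpc.client.dumps((blog_name, blog_url),
                              methodname="weblogUpdates.ping")
print(payload)
```

WordPress does this automatically for the services listed under its Update Services setting whenever a post is published.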
Webmaster Central Discussion
28 August 2009 – In a discussion on robots.txt disallowing certain URLs, the well-known Google employee JohnMu offered the following:
Generally speaking, I would also recommend not disallowing access to your feed since this is used by Google BlogSearch.
Overall, I would rely on this more recent information from a Google employee rather than an outsider’s judgment, however much I respect that individual.
Unfortunately, JohnMu’s advice would leave both the Home Page and the RSS news feed visible to the search engines, representing duplicate content. How do we avoid that and still make sure that blog posts are found in Google BlogSearch?
Does Feedburner Provide A Solution To This Google BlogSearch Gap?
One way that might come to mind is to register (burn) a feed with what is now Google FeedBurner. The feeds created on FeedBurner are equivalent to the feeds produced by the blog itself. However, they would still represent duplicate content, just like the original feeds, so that is not a solution.
A Duplicate Content Solution With The LMNHP Approach
The LMNHP approach, in addition to the other advantages it gives for search engine visibility of the original blog posts, provides a solution to the duplicate content BlogSearch problem. With LMNHP there is no longer any regular web page on the website that is similar in content to the RSS news feed with its string of full post entries, because the ‘front page’ of the blog contains only a single blog post entry. In consequence, the RSS news feed does not need to be blocked from robots and remains available to the Google BlogSearch process, as JohnMu recommended.
As far as I am aware, there is no other solution to this duplicate content problem when the blog Home Page contains a series of full posts and the RSS news feed also contains the full posts. This points once more to the somewhat illogical structure of the blog Home Page, whether it contains full posts or a series of teaser items. The simplest answer, once more, is to adopt the No Home Page approach.