Should Google Have Smarter Robots?

 
An Open Letter To Matt Cutts On Robots

Dear Mr. Cutts,

I have two purposes in writing. First, I would like to offer my support for the removal of the Google Supplemental Results label. Second, I would like to offer a simple suggestion that might reduce the concerns some webmasters have about the Google Supplemental Index.

On the removal of the Supplemental label: although it provided some information, it was really too crude a measure. Anyone who needed such an imprecise signal of weak performance would likely not be very effective in dealing with it. The Supplemental Index was introduced for computational reasons, to provide the best balance between speed of computation and relevancy of results, at least in Google's estimation. It may appear to separate the sheep from the goats, but this is only a problem if one of your sheep looks too much like a goat.

Where this has turned out to be a problem is with blogs. A recent post by Michael Gray, How WordPress Makes Comments SEO Unfriendly, points out how this can happen. As he said:

I love WordPress I really do. It makes it really easy to publish, however the WordPress developers really need some help sometimes. It seems when there is a choice to make things SE friendly, more often than not they make the worst choice possible.

The big issue he is describing is that blogs produce RSS news feeds as well as blog postings. There is a certain duplication of content between the two, and that can be the trigger that designates web pages as goats. Goats are housed in the Supplemental Index and tend to be less visible in keyword searches. There's the problem.

... and the solution is ...

Like any other competent SEO, my instinctive reaction is that if Google has a problem, then it's up to me to find the solution. Of course, the natural answer is an appropriate robots.txt file that blocks the Google robots so that they see only one copy of any content.
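As an illustration only, and assuming a standard WordPress setup where the feeds live at /feed/, /comments/feed/ and the per-post .../feed/ endpoints, a robots.txt along these lines would keep Googlebot away from the feed copies while leaving the regular pages crawlable:

    # Illustration only: assumes the default WordPress feed URLs; adjust to your own blog.
    User-agent: Googlebot
    Disallow: /feed/
    Disallow: /comments/feed/
    Disallow: /*/feed/

Googlebot understands the * wildcard in Disallow patterns, so the last line covers the per-post and per-category feeds as well.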

It then struck me that the blog postings and the RSS news feeds are both prepared for human beings, and both have value. If anyone has a problem, should it be all those bloggers or should it be Google? If we assume it is Google's problem, is there any obvious solution?

Once my mind was thinking in this direction, a possible solution did come to mind. I apologize, Mr. Cutts, if there is an obvious flaw in what I am about to propose, but I felt it was worth bringing to your attention.

What triggered my thoughts was a post you wrote in April 2006. You were explaining with pride, quite rightly, that Google's crawl caching proxy was reducing the load that your spiders' visits place on websites. You included the following diagram to explain how it works:
[Diagram: the various Googlebots sharing a single crawl caching proxy]
Although different Google services would have different Googlebots, any given one would likely use the cached version of a web page if it were reasonably recent. At the time that sounded like a great idea. Presumably those cached versions would reside in the regular index unless deemed to be goats and assigned to the Supplemental Index. Unless the diagram is misleading, there is no suggestion that different Googlebots would deal with cached versions that were segregated in some way. Of course, images would be handled in their own database (index), but that is a clear distinction since it deals with non-text content.
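To make that picture concrete, here is a minimal sketch, entirely my own illustration rather than anything Google has published, of how such a crawl caching proxy might behave: whichever Googlebot asks first fetches the page from the website, and the other Googlebots are then served the cached copy while it is still fresh.

    import time
    import urllib.request

    # Illustrative sketch only (not Google's actual design): several crawler
    # clients share one cache, so a page is fetched from the website at most
    # once within the freshness window, however many services ask for it.
    class CrawlCachingProxy:
        def __init__(self, max_age_seconds=3600):
            self.max_age = max_age_seconds
            self.cache = {}  # url -> (time fetched, page body)

        def fetch(self, url, client="googlebot"):
            entry = self.cache.get(url)
            if entry and time.time() - entry[0] < self.max_age:
                print(f"{client}: serving cached copy of {url}")
                return entry[1]
            print(f"{client}: fetching {url} from the live site")
            body = urllib.request.urlopen(url).read()
            self.cache[url] = (time.time(), body)
            return body

    proxy = CrawlCachingProxy()
    proxy.fetch("https://example.com/", client="web search bot")
    proxy.fetch("https://example.com/", client="blogsearch bot")  # served from cache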

If I'm understanding correctly, we now have a somewhat paradoxical situation. The regular Google keyword search deals with standard HTML or equivalent web pages. Google Blogsearch deals only with RSS news feeds. So these algorithms examine two quite distinct sets of entities. On the other hand, it would seem that all these entities are held in the same database and may be assigned either to the regular index or to the Supplemental Index.

If this is a correct 'big picture' view, it leads to my suggestion on smarter robots. In fact it requires only a small increase in smartness. Since Blogsearch and the regular search deal with quite different entities, why not segregate the work of the robots? Some would deal only with news-feed-type files; others would deal with regular web pages. By keeping the two in separate databases, the problem of duplication between feeds and web pages would be avoided.
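Again purely as an illustration of the suggestion, and not a claim about how Google's systems are actually built, the extra smartness amounts to routing whatever a robot fetches by its content type, so that feeds and regular web pages end up in separate databases and are never compared as duplicates:

    # Illustrative sketch of the suggestion: route fetched documents into
    # separate indexes by content type, so feeds and HTML pages never sit
    # in the same database and never trigger duplicate-content demotion.
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    web_index = {}   # regular HTML pages, used by the ordinary keyword search
    feed_index = {}  # RSS/Atom news feeds, used only by Blogsearch

    def store(url, content_type, body):
        if content_type in FEED_TYPES:
            feed_index[url] = body
        else:
            web_index[url] = body

    store("https://example.com/a-post/", "text/html", "<html>...</html>")
    store("https://example.com/a-post/feed/", "application/rss+xml", "<rss>...</rss>")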

I hope this suggestion is of value. If it is not, then an explanation of the flaw in the argument may help us all understand better how the Googlebots are behaving.

Respectfully submitted,

Barry Welford

Related: Google Supplemental Label Out, PageRank Next?