Deep Web, NOW Web – more headaches for Google

The NOW Web is the cyberspace that contains all those packets of information or Instants that are moving around.  Think of that Google Chat message you just received or the indication that you now have 10 Friends online in Facebook or that text message you just received on your cell phone.  Many of them last only for a very short period of time and then disappear without trace.  Only a minute fraction of these are associated with a hyperlink or Uniform Resource Locator (URL), so the rest can never be crawled.

They are presumably not part of the Information space that Google wishes to catalogue.  They would never have the hyperlink information that allows the Google algorithms with their PageRank concept to offer them as relevant answers to queries.

Now the New York Times has pointed out that part of this NOW Web does persist but still cannot be reached by crawlers or spiders.  They are using the name Deep Web for that part and suggesting that this is a ‘Deep Web’ That Google Can’t Grasp.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. “The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up.  Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

With millions of databases connected to the Web, and endless possible permutations of search terms, there is simply no way for any search engine — no matter how powerful — to sift through every possible combination of data on the fly.

If that is true for the Deep Web, how much more true is it for the very much bigger NOW Web?

This obviously creates problems for Google with its mission to catalogue all information.  However, there are two consolations for Google:

  • Searching these non-crawlable cyberspaces represents an enormous challenge for any search engine.
  • Google can certainly achieve all its financial objectives by focusing on the crawlable Web.

It all depends on whether Google will follow Peter Drucker’s advice to Focus, Focus, Focus or whether they wish to dream an impossible dream.


Should Google Have Smarter Robots?

 
An Open Letter To Matt Cutts On Robots

Dear Mr. Cutts,

I have two purposes in writing. First, I would like to offer my support on the suppression of the Google Supplemental Results label. Second, I would like to offer a simple suggestion that perhaps could reduce the concerns that some webmasters have on the Google Supplemental Index.

On the Supplemental label suppression, although the label provided some information, it really was too crude a measure. Anyone who needed such an imprecise signal of weak performance would likely not be very effective in dealing with it. The Supplemental Index was introduced for computational reasons to provide the best balance between speed of computation and relevancy of results, at least in Google’s estimation. It may appear to separate the sheep from the goats but this is only a problem if one of your sheep looks too much like a goat.

Where this has turned out to be a problem is with blogs. A recent post by Michael Gray, How WordPress Makes Comments SEO Unfriendly, points out how this can happen. As he said:

I love WordPress I really do. It makes it really easy to publish, however the WordPress developers really need some help sometimes. It seems when there is a choice to make things SE friendly, more often than not they make the worst choice possible.

The big issue he is describing is that blogs produce RSS news feeds as well as blog postings. There is a certain duplication of content between these and that can be a trigger to designate Web pages as goats. Goats are housed in the Supplemental Index and tend to be less visible for keyword searches. There’s the problem.

.. and the solution is ..

Like any other competent SEO, my instinctive reaction is that if Google has a problem, then it’s up to me to find the solution. Of course the natural answer is an appropriate robots.txt file that will block the Google robots so that they only see one copy of any content.
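
As a rough illustration of that kind of robots.txt rule, the sketch below uses Python's standard urllib.robotparser module to check which URLs a rule set would block for Googlebot. The /feed/ path and the yoursite.com URLs are assumptions made for the example; feed locations differ from blog to blog, so this is not a recommended rule for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot away from the feed copy of the content.
# The /feed/ path is an assumption; check where your own blog publishes its feeds.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /feed/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Illustrative URLs only.
post_url = "http://www.yoursite.com/2007/08/my-post/"
feed_url = "http://www.yoursite.com/feed/"

post_allowed = parser.can_fetch("Googlebot", post_url)
feed_allowed = parser.can_fetch("Googlebot", feed_url)

print("post crawlable:", post_allowed)
print("feed crawlable:", feed_allowed)
```

Checking a draft rule set this way before uploading it helps avoid blocking the blog postings themselves by accident.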

It then struck me that the blog postings and the RSS news feeds are prepared for human beings and both have value. If anyone has a problem, should it be all those bloggers or should it be Google? If we assume it is Google’s problem, is there any obvious solution?

Once my mind was thinking in this direction, a possible solution did come to mind. I apologize, Mr. Cutts, if there is an obvious flaw in what I am about to propose but I felt it was worth bringing to your attention.

What triggered my thoughts was a post you wrote in April 2006. You were explaining with pride, quite rightly, that Google with its crawl caching proxy was reducing the load on websites through visits from your spiders. You had the following diagram to explain the functioning:
[Diagram: googlebot.png]
Although different Google services would have different Googlebots, any given one would likely use the cached version of the web page if it were reasonably recent. At the time that sounded a great idea. Presumably those cached versions would reside in the regular index unless deemed to be goats and assigned to the Supplemental Index. Unless the diagram is misleading, there is no suggestion that different Googlebots would deal with cached versions that were segregated in some way. Of course images would be handled in their own database (Index) but that is a clear distinction since it deals with non-text content.

If I’m understanding correctly, we now have a somewhat paradoxical situation. The regular Google keyword search deals with standard HTML or equivalent web pages. The Google Blogsearch deals only with RSS news feeds. So in applying these algorithms two quite distinct sets of entities are examined. On the other hand it would seem that all these entities are held in the same database and may be assigned either to the regular index or to the Supplemental Index.

If this is a correct ‘big picture’ view, then that leads to my suggestion on smarter robots. In fact it’s only a small increase in smartness. Since Blogsearch and the regular search deal with quite different entities, why not segregate the work of the robots? Some would deal only with news feed type files; others would deal with regular web pages. By keeping them in separate databases, the problem of duplication between feeds and web pages would be avoided.

I hope this suggestion is of value. If it is not, then an explanation of the flaw in the argument may help us all understand better how the Googlebots are behaving.

Respectfully submitted,

Barry Welford

Related: Google Supplemental Label Out, PageRank Next?


Make Your Website Search Engine Robot-Friendly

 
Search Engine Robots Read Site Maps Too

In November 2006, all the major search engines for once agreed on new Sitemap standards. Sitemaps.org set out the rules for sitemap files that all the major search engines would follow.

If you use a program such as GSiteCrawler, you can produce a full listing of all the web pages on your website in an XML file: the standard name for this file is sitemap.xml. The search engines do prefer a gzipped version of this file, usually named sitemap.xml.gz. The GSiteCrawler program produces both versions. Although even Microsoft’s MSN/Live subscribed to this standard, they have not yet indicated how they wish to implement it. The other majors have been more helpful.
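
For readers who want to see what such a file contains, here is a minimal Python sketch that builds a bare-bones sitemap and a gzipped copy of it. It is not how GSiteCrawler works internally; the function names and the yoursite.com page list are hypothetical, and only the required loc element of the Sitemaps.org format is written. It assumes Python 3.8 or later.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return the sitemap XML as bytes, with one <url><loc> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def write_sitemaps(directory, urls):
    """Write sitemap.xml and sitemap.xml.gz into the given directory."""
    xml_bytes = build_sitemap(urls)
    out = Path(directory)
    (out / "sitemap.xml").write_bytes(xml_bytes)
    (out / "sitemap.xml.gz").write_bytes(gzip.compress(xml_bytes))


# Hypothetical page list, for illustration only.
pages = [
    "http://www.yoursite.com/",
    "http://www.yoursite.com/about.html",
]
xml_bytes = build_sitemap(pages)
gz_bytes = gzip.compress(xml_bytes)
print(xml_bytes.decode("utf-8"))
```

Calling write_sitemaps(".", pages) would place both files in the current directory, ready to upload to the root of the domain.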

A good way to start is via the website for Google’s Webmaster Tools. Once you have loaded your sitemap file to your domain, you can submit this to Google. An advantage of this approach is that Google will then in due course evaluate the sitemap file and indicate any errors therein.

The real news came up last week when Google, Yahoo! and Ask indicated that another route to inform them of the sitemap file is to include a reference to the precise URL for the sitemap file in the robots.txt file. Every domain should have a robots.txt file, even if it is an empty file. Search engine robots (or spiders) will sometimes visit a domain and check only the robots.txt file. This confirms that the domain is live. Without such a file, an error is recorded. Now you can add anywhere in the file, say at the bottom, an additional line that reads as follows:
Sitemap: http://www.yoursite.com/sitemap.xml.gz

The robots.txt file is normally checked often by search engine spiders. By doing the above, you should quickly get the new file picked up. Ask, Google and Yahoo! are all using this robots.txt file approach.
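
To confirm that the Sitemap line can be read back as intended, a small check along the following lines may help. It is only a sketch: it parses a hypothetical robots.txt with Python's standard urllib.robotparser module (the site_maps() method needs Python 3.8 or later), and the file contents are an assumed example of an otherwise permissive robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow everything, and announce the sitemap at the bottom.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.yoursite.com/sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```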

If you have just loaded up a sitemap file and want to be sure that it is picked up ASAP, you can ping the search engines directly. The following hyperlinks are the appropriate way to do this.

Ask:
http://submissions.ask.com/ping?sitemap=http://www.yoursite.com/sitemap.xml.gz
Google:
http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.yoursite.com/sitemap.xml.gz
Yahoo:
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://www.yoursite.com/sitemap.xml.gz
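
The ping URLs listed above can also be assembled by a short script, which avoids typing mistakes. The Python sketch below only builds the URLs and does not send any request; the endpoints are the ones given in this post, the helper name ping_url is hypothetical, and percent-encoding the sitemap address is an extra precaution assumed here rather than something stated above.

```python
from urllib.parse import quote

# Ping endpoints exactly as listed in this post.
PING_ENDPOINTS = {
    "Ask": "http://submissions.ask.com/ping",
    "Google": "http://www.google.com/webmasters/sitemaps/ping",
    "Yahoo": "http://search.yahooapis.com/SiteExplorerService/V1/ping",
}

# Replace with the real location of your own sitemap file.
SITEMAP_URL = "http://www.yoursite.com/sitemap.xml.gz"


def ping_url(endpoint, sitemap_url):
    """Build a ping URL, percent-encoding the sitemap address (an assumption)."""
    return f"{endpoint}?sitemap={quote(sitemap_url, safe='')}"


ping_urls = {name: ping_url(ep, SITEMAP_URL) for name, ep in PING_ENDPOINTS.items()}
for name, url in ping_urls.items():
    print(f"{name}: {url}")
```

Each resulting URL can then be opened in a browser, as with the hyperlinks above.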

This should provide all the information you need on the sitemap file and how to alert the search engine robots that you have one. If there are additional points, hopefully someone will add them in the comments.

Related:
What’s new with Sitemaps.org? – Official Google Webmaster Central Blog
Use Your Robots.txt To Publish Your Sitemaps Xml File – Cre8asite Forums Discussion
