Subdomains or Subdirectories One More Time

Perhaps it’s the buzz around the launch of Google Plus, but some other hot topics seem to have gone off the boil. The liveliest this year was the effect of the introduction of the Panda algorithm, which grades the quality of web pages. An interesting development on this front seems to have happened without much comment as yet.

PageRank Calculation – Null Hypothesis

Summary

The SEO world continues to be shaken by the hints offered by Matt Cutts last week on the nofollow tag and PageRank sculpting. Much of the alarm, however, may rest on assumptions about PageRank that are not true. It is difficult to prove how things really work, since Google remains cagey about what actually happens. Here we offer a Null Hypothesis, a simple explanation, that people may wish to consider. It should only be replaced by a more complex view when that can be proven to be better.

Introduction

Andy Beard is a keen observer of the Google scene and poses the key question following the Matt Cutts revelations last week: Is PageRank Sculpting Dead & Can Comments Kill Your PageRank?

Has Google in one quick swipe removed all benefit of Dynamic Linking (old school term) or PageRank sculpting (when it became “trendy”), and potentially caused massive penalties for sites nofollowing links for user generated content and comments?

He cites several important articles on this point.

Not everyone is so concerned and Andrew Goodman frankly states, PageRank Sculpting is Dead? Good Riddance.

How Is PageRank Calculated?

The Google website is strangely obscure on how PageRank is calculated. There are a few explanations in Google Answers (no longer supported) by others on such issues as My Page Rank and Page Rank Definition – Proof of convergence and uniqueness. Phil Craven offers a reasonably straightforward account in Google’s PageRank Explained and how to make the most of it. That includes the following:

Notes:
Not all links are counted by Google. For instance, they filter out links from known link farms. Some links can cause a site to be penalized by Google. They rightly figure that webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site, but links from a site can be harmful if they link to penalized sites.
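For readers who want to see the mechanics rather than prose, here is a minimal sketch in Python of the kind of iterative calculation the original PageRank paper describes. It is purely illustrative: the damping factor of 0.85 is the figure quoted in that paper, the three-page graph is invented, and Google’s production system certainly does far more.

# Minimal PageRank sketch (illustrative only, not Google's implementation).
# Each page shares its rank equally among its outgoing links; the damping
# factor d is the chance the random surfer follows a link rather than jumping.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # start with an even spread
    for _ in range(iterations):                     # repeat until values settle
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page) or pages      # dangling page: spread everywhere
            share = d * rank[page] / len(targets)   # split this page's vote equally
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# A tiny invented three-page web: A links to B and C, B to C, C back to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))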

The KISS principle (Keep It Simple, Sweetheart)

The problem in trying to estimate how the Google PageRank process works is that PageRank is only one of the 100 or more factors involved in how web pages rank in keyword searches. Any attempt to prove a given assumption about PageRank therefore becomes a typical statistical exercise in which one tries to infer an explanation from somewhat fuzzy data. That is where the KISS principle comes in. In such a situation, we should perhaps rely on the approach favored by some great minds.

Of two competing theories or explanations, all other things being equal, the simpler one is to be preferred.
William of Occam (of Occam’s Razor)
A scientific theory should be as simple as possible, but no simpler.
Albert Einstein
The Null Hypothesis is presumed true until statistical evidence indicates otherwise.
Sir Ronald Fisher

Given the Matt Cutts suggestions that some find perplexing, what is the simplest account of the Google PageRank process that could explain what is involved?

The Null Hypothesis

What the PageRank process attempts is mind-boggling. It aims to attach a PageRank value to every hyperlink in the total space of web pages and the links between them. It involves an iterative calculation, since the values are interdependent. This PageRank can be considered as the probability that a random surfer (human or robot) will pass down that link as compared with all the other links in the total space (graph).
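One way to picture that probability is simply to simulate the surfer. The sketch below is a toy resting entirely on invented assumptions (the same imaginary three-page graph, an 85% chance of following a link); it is offered only to make the idea of PageRank flowing down a particular link tangible.

import random

def simulate_surfer(links, steps=100_000, d=0.85):
    # Count how often a simulated surfer passes down each individual link;
    # the resulting fractions approximate the per-link probabilities described above.
    pages = list(links)
    clicks = {}                                     # (from_page, to_page) -> count
    page = random.choice(pages)
    for _ in range(steps):
        targets = links.get(page)
        if targets and random.random() < d:
            nxt = random.choice(targets)            # follow one of this page's links
            clicks[(page, nxt)] = clicks.get((page, nxt), 0) + 1
        else:
            nxt = random.choice(pages)              # otherwise jump somewhere at random
        page = nxt
    total = sum(clicks.values())
    return {link: count / total for link, count in clicks.items()}

print(simulate_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))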

As Matt Cutts reminded us last week, even if a web page is excluded by its owner using a robots.txt file, it may still come into consideration if it is included in a sitemap file or has a link coming in from an external web page. Adding an extra flag to each link to record whether it is influential in passing PageRank would increase the complexity of the data enormously. Given that, we propose the following Null Hypothesis (see the foot of the article for a definition of this expression). This could of course be abandoned in favor of a more complex explanation if that could be proven statistically with sufficient confidence, or if Google chose to provide a more accurate account of what is done.

The Null Hypothesis runs as follows. The whole process splits into two phases. The first phase looks at all web pages (URLs) and all associated links to determine the PageRank of each web page, and thus the contributions that would flow down each link. There are no exclusions: this calculation handles every URL in the total Internet space (graph). Modifiers that website owners may have applied, such as robots.txt files or noindex and nofollow tags, do not come into play at this stage. For all URLs and links without exception, values such as those illustrated below would be calculated.

PageRank Chart

The second phase of the process involves how these PageRank values are then used within the search algorithms. Here whatever is specified via robots.txt or nofollow tags would apply. The PageRank contributions from nofollowed links thus would not be included in the calculation. Any filtering factors that Google may wish to apply for bad neighborhoods, etc. would also apply only in this second phase. The underlying web pages and links would still have a first-phase PageRank calculated, but this would in no way influence the second-phase results.
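Expressed as code, the hypothesis might look something like the hedged sketch below. Everything in it is my own illustration rather than anything Google has confirmed: phase one fixes each page’s raw PageRank and how it would be split across its outgoing links; phase two simply discards the contributions flowing down nofollowed (or otherwise filtered) links when the search algorithms come to use them.

# Hypothetical sketch of the two-phase view described above (an assumption,
# not Google's confirmed behaviour).
def phase_two_contributions(raw_rank, links, nofollow, d=0.85):
    """raw_rank: phase-one PageRank per page, computed over the full graph.
    links: page -> list of linked pages; nofollow: set of (from, to) links.
    Returns the PageRank contribution each page actually receives in phase two."""
    received = {page: 0.0 for page in raw_rank}
    for page, targets in links.items():
        if not targets:
            continue
        share = d * raw_rank[page] / len(targets)   # the split was settled in phase one
        for t in targets:
            if (page, t) in nofollow:
                continue                            # contribution ignored, not redistributed
            received[t] += share
    return received

# Illustrative phase-one values for the same invented three-page graph.
raw = {"A": 0.39, "B": 0.21, "C": 0.40}
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(phase_two_contributions(raw, links, nofollow={("A", "B")}))

Note that nofollowing the A-to-B link sends nothing extra to C: C still receives exactly the share fixed in phase one, while B simply gets less. That is the “less than or equal” behaviour discussed under the implications below.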

This two-phase approach would seem to square with what we have been hearing recently. It is offered very much as a Null Hypothesis, so if someone has an Alternative Hypothesis, we look forward to hearing it. Over to you.

Implications of this Null Hypothesis

In the meantime, if this explanation is true, some obvious considerations apply. The basic PageRank calculation is determined by the total set of URLs and links. The PageRank value for a URL is never changed by modifications that apply in the second phase. All that happens in the second phase is that some of the PageRank contributions are ignored. So the effective PageRank that a URL has in the second phase is always less than or equal to what it had in the first phase. A URL can become more prominent only because others become less prominent.

Significant changes can only come about through influences that affect the first phase. These relate to the site architecture rather than to the tagging of individual links. That is, after all, what Matt Cutts was recommending.

Footnote:
See this link for an explanation of Null Hypothesis.

Someone is wrong on the Internet

Matt Cutts of Google has an intriguing slip in his post, “Something is wrong on the internet!” He is referring to this cartoon by xkcd.

Matt Cutts said “something” rather than “someone”. He went on to say:

That comic sums up the internet in one sentence: the scrum of jostling opinions on the web and the optimism that truth can still win out. I was reminded of that comic when someone asked me about a particular way that someone recently tried to get links.

His spam group is perhaps one key way human intervention comes into the Google search process. So his comments later in the post are particularly interesting.

If a website claims to have high-quality information and then deceives the user and serves up malware or off-topic porn, Google considers that spam and takes action on it. Likewise, if a site says that they completely made up a story to get links, Google doesn’t have to trust the links to that site as much.

I really don’t view Google’s role as judging the truthiness of the web. … But if someone is sloppy enough to get caught (or to admit!) making up a fake story, I don’t think Google has to blindly trust those links, either.

It sounds very much as though Google will be acting as the judge. This prompted me to add the following comment to his blog post.

This all seems to be shaking out as it should, Matt. It raised one question in my mind. You did say “I don’t think Google has to blindly trust those links, either.” I believe Google’s policy is to try to do everything in its search process by computer algorithms, since this is scalable. Human intervention should therefore be very limited. Your spam group handles that human intervention with an on/off button, I presume, as it applies to clear spam content.

I’m sure many would be interested to know how you treat websites you are no longer blindly trusting. Do you apply the off button for these with a reminder to check again in say six months? Or is it more like a volume control where you apply a down weighting factor? Or again, is it one of those minus X penalties in the SERPs that some talk about?

Since Google is now suggesting it will be more open than it has been in the past, I hope we will get some clarification on this.

Removing Spam From The Web

Neither Hormel, the maker of Spam, nor Monty Python’s Spamalot would appreciate the sentiment in the title. However the rest of us would very much like to see the end of that other spam that accumulates on the web.

One major influence in creating all this rubbish was Google itself, with its view that the number of inlinks to a web page could be a measure of that page’s importance. Once everyone knew that, the name of the game was to create as many inlinks as you could. Even though this was against the Google Terms of Service, for a time it seemed to work. The view that inlinks are what counts persists and is as strong as ever, even though Google has been improving its ability to root out the spam producers.

Now Matt Cutts, a Googler with some authority on this topic, has written a very clear explanation of what they’re doing about spam. This was triggered by a new service claiming to create “undetectable” spam. The follow-up to that is well described by Loren Baker in a post, Matt Cutts vs. V7N Links: Matt Wins. Another sign of the times is that Wikipedia seems to have gone the nofollow route, adding this tag to all its outlinks. So these will no longer count as inlinks conferring authority in the Google search process.

One would hope that the message will get around and the mindless creation of inlinks or the search for irrelevant reciprocal linking will cease. Unfortunately too many people currently waste too much time and money on these pursuits and spoil the scene for the rest of us.
