PageRank Calculation – Null Hypothesis

Summary

The SEO world continues to be shaken by the hints offered by Matt Cutts last week on the nofollow tag and PageRank Sculpting. Many seem shaken, but perhaps people have made assumptions about PageRank that are not true. It is difficult to prove how things really work, since Google remains cagey about what actually happens. Here we offer a Null Hypothesis, a simple explanation that people may wish to consider. It should only be replaced by a more complex view when that view can be shown to be better.

Introduction

Andy Beard is a keen observer of the Google scene and poses the key question following the Matt Cutts revelations last week: Is PageRank Sculpting Dead & Can Comments Kill Your PageRank?

Has Google in one quick swipe removed all benefit of Dynamic Linking (old school term) or PageRank sculpting (when it became “trendy”), and potentially caused massive penalties for sites nofollowing links for user generated content and comments?

Important articles he cites on this are:

  • Google Loses “Backwards Compatibility” On Paid Link Blocking & PageRank Sculpting
  • Google (Maybe) Changes How the PageRank Algorithm Handles Nofollow
  • No Clarification Forthcoming from Google on Nofollow & PageRank Flow
  • Is What’s Good For Google, Good For SEO

Not everyone is so concerned, and Andrew Goodman frankly states: PageRank Sculpting is Dead? Good Riddance.

How Is PageRank Calculated?

The Google website is strangely obscure on how PageRank is calculated. There are a few explanations in Google Answers (no longer supported) by others on such issues as My Page Rank and Page Rank Definition – Proof of convergence and uniqueness. Phil Craven offers a reasonably straightforward account in Google’s PageRank Explained and how to make the most of it. That includes the following:

Notes:
Not all links are counted by Google. For instance, they filter out links from known link farms. Some links can cause a site to be penalized by Google. They rightly figure that webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site, but links from a site can be harmful if they link to penalized sites.

The KISS principle (Keep It Simple, Sweetheart)

The problem in trying to estimate how the Google PageRank process works is that PageRank is only one of more than 100 factors involved in how web pages rank in keyword searches. Any attempt to prove a given assumption about PageRank involves a typical statistical analysis, where one tries to infer an explanation from somewhat fuzzy data. That is where the KISS principle comes in. In such a situation, we should perhaps rely on the approach favored by some great minds.

"Of two competing theories or explanations, all other things being equal, the simpler one is to be preferred."
– William of Occam (Occam's Razor)

"A scientific theory should be as simple as possible, but no simpler."
– Albert Einstein

"The Null Hypothesis is presumed true until statistical evidence indicates otherwise."
– Sir Ronald Fisher

Given the Matt Cutts suggestions that some find perplexing, what is the simplest explanation of the Google PageRank process that could explain what is involved?

The Null Hypothesis

What the PageRank process attempts is mind-boggling. It aims to attach a PageRank value to every hyperlink in the total space of web pages and the links between them. It involves an iterative calculation, since the values are interdependent. This PageRank can be thought of as the probability that a random surfer (human or robot) will pass down that link, as compared with all the other links in the total space (graph).
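To make the iterative nature concrete, here is a minimal sketch of the classic PageRank power iteration in Python. The tiny link graph, the damping factor of 0.85 and the convergence threshold are illustrative assumptions only; Google's actual calculation is vastly larger and not public.

    # Minimal PageRank power-iteration sketch over a hypothetical link graph.
    # Each page maps to the pages it links out to.
    links = {
        "home":  ["about", "blog"],
        "about": ["home"],
        "blog":  ["home", "about", "post"],
        "post":  ["blog"],
    }

    damping = 0.85                       # assumed damping factor
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform distribution

    for _ in range(100):                 # iterate until the interdependent values settle
        new_rank = {}
        for p in pages:
            # Sum contributions from every page q linking to p: rank(q) / outdegree(q).
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        if max(abs(new_rank[p] - rank[p]) for p in pages) < 1e-9:
            rank = new_rank
            break
        rank = new_rank

    print(rank)   # long-run probability of a random surfer being on each page

Each pass redistributes rank along outgoing links until the interdependent values converge, which is why the result can be read as the long-run probability of finding a random surfer on each page.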

As Matt Cutts reminded us last week, even if a web page is excluded by its owner using a robots.txt file, it may still come into consideration if it is included in a sitemap file or has a link coming in from another, external web page. Carrying an extra indicator on each link to show whether it passes PageRank would increase the complexity of the data enormously. Given that, we propose the following Null Hypothesis (see the foot of the article for a definition of this expression). This could of course be abandoned in favor of a more complex explanation, if that could be proven statistically with sufficient confidence or if Google chose to provide a more accurate account of what is done.

The Null Hypothesis runs as follows. The whole process splits into two phases. The first phase looks at all web pages (URLs) and all their associated links to determine the PageRank of each web page, and thus the contributions that would flow down each link. There are no exclusions: this calculation handles all URLs in the total Internet space (graph). Modifiers that website owners may have applied, such as robots.txt files or tags such as noindex and nofollow, do not come into play at this stage. For all URLs and links without exception, values such as those illustrated below would be calculated.

PageRank Chart

The second phase of the process involves how these PageRank values are then used within the search algorithms. Here, whatever is specified via robots.txt or nofollow tags would apply. The PageRank contribution from nofollow-ed links would thus not be included in the calculation. Any filtering factors that Google may wish to apply for bad neighborhoods, etc., would also only apply in this second phase. The underlying web pages and links would still have a first-phase PageRank calculated, but this would in no way influence the second-phase results.
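The following sketch illustrates the two-phase idea exactly as stated in this Null Hypothesis; it is not a description of Google's implementation. The page names, the first-phase rank values and the nofollow set are all invented. The point is only that phase two reuses the phase-one numbers unchanged and simply leaves nofollow-ed links out when contributions are summed.

    # Phase-two sketch under the Null Hypothesis: reuse first-phase ranks,
    # but ignore contributions flowing down nofollow-ed links.
    links = {
        "home":  ["about", "blog"],
        "about": ["home"],
        "blog":  ["home", "post"],
        "post":  ["blog"],
    }
    rank = {"home": 0.36, "about": 0.20, "blog": 0.30, "post": 0.14}  # illustrative phase-one values
    nofollow = {("blog", "post")}   # links the site owner marked rel="nofollow"

    def second_phase_inflow(page):
        """Contributions flowing into `page` once nofollow-ed links are ignored."""
        return sum(
            rank[src] / len(outs)
            for src, outs in links.items()
            if page in outs and (src, page) not in nofollow
        )

    for p in rank:
        print(p, round(second_phase_inflow(p), 3))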

This two-phase approach would seem to square with what we have been hearing recently. It is offered very much as a Null Hypothesis, so if someone has an Alternative Hypothesis, we look forward to hearing it. Over to you.

Implications of this Null Hypothesis

In the meantime, if this explanation is true, some obvious considerations apply. The basic PageRank calculation is determined by the total set of URLs and links. The PageRank value for a URL is never changed by modifications that apply in the second phase. All that happens in the second phase is that some of the PageRank contributions are ignored. So the effective PageRank that a URL has in the second phase is always less than or equal to what it had in the first phase. A URL can only become more prominent because others have become less prominent.
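A toy calculation makes the point: since the second phase can only drop contributions, never add them, the effective total can never exceed the first-phase total. The numbers below are invented purely for illustration.

    # First-phase view: every incoming contribution counts.
    first_phase_inflow = [0.10, 0.05, 0.02]
    # Second-phase view: the 0.05 link was nofollow-ed, so it is ignored.
    followed_inflow = [0.10, 0.02]

    print(sum(first_phase_inflow))   # 0.17  -> first-phase value
    print(sum(followed_inflow))      # 0.12  -> effective value, never larger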

Significant changes can only come about through influences that affect the first phase. These relate to the site architecture rather than to modifiers on the interlinkages. That is, after all, what Matt Cutts was recommending.

Footnote:
See this link for an explanation of Null Hypothesis.