PageRank Calculation – Null Hypothesis

Summary

The SEO world continues to be shaken by the hints offered by Matt Cutts last week on the nofollow tag and PageRank Sculpting. Many seem rattled, but perhaps people have made assumptions about PageRank that are not true. It is difficult to prove how things really work, since Google remains cagey about what actually happens. Here we offer a Null Hypothesis, a simple explanation, that people may wish to consider. It should be replaced by a more complex view only when that view can be proven to be better.

Introduction

Andy Beard is a keen observer of the Google scene and poses the key question following the Matt Cutts revelations last week: Is PageRank Sculpting Dead & Can Comments Kill Your PageRank?

Has Google in one quick swipe removed all benefit of Dynamic Linking (old school term) or PageRank sculpting (when it became “trendy”), and potentially caused massive penalties for sites nofollowing links for user generated content and comments?

Important articles he cites on this are:

  • Google Loses “Backwards Compatibility” On Paid Link Blocking & PageRank Sculpting
  • Google (Maybe) Changes How the PageRank Algorithm Handles Nofollow
  • No Clarification Forthcoming from Google on Nofollow & PageRank Flow
  • Is What’s Good For Google, Good For SEO

Not everyone is so concerned, and Andrew Goodman frankly states: PageRank Sculpting is Dead? Good Riddance.

How Is PageRank Calculated

The Google website is strangely obscure on how PageRank is calculated. There are a few explanations in Google Answers (no longer supported) by others on such issues as My Page Rank and Page Rank Definition – Proof of convergence and uniqueness. Phil Craven offers a reasonably straightforward account in Google’s PageRank Explained and how to make the most of it. That includes the following:

Notes:
Not all links are counted by Google. For instance, they filter out links from known link farms. Some links can cause a site to be penalized by Google. They rightly figure that webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site, but links from a site can be harmful if they link to penalized sites.

The KISS principle (Keep It Simple, Sweetheart)

The problem in trying to estimate how the Google PageRank process works is that PageRank is only one of 100 or more factors involved in how web pages rank in keyword searches. Any attempt to prove a given assumption about PageRank involves a typical statistical analysis, where one tries to infer an explanation from somewhat fuzzy data. That is where the KISS principle comes in. In such a situation, we should perhaps rely on the approach favored by some great minds.

Of two competing theories or explanations, all other things being equal, the simpler one is to be preferred.
Occam of Occam’s Razor
A scientific theory should be as simple as possible, but no simpler.
Albert Einstein
The Null Hypothesis is presumed true until statistical evidence indicates otherwise.
Sir Ronald Fisher

Given the Matt Cutts suggestions that some find perplexing, what is the simplest explanation of the Google PageRank process that could explain what is involved?

The Null Hypothesis

What the PageRank process attempts is mind-boggling. It aims to attach a PageRank value to any hyperlink in the total space of web pages and their interlinks. It involves an iterative calculation, since the values are interdependent. This PageRank can be considered as the probability that a random surfer (human or robot) will pass down that link as compared with all the other links in the total space (graph).
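
To make that iterative calculation concrete, here is a minimal sketch of the classic PageRank power iteration over a tiny made-up graph. The damping factor of 0.85 and the toy link structure are illustrative assumptions, not Google's actual settings, and real PageRank handles complications (dangling pages, personalization, scale) that are omitted here.

    # Minimal PageRank power iteration over a toy graph (illustrative only).
    # Keys are pages; values are the pages they link out to.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["A", "C"],
    }

    damping = 0.85                                  # the usual "random surfer" damping factor
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}     # start from a uniform distribution

    for _ in range(50):                             # iterate until the interdependent values settle
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)   # contribution flowing down each link
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(rank)   # roughly, the probability that the random surfer ends up on each page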

As Matt Cutts reminded us last week, even if a web page is excluded by its owner using a robots.txt file, it may still come into consideration if it is included in a sitemap file or has a link coming in from another external web page. Having an extra indicator on each link showing whether it is influential in passing PageRank would increase the complexity of the data enormously. Given that, we propose the following Null Hypothesis (see the foot of the article for a definition of this expression). This could of course be abandoned in favor of a more complex explanation if that could be proven statistically with sufficient confidence, or if Google chose to provide a fuller explanation of what is done.

The Null Hypothesis runs as follows. The whole process splits into two phases. The first phase looks at all web pages (URLs) and all associated links to determine the PageRank of each web page and thus the contributions that would flow down each link. There are no exclusions, and this calculation handles all URLs in the total Internet space (graph). Modifiers that website owners may have applied, such as robots.txt files or tags such as noindex and nofollow, do not come into play at this stage. For all URLs and links without exception, values such as those illustrated below would be calculated.

PageRank Chart

The second phase of the process involves how these PageRank values are then used within the search algorithms. Here whatever is specified via robots.txt or nofollow tags would apply. The PageRank contribution from nofollow-ed links thus would not be included in the calculation. Also any filtering factors that Google may wish to apply for bad neighborhoods, etc. would only apply in this second phase. The underlying web pages and links would still have a first-phase PageRank calculated but this would in no way influence the second phase results.
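
As a rough illustration of how this two-phase reading could look computationally, the sketch below reuses the rank and links values from the earlier sketch. The nofollow set and the phase split are assumptions made purely to illustrate the hypothesis; this is not a description of what Google actually runs.

    # Hypothetical phase 2: nofollow-ed contributions are ignored, not redistributed.
    # 'links' and 'rank' are the phase-1 results from the earlier sketch.
    nofollow = {("A", "B")}                         # (source, target) links flagged nofollow

    damping = 0.85
    effective = {p: (1.0 - damping) / len(rank) for p in rank}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)    # same per-link share as in phase 1
        for target in outlinks:
            if (page, target) not in nofollow:          # blocked contributions simply drop out
                effective[target] += share

    print(effective)    # never exceeds the phase-1 value for any page

On this reading, nofollowing a link does not boost the remaining links; their shares are exactly what they were in the first phase.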

This two-phase approach would seem to square with what we have been hearing recently. It is offered very much as a Null Hypothesis, so if someone has an Alternative Hypothesis, we look forward to hearing it. Over to you.

Implications of this Null Hypothesis

In the meantime, if this explanation is true, some obvious considerations apply. The basic PageRank calculation is determined by the total set of URLs and links. The PageRank value for a URL is never changed by modifications that apply in the second phase. All that happens in the second phase is that some of the PageRank contributions are ignored. So the effective PageRank that a URL has in the second phase is always less than or equal to what it had in the first phase. A URL can become more prominent only because others become less prominent.
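
A toy worked example of that point, using made-up numbers and assuming the two-phase Null Hypothesis: a page has 0.10 of PageRank contribution to pass on and ten outgoing links, five of them nofollow-ed.

    # Toy numbers, assuming the two-phase null hypothesis (not confirmed by Google).
    per_link = 0.10 / 10          # phase 1: the contribution is split across all 10 links -> 0.01 each
    followed = 5 * per_link       # phase 2: the 5 followed links still receive 0.01 each
    evaporated = 5 * per_link     # the nofollow-ed share (0.05) is dropped, not redistributed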

Significant changes can only come about through influences that affect the first phase. These relate to the site architecture rather than to the interlinkages. That is, after all, what Matt Cutts was recommending.

Footnote:
See this link for an explanation of Null Hypothesis.

29 thoughts on “PageRank Calculation – Null Hypothesis”

  1. This is confusing. So what implication does this have for on-page factors? Do we use nofollow links on our site and try to “concentrate” link weight on the pages we think are appropriate?

  2. Any strategy has its point of diminishing marginal returns. Has calculating things out to this degree found that point for SEO? When do you put down the calculator and start focusing on engaging content?

  3. Site architecture rocks, because it’s where a large number of pages influences another number of your own pages.

    Individual links are individual, so they bear less weight, but they do matter too. They are included in the first phase too, right? What kind of links they are is what the proposed second phase evaluates.

    That’s why while I agree that having good site architecture is good, some of us want to perfect the other aspects, too.

  4. Interesting null hypothesis (thanks for teaching me what that term means btw, and via such a brilliant site too!).

    However, if it were true it would also mean that external nofollowed links dilute the PageRank passed on from any page, just as internal ones do. This means that if you had two equal blog posts, both brilliant, both having gained the same number/value of links and having the same amount of PR, the one with 100 nofollowed comments (indicating it sparked a big discussion) would have a LOT less PR to pass on to its followed links than the other, which has a total of 0 comments. That would be a ridiculous situation and if it were true I think it would really screw up the quality of Google’s results.

    Matt Cutts’ comments only make sense, IMO, in the following 2 scenarios:

    1. It only applies to internal links on a site, ie it is a way to stop sites abusing PR sculpting by nofollowing 90 out of 100 links to pass extra weight through the remaining 10. But you’d think G would clarify this in order for it to be taken notice of.
    2. It’s just not true – more something Google wanted to slip out to the webmaster/SEO community to discourage excessive PR sculpting.

    Thoughts?

  5. Excellent comments, jaamit. What you think is a ridiculous situation is exactly the way I am seeing it. Until I hear an alternative hypothesis on how PageRank is calculated that involves all these nofollow and robots.txt blockages, I am assuming that all links are included in the calculation without any modification. After all, if it were not so, all you would need to do is put half your web pages on one domain and half on another with all the usual linkages, and what might have been internal links would suddenly become external links.

    If this null hypothesis is true, it has all sorts of implications and particularly on comments. If the null hypothesis survives scrutiny in the next few days and is not disproved, then I will write further on comments and what to do about them.

  6. @Yura – yes, all links both internal and external are included in the first phase. But you are right this is only one aspect of ranking well. Content is still King (or Queen). 🙂

  7. I’d agree that this is starting to look more like a ‘diversionary’ leak from Google to have SEOs lay off excessive PageRank sculpting. On your point, when you look at this from the perspective of a blog post with comments and how that juice will get further diluted, it makes very little sense.

  8. I think this makes some sense. In other cases we also see Google gathering and processing information and then applying another filter on those results.

    For example, we have a client whose site suffered a Google penalty (a 90-day +50 penalty for overuse of widget links, we think). While the site was penalized, its “wonder wheel” terms also acted as if the site was no longer relevant for the terms where it was penalized. Translation: a penalty in this case consisted of Google calibrating the relevancy scores such that the site fell about 50 spots.

    sam.

  9. I’m going to test this null hypothesis. I’m new to SEO and this will be my first experiment, but I will be using free hosting providers to do so since I really don’t have a budget. If I just have a series of subdomains from a free host, could I still get the same link juice flow as I would when having separate domains? I will have great unique and RELEVANT content on each site with good HTML; I hear that WordPress has perfected the way they structure their site, so I will be analyzing their HTML to replicate a similar structure in my sites. I will also have only one or two outgoing links from each site, but the sites will have varying amounts of incoming links. One question that I have for you guys is: does it matter how the sites were indexed? If I had a single site indexed and Googlebot went through my outgoing links to index my other sites, will it be any different than having those outgoing links be indexed from another domain? Also, will it affect my rankings in a bad way if I just have a static page with good content without having it updated frequently, since it does take a while for all sites to get indexed? Believe it or not, I will be using some clever techniques for indexing all my sites very fast.

  10. How’s this for simple:

    The PageRank algorithm handles internal nofollow links differently than external nofollow links.

  11. The algorithm is the second phase so I would agree with you, Sean. My question is whether there is a first phase where they calculate a base value of PageRank without constraints (nofollow, robots.txt files, bad neighbourhoods, etc.)

  12. I agree that your question is a good one. I’m curious as to how we can test it.

  13. That’s the beauty of the Null Hypothesis, Sean. I don’t need to test it. 🙂

    It is the simplest explanation that fits what Matt Cutts said last week in Seattle. So I’m going to stick with that explanation until someone proves to me that it doesn’t fit the facts. Either Google could make a statement that the real explanation is X, or someone else could say it’s Y and have some data that shows my Null Hypothesis does not fit their data. That’s the beauty of the method of Occam, Einstein and Sir Ronald Fisher.

  14. H0 is indeed a wonderful construct. I am curious as to what data could be used to support a Ha and reject your H0. I have a hard time envisioning a test that could be devised to this end.

  15. Hi Barry,

    In parts this post was very good, but my objection is your take on the “Null-hypothesis” (H0), especially when you say in your comments that it doesn’t need to be tested, as that is actually the main point of it. Your use of it in this post makes no sense. Einstein and co. had a pretty good idea of it, but those quotes are not used in their correct context and do not refer to the same thing.

    The null hypothesis is supposed to be tested against your presumption; in fact, you have to presume that the opposite is true. To establish whether you are right, you do have to test it, and this is a statistical measure as well. Proving your idea is the “alternate hypothesis” (H1). To refute your idea you are supposed to “nullify” your own.

    “In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (The coining of the phrase.) The Design of Experiments, Edinburgh: Oliver and Boyd, 1935, p.18.

    H0 does not mean that it is true anyway because it’s a “hypothesis” and rejecting the null does not prove H1.

    “A scientific theory should be as simple as possible, but no simpler.” – this refers to a “theory” which is quite a different thing to a “hypothesis” (Occam too). Also the Einstein quote is supposed to be slightly amusing, because very few things are simple in science. Yes keep it “simple” but the level of complexity is always intense so that is very relative to what you are doing. It doesn’t mean KISS by any means.

    Your link is very good, and does actually explain what I have above. You confuse “theory” with “hypothesis” which is the issue I think. Hypotheses are supposed to be tested through rigorous experiments and repeated ones as well.

    Dr Carl Sagan said “There are many hypotheses in science which are wrong. That’s perfectly all right; they’re the aperture to finding out what’s right. Science is a self-correcting process. To be accepted, new ideas must survive the most rigorous standards of evidence and scrutiny.”

    If everyone finds your hypothesis to be correct, it becomes a “Theory” but it must be flexible enough to be modified if new data or evidence is introduced. If that is repeatedly true then it becomes law.

    I wrote about this here: http://www.scienceforseo.com/tutorials/whats-the-scientific-method/

    “Here we offer a Null Hypothesis, a simple explanation, that people may wish to consider. This should only be replaced by a more complex view, when that can be proven to be better.”

    You haven’t offered a “Null-hypothesis” because you haven’t tested it in any significant way, which it is supposed to be. You haven’t “nullified” your own hypothesis. Obviously it should be kept as simple as possible, but not to this extent.

    You have readers and SEOs who listen to you, as all the comments prove, and rightly so I may add, but in this instance you are misleading them. Science, if you wish to be involved in it, includes peer review, and it is always direct but respectful, so I hope you appreciate my comments 🙂

  16. Excellent contribution to the discussion, cj. I’m not sure we’re disagreeing.

    The problem here is that we are not trying to determine some natural law but rather we are trying to infer how Google may be applying the body of knowledge that they have revealed ‘through a glass darkly’ by a whole collection of patents and papers.

    As you rightly say, you never prove the Null Hypothesis but rather you disprove it when the evidence shows the Alternative Hypothesis fits the ‘facts’ better than the Null Hypothesis. The facts in this case are some assertions that Matt Cutts made about how nofollow now applies. If someone has a better Null Hypothesis as a starting point I might well accept that as the Null. However, I can see how Google could apply my version of the Null Hypothesis (a two-phase approach) from a computational point of view. So I’m comfortable starting from there.

    Of course the best way for the argument to be resolved would be for Google to refute the Null Hypothesis. Perhaps that will be the outcome.

  17. Yes, agreed to an extent. Usually it’s up to you to nullify it before bringing it to the community, though. Google owes us nothing: not an explanation, not a dime, not the time of day. PageRank is a scientific thing, so of course there are papers, and these were presented to the scientific community at those conferences. Patents are public. PageRank was shown to be flawed early on, and there have been significant moves forward by the scientific community since, let alone by Google.

    I do think that the PageRank that you know of is not the PageRank method in place right now. The weightings and other such sensitive information have never been released, which is what would give you your answers.

    Also, PageRank is really just one cog in the machine; there are a lot of very cool things going on with context analysis, which is where PageRank obviously fails, as it can’t take that crucial information into account.

    Look forward to reading more from you.

  18. cj,

    Thanks for your contribution to this post. Your commentary is very helpful. Conceptually, I believe it is what I was trying to get at but was not clearly expressing.

    A hypothesis, null or otherwise, needs to be testable. If there is no clear experiment that can be conducted that would either support or refute the hypothesis, then it is a tautology and not all that useful in the scientific world.

    I’m still eager to hear what such a hypothetical study would look like.

  19. Hey Sean,

    I touched on it in the post I listed above. A PhD thesis is a good example of hypothesis testing. You’re supposed to pick a problem, research it, decide on a course of action and test, test, test, test to see if you’re right, and it goes to peer review several times. Sometimes you end up with a theory. Sometimes not. There is always an experiment that you can do; you have to be a bit creative in this case, but I’m sure you could come up with something. You always work with what you have, and sometimes it’s not much.

    These resources might be useful:

    http://go.hrw.com/resources/go_sc/ssp/HK1BSW11.PDF

    http://stattrek.com/AP-Statistics-4/Hypothesis-Testing.aspx?Tutorial=Stat

    http://www2.uta.edu/infosys/baker/STATISTICS/Keller7/Keller%20PP%20slides-7/Chapter11.ppt

    Examples here:

    http://www.math.uah.edu/STAT/hypothesis/index.xhtml

  20. Pingback: http://bookmark.giorgiotave.it
  21. Pingback: PageRank сметки
  22. For anyone who hasn’t already seen it, Matt Cutts has finally clarified Google’s official position on PR Sculpting: http://www.mattcutts.com/blog/pagerank-sculpting/.

    Essentially, he confirms that for the past year Google has indeed been letting PR evaporate via nofollow links. Additionally he confirms that it is the case for external links as well as internal, so your hypothesis above would seem to be the case (on a simplified level).

    This has some rather serious implications, not on internal PageRank sculpting (which is ultimately a minor tweak that only SEOs will really worry about), but on external nofollow linking, particularly on UGC like blog comments. A post with lots of comments containing nofollowed links will have a far more diluted PR to ‘give’ its followed links than a post with no comments at all. Expect to see bloggers switching off comments, or at least switching off links within comments, once this becomes common knowledge.

    Make sure you also read Rand’s interpretation of all this over at SEOmoz : http://www.seomoz.org/blog/google-says-yes-you-can-still-sculpt-pagerank-no-you-cant-do-it-with-nofollow

  23. Thanks for those links, jaamit. As I read all that discussion, it looks as though the simple picture suggested by the Null Hypothesis may be an easy way of understanding the implications of all this.

  24. The SEO community once again finds one of its dearly held myths (sculpting PageRank) blown apart by the realization that all their tests and rationalizations have been invalid for a long time.

    People need to stop fussing over how to manage PageRank by cutting off the flow. PageRank hoarding never worked anyway, and PageRank Sculpting has just proven to be PageRank Hoarding by another name.

    A pile of poop smells just as bad by any other name.

  25. I’ll agree with Michael here. Also I’ll add that if things were tested properly, the flaws in those ideas would be evident early on.

  26. You add a touch of realism to all this, Michael. That is also reflected in an ongoing discussion on this topic at the Cre8asite Forums. It is titled Bubble And Burst, The tragically funny rel=nofollow fiasco and was started by iamlost, one of the Moderators there. It is worth a visit.

    On your thought, cj, re testing: my first career was as a mathematical statistician, so perhaps I can state with some conviction that it is extremely difficult to try to reverse-engineer a system as complex as the Google search process by doing tests. Simple things about the current state of affairs can easily be tested, for example, how many characters Google currently seems to index in the Title.

    For anything more complex, there are so many factors and so much noise in the system, that it is difficult to do comparable tests. Just remember the ongoing tweaking of algorithms by the Google engineers and the fact that you may get results from any one of the data centers, which may or may not be synchronized with the one you hit the last time. In addition there may be detection of your user-agent that may mean that you get different results from someone else.

    In short, in trying to spot subtle and complex effects like changing nofollow tags on blogs, which themselves have complex structures, you are facing insuperable challenges IMHO.

  27. I’m not talking about reverse engineering anything, but stating things like “Oh Google does this” requires a little more than having seen it on 2 different sites for example (not that you did!).

    I think that the most effective thing anyone in SEO can do is know what’s happening in that area of research, understand it, and then see if you can find evidence of it. You might, you might not.

    How does Google PageRank work? Not PageRank, Google PageRank? PageRank is a pretty common algorithm now and all sorts of systems use it and its variants. You can’t know what makes up Google PageRank let alone how it works. You can be aware of how the method has developed through the years though.

    My feeling is that sometimes SEO looks way too low level for its own good. I agree wholeheartedly with your last statement.

  28. I think too many people obsess about PageRank; they should focus on ranking well on important terms rather than on the arbitrary number in the grey bar.

  29. To be fair, this discussion is talking about ‘actual’ PageRank rather than Toolbar PageRank, which as you say can be rather arbitrary and not necessarily related to good rankings. ‘Actual’ PR is still crucial to the side of Google’s algorithm that measures the relative importance of web pages (the other side being relevance). Although PageRank has undoubtedly evolved since the initial PageRank paper, the principle that each page has a certain amount of PageRank to distribute amongst its links still holds true, and therefore this discussion is still useful.

    On the subject of Toolbar PR, has anyone else noticed that while Google only used to update it every few months (one of its main flaws), in the past few months they seem to be regularly updating it, which IMO makes it a far more accurate indicator of importance than previously…
