Search Engine Optimization (SEO) for blogs is often done ineffectively, so posts rank lower than they should in keyword searches. One particular problem is hanging/dangling web pages created by the blogging software, coupled with inappropriate use of robots.txt files and tags. Such hanging web pages can act as sinks or black holes for PageRank, a key factor in the Google search algorithm. This article provides a simple explanation of the issues involved and appropriate solutions.
"You are creating hanging/dangling pages", wrote Andy Beard in a recent comment on a post on Avoiding WordPress Duplicate Content. After an e-mail exchange, I could understand his concern. It is a potential problem that robots.txt files could create. As Andy wrote some time back, it is one of the SEO Linking Gotchas Even The Pros Make.
More recently, Rand Fishkin has pointed out that you should not Accidentally Block Link Juice with Robots.txt. Rand advised doing the following:
- Conserve link juice by using nofollow when linking to a URL that is robots.txt disallowed
- If you know that disallowed pages have acquired link juice (particularly from external links), consider using meta noindex, follow instead so they can pass their link juice on to places on your site that need it.
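To apply the first piece of advice you need to know which of your link targets robots.txt actually blocks. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the example.com URLs and the disallowed_targets helper are assumptions made up for illustration, not taken from any of the sites mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a WordPress-style blog (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/
"""


def disallowed_targets(robots_txt, link_targets, user_agent="Googlebot"):
    """Return the link targets that robots_txt blocks for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in link_targets if not parser.can_fetch(user_agent, url)]


# Hypothetical internal links found on a blog page.
links = [
    "https://example.com/2009/06/some-post/",
    "https://example.com/tag/seo/",
    "https://example.com/wp-admin/",
]

blocked = disallowed_targets(ROBOTS_TXT, links)
print(blocked)
```

Links that show up in the blocked list are the candidates for nofollow, or for switching the target page from a robots.txt disallow to meta noindex, follow.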
Link juice is just another term for PageRank. The PageRank value of a web page is an important element in how well it will rank in any keyword search. It may be one of over 100 factors but it probably is the most important in the Google keyword search process. Preserving the PageRank that a web page could amass is an important task for any SEO.
After doing some research, I found this to be a somewhat more complex issue, one that requires an understanding of some weighty articles. Anyone involved in doing SEO or hiring an SEO consultant should be aware of the potential problem to ensure things are done correctly. I also realized that there was no simple explanation of the issues, so this post will attempt to rectify that omission.
Research on Hanging / Dangling Web Pages
If you want to do some of your own research, before checking out the later explanations, I found the following useful:
- Dangling Pages – WebProWorld SEO Forum
- What Do SEO/SEM People Put In Robots.txt Files? – Shaun Anderson
- WordPress robots.txt SEO – AskApache Web Development
- Internal Linking – META nofollow, rel nofollow, robots.txt Confusion thereon – Josh Spaulding
Of course, search engine algorithms are always evolving. The official word on the Google website gives the following information on rel="nofollow".
How does Google handle nofollowed links?
We don’t follow them. This means that Google does not transfer PageRank or anchor text across these links. Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap. Also, it’s important to note that other search engines may handle nofollow in slightly different ways.
That led to the practice of PageRank sculpting, whereby people try to manage how PageRank is distributed among the web pages in a website. More recently Matt Cutts of Google, in a Q&A session at SMX Advanced 2009 in Seattle, WA, provided the current thinking on nofollow as recorded by Lisa Barone:
Q: It seems like you supported PageRank sculpting a year ago and now it seems like you don’t support it anymore. Why is that and will it become a negative indicator?
A: No, it won’t hurt your site. You can do your links however you want. You can use it to eliminate links to sign in forms and whatnot, but it is a better use of your time to fix your site architecture and fix the problem from the core. Suppose you have 10 links and 5 of them are nofollowed. There is this assumption that the other 5 links get ALL that PageRank and that may not be as true anymore (your leftover PageRank will now “evaporate”, says Matt.). You can’t shunt your PageRank where you want it to go. It’s not a penalty. It’s not going to get you in trouble. However, it’s not as effective. It’s a better use of your time to go make new content and do all the other things. If you’re using nofollow to change how PageRank flows, it’s like a band-aid. It’s better to build your site how you want PageRank to flow from the beginning.
Let us now try to pull all that together in a few simple explanations covering the important issues involved.
How PageRank is calculated
Google is not always completely open on what is involved in its search algorithms for obvious reasons. The algorithms also evolve as the Q&A quote above shows. The following is a best judgment on what is involved, but if anyone has corrections or modifications to what is shown, they are encouraged to add a comment.
The following diagram illustrates how PageRank is calculated for any web page and how fractions of the PageRank flow to and from linked web pages. PageRank here is not the value that appears in the ‘thermometer’ in the Google Toolbar, and which goes from 0 to 10. Instead this PageRank is the mathematical value used in the Google keyword search algorithm. It is calculated for any web page and represents the probability that a random visitor would visit the given web page as opposed to visiting other web pages.
Here we have multiplied this mathematical value by a huge multiplier to give values that are easier to talk about. We will use the term, PageRank factor, for this derived number. The resulting number would normally be a value like 5.6 or 16.2 but here we have simplified yet again to round off to whole numbers. This illustrates a typical web page (but with very few links). Some links are external links involving other web pages on other websites (domains). Some are internal links from web pages on the same website (domain). The inlinks are hyperlinks on other web pages leading to this web page. The outlinks are hyperlinks on the given web page to other web pages.
What the image illustrates is that the PageRank factor of this web page (16) is determined by the sum of the PageRank factor contributions flowing through the inlinks. This PageRank factor then flows out via the 4 outlinks with an equal PageRank factor contribution (4) on each link.
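The arithmetic can be written out in a few lines. This is only a sketch of the simplified model described here: the individual inlink contributions are hypothetical, since the text states only their total (16), and the function names are invented for illustration.

```python
# Hypothetical inlink contributions. Only their total (16) is stated in the text.
inlink_contributions = [8, 5, 3]
outlink_count = 4


def page_factor(inlinks):
    """PageRank factor of a page: the sum of what flows in through its inlinks."""
    return sum(inlinks)


def outlink_share(factor, outlinks):
    """Equal share of the page's PageRank factor carried by each outlink."""
    return factor / outlinks if outlinks else 0.0


factor = page_factor(inlink_contributions)
share = outlink_share(factor, outlink_count)
print(factor, share)  # 16 4.0
```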
You can imagine this particular web page as being only one among the whole set of web pages on the Internet. For the technically inclined, we should mention that these PageRank values are all interdependent, so they are developed by a process of iteration: begin with initial values and recalculate repeatedly until the values settle. The details go beyond the scope of this article.
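For readers who want to see the iteration, here is a minimal sketch on a hypothetical three-page site. It follows the commonly published form of the PageRank calculation and is not Google's production algorithm. The damping value of 0.85, the page names and the rule of spreading a linkless page's rank evenly over all pages are assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # initial values
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each outlink carries an equal share of the page's rank.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page without outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page site.
site = {
    "home": ["post", "about"],
    "post": ["home"],
    "about": ["home", "post"],
}
ranks = pagerank(site)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The values sum to 1, which matches the description of PageRank as the probability that a random visitor is on a given page.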
How a robots.txt file changes the picture
If a robots.txt file disallows this web page for crawl visits by the search engine spiders, then provided they obey the robots.txt file, they would record the values and links shown in this image. These PageRank values are the same, whether or not the web page is blocked to crawlers by the robots.txt file. The record is indexed because there is an external inlink that the Google robots are crawling and they would also note the outlink going to another domain. The outlinks to other web pages on the same domain (internal links) would not be recorded so these PageRank contributions are lost. In this sense the web page has become a sink or black hole for these PageRank contributions. They can no longer contribute to the PageRank of these other web pages.
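The loss can be put into numbers with the same simplified model. In this sketch the split of the four outlinks into one external and three internal links is an assumption, as are the URLs. It illustrates the explanation given here rather than confirmed search engine behavior.

```python
def flow_when_blocked(page_factor, outlinks):
    """Model a robots.txt-blocked page under this article's explanation.

    outlinks is a list of (target, is_internal) pairs. Each link keeps its
    equal share, but only shares on external links are passed on. Shares on
    internal links are lost.
    """
    share = page_factor / len(outlinks)
    passed = {target: share for target, is_internal in outlinks if not is_internal}
    lost = share * sum(1 for _, is_internal in outlinks if is_internal)
    return passed, lost


# Hypothetical outlinks: one external, three internal.
outlinks = [
    ("https://other-site.example/page", False),
    ("/category/seo/", True),
    ("/about/", True),
    ("/2009/06/some-post/", True),
]

passed, lost = flow_when_blocked(16, outlinks)
print(passed)  # {'https://other-site.example/page': 4.0}
print(lost)    # 12.0
```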
Note that the PageRank factor values on the remaining links are the same as they were when the other links were being included. Merely saying the links should not be crawled does not necessarily mean they should be assumed not to exist. This is in line with Matt Cutts's most recent pronouncements.
How nofollow changes the calculation
Even if this web page were not excluded by a robots.txt file, a similar effect is created if all outlinks from the web page carry the attribute rel="nofollow". Again this assumes that the search engine correctly observes this attribute. If on the other hand the links carry no nofollow attribute (links are followed by default), then the PageRank contribution would flow through to all such links.
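The difference between the older sculpting assumption and the "evaporation" described in the Q&A quoted earlier can be sketched as follows. This is one reading of that answer, not a documented formula. The function name, the evaporates flag and the page factor of 10 are hypothetical.

```python
def share_per_followed_link(page_factor, total_links, nofollowed, evaporates=True):
    """PageRank factor carried by each followed link under two assumptions.

    evaporates=True:  the factor is divided across all links and the shares
                      of the nofollowed links are simply lost.
    evaporates=False: the older sculpting assumption, in which the followed
                      links divide the whole factor among themselves.
    """
    followed = total_links - nofollowed
    if followed <= 0:
        return 0.0
    divisor = total_links if evaporates else followed
    return page_factor / divisor


# The example from the Q&A: 10 links, 5 of them nofollowed.
new_share = share_per_followed_link(10, 10, 5, evaporates=True)
old_share = share_per_followed_link(10, 10, 5, evaporates=False)
print(new_share, old_share)  # 1.0 2.0
```

Under the evaporation reading, half of this page's factor reaches no page at all, which is why nofollow is a poor tool for steering PageRank.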
How to get only one web page that counts for any specific content
As Rand Fishkin suggested above, if more than one web page contains the same content, you can use a meta tag on all the secondary ones to signal noindex. Then only the primary web page is in the search database, provided the meta tags are being observed. Coupling this with a follow directive in the same meta tag then ensures that the PageRank contributions still flow out to the other web pages.
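To check that a secondary page really carries the intended tag, a short script can read the robots meta tag from its HTML. The sketch below uses Python's standard html.parser module. The sample page and the RobotsMetaParser and robots_directives names are invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical secondary page, for example a tag archive.
SECONDARY_PAGE = """<html><head>
<title>Tag archive: SEO</title>
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>"""

directives = robots_directives(SECONDARY_PAGE)
print(sorted(directives))  # ['follow', 'noindex']
```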
An alternative is the rel=canonical link element, which Google introduced in these words:
We now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that’s accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.
Apparently Google treats this as a hint rather than a directive, so it is not fool-proof. Others see Reasons to use rel=canonical, and reasons not to.
As Matt Cutts recommended, given the wooliness in some of the above, the preferred approach is to develop the website architecture so that duplicate web pages do not arise. Then one does not have to rely on the canonical tag or the noindex follow combination. In this way one avoids the hanging / dangling web pages problem.
The exact methods will depend on the architecture. One very useful approach is to show only an initial excerpt on the blog Home Page with a … more link to the full post as a single web page. For category or tag archive pages, you can show only the titles of items so this again avoids the duplicate content problem. The important thing is to be vigilant and look out for essentially duplicate web pages as revealed by a full website scan using the equivalent of a search engine robot such as Xenu.
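Once the text of each page has been collected by a site scan, spotting exact duplicates is straightforward. The following sketch groups URLs whose text is identical after collapsing whitespace and ignoring case. The page data and function names are hypothetical, the crawling step itself is not shown, and near-duplicates that differ by more than spacing or case would need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text):
    """Hash page text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose normalized text is identical.

    pages maps a URL to the text found at that URL.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical scan results: the category archive repeats the full post.
pages = {
    "/2009/06/dangling-pages/": "Hanging pages can act as sinks for PageRank.",
    "/category/seo/": "Hanging pages can act as  sinks for\nPageRank.",
    "/about/": "About this blog.",
}

duplicates = find_duplicates(pages)
print(duplicates)  # [['/2009/06/dangling-pages/', '/category/seo/']]
```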