4 Search Engine Ranking Factors
Google engineers update their algorithms daily. They then run many tests to check they have the right balance between all these factors. The following is from an interview with Google’s Udi Manber:
Q: How do you determine that a change actually improves a set of results?
A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter. We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn’t find something or the pages were slow or we didn’t get some spell correction.
I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query. You can view it and try changing things yourself at the Poodle Search Engine Emulation.
On Page Factors
Keywords
Repetitions of the query words in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words to each other is important, particularly having the exact query in the document. A very large number of repetitions, particularly in ungrammatical sentences, can be a negative signal of spam. The presence of the query words in the domain and URL is a useful signal of relevance. Phrases related to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name="keywords" content="my, keywords">, is largely ignored by modern search engines.
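As a rough illustration of the keyword signals described above, the following sketch counts query terms in a document, weights title matches more heavily and adds a bonus for the exact phrase. The function name `keyword_score` and both weights are assumptions made for this example; real search engines use far more elaborate formulas that are not public.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, title, body, title_weight=3.0, phrase_bonus=2.0):
    """Toy relevance score; the weights are illustrative assumptions."""
    terms = tokenize(query)
    title_words = tokenize(title)
    body_words = tokenize(body)

    score = 0.0
    for term in terms:
        # A match in the title counts for more than a match in the body.
        score += title_weight * title_words.count(term)
        score += body_words.count(term)

    # Reward documents that contain the exact query phrase.
    phrase = " ".join(terms)
    if phrase and f" {phrase} " in f" {' '.join(body_words)} ":
        score += phrase_bonus
    return score


doc_a = keyword_score("london plumbers", "London Plumbers Ltd",
                      "We are london plumbers serving all of London.")
doc_b = keyword_score("london plumbers", "Gardening tips",
                      "Plumbers are rarely needed in a garden.")
print(doc_a, doc_b)  # 11.0 1.0
```

A score like this would only be one input among many; it deliberately ignores the spam penalty for excessive repetition mentioned above.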
Quality
A number of different authors on a website, good grammar, spelling and long pages written at reasonable time intervals are positive signs of high quality content.
Geographical Locality
Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as “plumbers in london”.
Freshness
For time dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google’s “Quality Deserves Freshness” drive, of which Google’s faster indexing Caffeine update was a part.
Duplicate Content
Large percentages of content duplicated either from the same site or from others are an indicator of poor quality content, and users will only want to see the canonical copy.
Adverts
A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO manipulated websites.
Outbound Links
Links to spammy or phishing websites, or an unusually large number of outbound links on a number of pages, are common indicators of a page that users will not want to visit (see “Improving Web Spam Classifiers Using Link Structure” for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links).
Spam
An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky JavaScript redirects are relatively easy to detect and punish.
Off Page Factors
Site Reliability
Unreliable or slow sites provide a poor user experience, and so will have a penalty applied. You can be warned when this happens if you sign up for Google Webmaster Tools.
Popularity of the Site
Popularity can be estimated from aggregated ISP data that search engines buy and from search traffic. For example, see Compete.com and Google Trends.
Incoming Links / PageRank
The link structure of the internet is a useful pointer to a website’s popularity. Anchor text on incoming links that is related to the query shows a search engine that the page is related to the query. Links that remain for a long time, from sites that have many links pointing to themselves, are rated highly. Links that are in boilerplate areas or sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites (“bad neighborhoods”), thought to be paid for with the intention of manipulating rankings, or spam can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or because the sites are hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is expected, as opposed to bursty starts and stops that indicate link building campaigns.
Other indirect signals of a website’s popularity
Other data can include mentions in chats, emails and social networks.
Links from trusted websites
Proximity on the web graph to important, trusted sites matters: links from old, high PageRank websites at the centre of the old, heavily interconnected internet are useful signals that a website can be trusted and is important. See the Touchgraph Network Visualizer and type in http://www.nasa.gov for a visual graph.
Links from other sites that rank for the query
Results may be reordered based on how they link to each other.
Geographical Location
If the geographical location of the server, the location of the website according to directories, the top level domain or the location set in Google Webmaster Tools matches that of the user, it is a signal that the page will be more relevant to the user, particularly for location sensitive searches.
User Click Data
If users often search again after clicking on a site’s result, that is an indicator that the page is not a good match for the query. The personal history of results clicked and the pattern of related searches may help indicate what a user is looking for.
Domain Information
Older domains are likely trusted more. Google is a domain registrar so has extensive Whois information, and validates that the address information associated with domains is correct.
Manual Reviews
Google Quality Raters manually review websites and tag them with categories such as “essential to query”, “not relevant to query” and “spam”.
Google PageRank Notes
Google’s PageRank was the innovation that propelled Google to the top of the search engine pile. Whilst its implementation has changed much since its original description, and many other factors are now taken into account, it is still at the heart of modern search engines so some extra notes will be made on it here.
Short Description
The key point is that PageRank considers each link a vote, and links from pages which have many links themselves are considered more important. Or as Google puts it:
“PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.”
Mathematical Description
It’s not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section, and read a slightly less mathematical description here, here or the Wikipedia article. For a more complete treatment of the mathematics see the original PageRank paper, “Deeper Inside PageRank” by Amy N. Langville and Carl D. Meyer, and this thesis. The following is summarised from “Sketching Landscapes of Page Farms” by Bin Zhou and Jian Pei:
The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages, and E is the set of hyperlinks. A link from page p to page q is denoted by edge p -> q. An edge p -> q can also be written as a tuple (p, q).
PageRank measures the importance of a page p by considering how collectively other Web pages point to p directly or indirectly. Formally, for a Web page p, the PageRank score is defined as
$$PR(p) = d \sum_{p_i \in M(p)} \frac{PR(p_i)}{OutDeg(p_i)} + \frac{1 - d}{N}$$
where M(p) = { q | q -> p ∈ E } is the set of pages having a hyperlink pointing to p, OutDeg(p_i) is the out-degree of p_i (i.e., the number of hyperlinks from p_i pointing to some pages other than p_i), N = |V| is the total number of pages, and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web. If a damping factor of 0.5 is used then at each page there is a 50/50 chance of the surfer clicking a link or jumping to a random page on the internet. Without the damping factor, score would drain into pages or groups of pages with no links leading back out (rank sinks), and the PageRank of pages outside them would tend towards 0.
To calculate the PageRank scores for all pages in a graph, one can assign a random PageRank score value to each node in the graph, then apply the above equation iteratively until the PageRank scores in the graph converge.
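The iterative calculation can be sketched in a few lines of Python. This is a minimal illustration rather than how any search engine computes PageRank: the function name `pagerank`, the toy graph and the convergence tolerance are assumptions, the starting scores are uniform rather than random so the run is repeatable, and pages with no outlinks are treated as linking to every page (one common convention). Link lists are assumed to contain no self-links or duplicates.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(p) = d * sum(PR(q) / OutDeg(q)) + (1 - d) / N until stable.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Any starting values work; a uniform start keeps the example repeatable.
    scores = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Assumption: a page with no outlinks spreads its score over all
        # pages, as if the surfer jumped to a random page.
        dangling = sum(scores[p] for p in pages if not links.get(p))
        new_scores = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                new_scores[target] += d * scores[page] / len(targets)
        change = sum(abs(new_scores[p] - scores[p]) for p in pages)
        scores = new_scores
        if change < tol:
            break
    return scores


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph page c collects links from every other page and ends up with the highest score, while d, which nobody links to, keeps only the (1 - d) / N baseline.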
The PageRank shown in the Google Toolbar is on a logarithmic scale out of 10; it is not the actual internal value. For example:
| Domain | Calculated PageRank | PageRank displayed in Toolbar |
| small.com | 47 | 2 |
| medium1.com | 54093 | 5 |
| medium2.com | 84063 | 5 |
| big.com | 1234567 | 7 |
| big2.com | 2364854 | 7 |
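To make the logarithmic scale concrete, here is a small sketch that maps an internal score to a toolbar value. The base of the logarithm has never been published; base 8 and simple rounding are assumed here only because they happen to reproduce the illustrative figures in the table above, and the function name `toolbar_pagerank` is hypothetical.

```python
import math


def toolbar_pagerank(internal_score, base=8.0):
    """Map an internal score to a 0-10 value; the base is a guess."""
    if internal_score < 1:
        return 0
    return min(10, round(math.log(internal_score, base)))


examples = [("small.com", 47), ("medium1.com", 54093),
            ("medium2.com", 84063), ("big.com", 1234567),
            ("big2.com", 2364854)]
for domain, score in examples:
    print(domain, score, toolbar_pagerank(score))
```

The practical consequence is that moving up one toolbar point requires several times more internal PageRank than the previous point did.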
Interesting Notes on the Original Implementation of PageRank
From “PageRank Uncovered”, essential reading for those looking to understand PageRank from an SEO perspective:
•PageRank is a multiplier, applied after relevant results are found
“Remember, PageRank alone cannot get you high rankings. We’ve mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worth while. If you perform any broad search on Google, it will appear as if you’ve found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so, explains why you should always concentrate on “on the page” factors and anchor text first, and PageRank last.”
• Each page is born with a small amount of PageRank
A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index – the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with).
Note that Google’s original algorithm has most likely been amended since to detect and reduce page rank hoarding, and generating PageRank by massive interlinking on auto generated pages. Also for quicker calculations an approximation of PageRank which only gives certain seed pages PageRank may be used.
Interestingly, however, there are examples of this working, see “How to get billions of pages indexed in Google”. In a related issue, at one point 10% of MSN Search’s (now known as Bing) German index was computer generated content on a single domain.
Optimal Linking Strategies
Deciding how to interlink pages that you own or have influence over is tricky; interlinking can be a good signal that pages are related and on a certain topic, build PageRank and control PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast, so here is a quick overview of the literature to date:
• Note from “Web Spam Taxonomy”
Though written about Spam farms, the math holds true for good commercial sites too. Essentially this states that maximum page rank for a target page is achieved by linking only to the target page from forums, blogs etc. then interlinking the network of sites owned (as if there are no outlinks on a page the “random surfer” will jump to a random page on the Internet).
1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.
3. Own pages are maintained by the spammer, who thus has full control over their contents.
We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:
1. All available n own pages are part of the spam farm, maximizing the static score PRstatic (S).
2. All m accessible pages point to the spam farm, maximizing the incoming score PRin (S).
3. Links pointing outside the spam farm are suppressed, making PRout (S) equal to zero.
4. All pages within the farm have some outgoing links, rendering a zero PRsink (S) score component.
Within the spam farm, the score of page t is maximal because:
1. All accessible and own pages point directly to the target, maximizing its incoming score PRin (t).
2. The target points to all other own pages. Without such links, t would had lost a significant part of its score (PRsink (t) > 0), and the own pages would had been unreachable from outside the spam farm. Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm.
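The effect of the target linking back to the own pages can be checked numerically on a toy graph. The sketch below is an illustration under stated assumptions, not the paper's own experiment: the helper names `pagerank` and `build_web`, the ring of 20 "other" pages, the 5 own pages and the single accessible page are all made up, and a page with no outlinks is treated as jumping to a random page.

```python
def pagerank(links, d=0.85, iterations=100):
    """Basic iterative PageRank; dangling pages jump to a random page."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    scores = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        dangling = sum(scores[p] for p in pages if not links.get(p))
        new = dict.fromkeys(pages, (1.0 - d) / n + d * dangling / n)
        for page in pages:
            for target in links.get(page, []):
                new[target] += d * scores[page] / len(links[page])
        scores = new
    return scores


def build_web(target_links_back, n_own=5, n_other=20):
    """Hypothetical toy web: a ring of other pages plus a small farm."""
    others = [f"other{i}" for i in range(n_other)]
    own = [f"own{i}" for i in range(n_own)]
    links = {p: [others[(i + 1) % n_other]] for i, p in enumerate(others)}
    # One "accessible" page (e.g. a blog comment) also links to the target.
    links["other0"].append("target")
    for page in own:
        links[page] = ["target"]
    # Either the target links to every own page, or it has no outlinks.
    links["target"] = list(own) if target_links_back else []
    return links


with_back_links = pagerank(build_web(True))["target"]
without_back_links = pagerank(build_web(False))["target"]
print(round(with_back_links, 4), round(without_back_links, 4))
```

In this setup the target scores higher when it links back to its own pages, because score that would otherwise leak to random pages is recycled inside the farm.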
• From “Link Spam Alliances”
The analysis that we have presented show how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:
• All boosting pages point to and only to the target.
• All hijacked links point to the target.
• There are some links from the target to one or more boosting pages.
• From “Maximizing PageRank via Outlinks”
In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure with a forward chain and every possible backward link may be not intuitive. At our knowledge, it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliance between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the webgraph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping they give to the website, rather than their PageRank.
• From “The effect of New Links on PageRank” by Xie
Theorem: The optimal linking strategy for a Web page is to have only one outgoing link pointing to a Web page with a shortest mean first passage time back to the original page.
Conclusions: …. We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time.
Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.
Implementation to make computing PageRank faster
There have been a number of proposed improvements to the original PageRank algorithm to improve the speed of calculation, and to adapt it to be better at determining quality results. No search engine calculates PageRank as shown in the naive algorithm in the original paper. For example, see “Computing PageRank using Power Extrapolation”, “Efficient PageRank Approximation via Graph Aggregation”, and Matt Cutts’s brief notes on PageRank implementation.
HITS
HITS is another ranking algorithm that takes into account the pattern of links found throughout the web, and it was released just before PageRank in 1999. HITS treats some pages on the web as authorities, which are good documents on a topic, and hubs, which mostly link to authorities.
A page is given a high authority score by being linked to by pages that are recognized as Hubs for information. A page is given a high hub score by linking to nodes that are considered to be authorities on the subject.
Unlike PageRank, which is query independent and so computed at indexing time, HITS hub and authority scores are query dependent and so computed (though likely cached) at query time.
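The mutual reinforcement between hubs and authorities can be sketched as a short iteration. This is a minimal illustration assuming a small, made-up set of pages already retrieved for one query; the function name `hits`, the page names and the fixed iteration count are assumptions.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict of page -> linked pages."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hubs = dict.fromkeys(pages, 1.0)
    auths = dict.fromkeys(pages, 1.0)

    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auths = dict.fromkeys(pages, 0.0)
        for page, targets in links.items():
            for target in targets:
                auths[target] += hubs[page]
        # Hub: sum of the authority scores of the pages linked to.
        hubs = {p: sum(auths[t] for t in links.get(p, [])) for p in pages}

        # Normalise so the scores stay bounded between iterations.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths


query_subgraph = {
    "directory": ["guide", "manual", "faq"],
    "blogroll": ["guide", "manual"],
    "guide": [],
    "manual": [],
    "faq": ["guide"],
}
hub_scores, authority_scores = hits(query_subgraph)
print(max(hub_scores, key=hub_scores.get))              # directory
print(max(authority_scores, key=authority_scores.get))  # guide
```

Because the input graph depends on the query, the scores have to be recomputed (or cached) per query, which is the cost the paragraph above refers to.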
Is linking out a good thing?
Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers, so it is likely that linking out to high quality authorities can positively influence either a page’s ranking (though potentially negatively, if designers want authorities rather than hubs to appear in their results; see here and “Deeper Inside PageRank”) or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank “flowing out” (many webmasters also nofollow links for similar reasons; note that this form of PageRank sculpting no longer works according to Matt Cutts, Google’s head of [anti]web spam).
Matt Cutts also said a number of years ago:
“Of course, folks never know when we’re going to adjust our scoring. It’s pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring.”
Some search engines are even concerned about people linking out too much. Whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated.
“A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page’s hub score. At the same time, the most wide-spread method for creating a massive number of outgoing links is directory cloning”.
Linking out has some added benefits, such as the likelihood of a link back either automatically through trackbacks or manually if the webmaster follows the referral logs and likes your site.
TrustRank / Bad Page Rank
It’s likely that after results are generated based on relevance, PageRank and then Trust Rank are applied to help order the results. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links, cloaking, duplicate content or fake Whois data are found) and gain trust for certain properties (domain age, traffic, being one of a number of important “seed” sites that are manually tagged as trusted sites). These initial Trust Ranks could then be propagated in a similar way to PageRank, so linking to and from “bad neighborhoods” would negatively affect the site’s Trust Rank through association. For the math, see here, here and here.
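Propagating trust "in a similar way to PageRank" can be sketched as a PageRank variant whose random jump always lands on a seed page. This is a simplified illustration, not any search engine's implementation: the function name `trust_scores` and the toy graph are assumptions, and the sketch only propagates trust forward along links, so it does not model the penalty for linking to bad neighborhoods mentioned above.

```python
def trust_scores(links, seeds, d=0.85, iterations=100):
    """Propagate trust from hand-picked seed pages along outgoing links."""
    pages = sorted(set(links) | set(seeds)
                   | {t for ts in links.values() for t in ts})
    # The random jump always returns to a seed, never to an arbitrary page.
    jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(jump)
    for _ in range(iterations):
        new = {p: (1.0 - d) * jump[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each page splits its trust evenly between its outlinks.
                new[target] += d * trust[page] / len(targets)
        trust = new
    return trust


web = {
    "university": ["library", "news"],
    "library": ["news"],
    "news": ["university"],
    "spam1": ["spam2"],
    "spam2": ["spam1", "news"],
}
trust = trust_scores(web, seeds={"university"})
for page, score in sorted(trust.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The two spam pages link to each other and to a trusted page, but no trusted page links to them, so they end up with a trust score of zero however heavily they interlink.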
From SEO By The Sea:
In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo’s description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google’s Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.
…
If you’ve ever heard or seen the phrase “TrustRank” before, it’s possible that whoever was writing about it, or referring to it was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn’t helped by Google trademarking the term “TrustRank” in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records at the US PTO Tess database. However, a patent called “Search result ranking based on trust” deals with something called trust rank, filed on May 9, 2006.
Google mentions distrust and changes in trust as indicators, which suggests that analysis of how trust varies, rather than of trust levels alone, is on the way. Fake reviews, sponsored blogs and the influence of e-commerce trust networks are also pointed out.
The paper “A Cautious Surfer for PageRank” comments on why TrustRank shouldn’t be overused:
“However, the goal of a search engine is to find good quality results; spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds.”
Improvements to Google’s ranking algorithms
There have been a number of notable algorithm changes which made considerable changes appear to results pages, though often the effects were later scaled back slightly.
• Increasing use of anchor text
Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number that indicated the site’s popularity and information about the content of a document and so its relevance for user queries.
• Florida, November 2003
Results for highly commercial queries, likely informed from the cost of Adwords, became heavily filtered so more trusted academic websites and less commercial optimised websites ranked. Some of these changes resulted in less relevance, for example if a user was searching for “buy bricks” they probably didn’t want to mainly see websites about the process of creating bricks, and were rolled back. For more see here and here.
• NoFollow, January 2005
Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect and incentive for blog spam. If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow.
Here is some example HTML code of a nofollow link:
<a href="signin.php" rel="nofollow">sign in</a>
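For illustration, a crawler could separate followed links from nofollow links with Python’s standard `html.parser` module. This is a minimal sketch; the class name `LinkCollector` and the second sample link are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel can hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


collector = LinkCollector()
collector.feed('<a href="signin.php" rel="nofollow">sign in</a> '
               '<a href="about.html">about</a>')
print(collector.links)
# [('signin.php', True), ('about.html', False)]
```

Only the links flagged `False` would be counted as votes when building the link graph.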
• Bourbon, June 2005
A penalty was applied to sites with unusually fast or bursty patterns of link growth.
• Jagger, October 2005
A penalty was applied to sites with unusually large numbers of reciprocal links, and new methods for detecting hidden text were introduced.
• Big Daddy, December 2005
According to Matt Cutts, punished were “sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling.”
• Google Bombing Prevention, 2nd February 2007
Google Bombing is the process of massively linking to a page with a specific anchor text, to give PageRank but more importantly indications that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to Microsoft.com with the anchor text “more evil than Satan himself”. This resulted in Microsoft being placed number one in searches for “more evil than Satan himself” despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as “miserable failure” would “typically return commentary, discussions, and articles” about the tactic itself. Matt Cutts said the Google bombs had not “been a very high priority for us. Over time, we’ve seen more people assume that they are Google’s opinion, or that Google has hand-coded the results for these Google-bombed queries. That’s not true, and it seemed like it was worth trying to correct that perception.” Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.
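One simple way to spot identical anchor text is to measure what share of a page’s incoming links use the single most common anchor. The sketch below is a hypothetical illustration, not Google’s method: the function name `anchor_text_concentration` and the sample anchors are made up, and a real system would also look at when the links appeared.

```python
from collections import Counter


def anchor_text_concentration(anchor_texts):
    """Return the most common anchor text and the share of links using it."""
    normalised = [text.strip().lower() for text in anchor_texts if text.strip()]
    if not normalised:
        return None, 0.0
    text, count = Counter(normalised).most_common(1)[0]
    return text, count / len(normalised)


natural = ["Acme", "acme.com", "this review", "Acme widgets", "here"]
bombed = ["miserable failure"] * 9 + ["home page"]
print(anchor_text_concentration(natural))  # ('acme', 0.2)
print(anchor_text_concentration(bombed))   # ('miserable failure', 0.9)
```

A naturally linked page tends to attract varied anchors, so a very high share for one phrase is a cheap warning sign, which fits the observation above that bombs with varied anchor text are harder to catch.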
• Caffeine, October 2010
A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates.
• Panda, April 2011



