3 How Search Engines Work
History of Search Engines
The first mechanised information retrieval systems were built by the US military to analyse the mass of documents captured from the Germans. Research was boosted when the UK and US governments funded work to close a perceived “science gap” with the USSR. By the time the internet was becoming commonplace in the early 1990s, information retrieval was at an advanced stage: sophisticated methods, primarily statistical, had been developed, and archives of thousands of documents could be searched in seconds.
Web search engines are a special case of information retrieval systems, applied to the massive collection of documents available on the internet. A typical search engine in the early 1990s was split into two parts: a web spider that traversed the web following links and built a local index of the pages, and traditional information retrieval methods that searched the index for pages relevant to the user’s query and ordered them by some ranking function. Many factors influence a person’s judgement of what is relevant, such as their current task, context and the freshness of the information.
In 1998 pages were primarily ranked by their textual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate, and as the Internet became ever more commercialised the noise from spam in SERPs (search engine results pages) made search a frustrating activity. It was also hard to distinguish websites that more people would want to visit, for example a celebrity’s official home page, from less wanted websites with similar content, for example a fan site. For these reasons directory sites such as Yahoo were still popular, despite being out of date and leaving the user to work out relevance for themselves.
The innovation of PageRank (named after Google co-founder Larry Page, who developed it with Sergey Brin), and of a similar algorithm also released in 1998, Jon Kleinberg’s Hyperlink-Induced Topic Search (HITS), was to use the additional meta information contained in the link structure of the Internet. A more detailed description of PageRank will follow in [chapter], but for now Google’s own description will suffice:
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.
Whilst it is impossible to know how Google has evolved its algorithms since the 1998 paper that launched PageRank, or how an efficient real-world implementation differs from the theory, Google itself says the PageRank algorithm remains “the heart of Google’s software … and continues to provide the basis for all of [their] web search tools”. Search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says over 200 ranking factors are now considered for each search) and indexing a growing Internet ever more rapidly.
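While the production algorithm is far more elaborate, the idea in the quote above can be illustrated with a minimal power-iteration sketch. The example graph, the damping factor of 0.85 and the convergence threshold below are illustrative assumptions, not Google’s actual values.

```python
# Minimal PageRank by power iteration (illustrative sketch only).

def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share       # each link is a weighted "vote"
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank

# Pages A and C both link to B, so B ends up with the highest score.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```

The key property is visible even in this toy: a page’s score depends not just on how many links point to it, but on the scores of the pages casting those “votes”.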
The building of a system as complex as a modern search engine is all about balancing different positive qualities. For example, you could effectively prevent low-quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suffer. Some things, such as getting a computer to analyse a document with the same quality as a human, are still impossible today, but Google in particular is pushing boundaries and getting ever closer.
Search engines have some particular considerations:
The response time to a user’s query must be lightning fast.
Unlike the documents in a traditional library retrieval system, the pages on the Internet are constantly changing.
Search engines need to work with billions of users searching through trillions of documents, distributed across the Earth.
Spam and Manipulation
Actively engaging against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author who packs a long title with words their readers may be interested in, but that’s about the worst of it. When designing a search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to find the easiest ways to affect your results; a common term for this relationship is “Adversarial Information Retrieval”. The relationship between the owner of a website trying to rank highly and the search engine designer is an adversarial relationship in a zero-sum game: assuming the results were better before, every gain for the website owner is a loss for the search engine designer. Classifying where your efforts cross over from helping a search engine be aware of your website’s content and popularity, which should improve its results, to ranking beyond your means and decreasing the quality of its results, can be somewhat tricky. The practicalities of what search engines consider to be spam, and as importantly what they can detect and fix, will be discussed later.
According to “Web Spam Taxonomy”, approximately 10–15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million highly interlinked pages at the centre of the Internet, which are less likely to be spam; documents further away (in link steps) from this core are more likely to be spam.
Deciding the quality of a document well (say, whether it was written by an expert in the field or generated by a computer program using natural language processing) is an AI-complete problem; that is, it won’t be possible until we have artificial intelligence that can match that of a human.
However, search engines hope to get spam under control by lessening the financial incentive of spam. This quote from a Microsoft Research paper expresses this nicely:
“Effectively detecting web spam is essentially an arms race between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web. Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content.”
Google developers for their part describe web spam as the following, citing the detrimental impact it has upon users:
“These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user’s query.”
How a Search Engine Works
A typical search engine can be split into two parts: indexing, where the Internet is transformed into an internal representation that can be searched efficiently, and the query process, where the index is searched for the user’s query and matching documents are ranked and returned to the user in a list.
A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to find documents across the web, storing the content of the pages and associated meta data (such as the date of indexing and which page linked to the site). In a modern search engine the crawler runs constantly, downloading thousands of pages simultaneously, to continuously update and expand the index. A good crawler will cover a large percentage of the pages on the Internet, and will visit popular pages frequently to keep its index fresh. The crawler connects to the web server and uses an HTTP request to retrieve the document, but only if it has changed. On average, web page updates follow a Poisson process; that is, the crawler can expect the time until a page next changes to follow an exponential distribution. Crawlers now also index near real-time data from sources such as RSS feeds and the Twitter API, and handle a range of formats such as PDFs and Flash, which are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or an XML sitemap, but the timing of the update is still up to the crawler.

The document data store holds the text and meta data the crawler retrieves; it must allow very fast access to a large number of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google’s original patent used a system called BigTable; Google now keeps documents in sections called shards distributed over a range of data centres, which offers performance, redundancy and security benefits.
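The crawl loop described above can be sketched in a few lines. This is a toy breadth-first crawler built on assumptions: the seed URL is a placeholder, link extraction uses a naive regular expression, and a real crawler adds robots.txt handling, politeness delays and massive parallelism. The conditional If-Modified-Since request shows how a crawler can avoid re-downloading unchanged pages.

```python
# Toy breadth-first crawler sketch (illustrative, not production code).
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed, store=None, max_pages=50):
    """store: url -> (last_modified_header, html) from a previous crawl,
    reused so unchanged pages are skipped via If-Modified-Since."""
    store = {} if store is None else store
    queue = deque([seed])
    seen = {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        headers = {}
        last_modified = store.get(url, (None, None))[0]
        if last_modified:
            headers["If-Modified-Since"] = last_modified  # only re-download if changed
        try:
            resp = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException:
            continue
        fetched += 1
        if resp.status_code == 304:   # not modified since the last crawl
            continue
        if resp.status_code != 200:
            continue
        store[url] = (resp.headers.get("Last-Modified"), resp.text)
        # Naive link extraction; a real crawler parses HTML properly.
        for href in re.findall(r'href="([^"#]+)"', resp.text):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return store

# index = crawl("https://example.com/")          # first crawl (hypothetical seed)
# index = crawl("https://example.com/", index)   # later re-crawl, reusing Last-Modified
```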
Duplicate Content Detection
Detecting exact duplicates is easy: remove the boilerplate content (menus etc.), then compare the core text through checksums. Detecting near duplicates is harder, particularly if you want an algorithm fast enough to compare a document against every other document in the index. To perform faster duplicate detection, fingerprints of each document are taken.
A simple fingerprinting algorithm for this is outlined here; a code sketch follows the list:
1. Parse the document into words, and remove formatting content such as punctuation and HTML tags.
2. The words are grouped into n-grams (a 3-gram being three consecutive words, a 4-gram four, and so on)
3. Some of these n-grams are selected to represent a document
4. The selected n-grams are hashed to create a shorter description
5. The hash values are stored in a quick look up database
6. The documents are compared by looking at overlaps of fingerprints.
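A minimal sketch of these steps, assuming an MD5 hash of word 3-grams and a simple modulus rule for selecting which fingerprints to keep. Both choices are illustrative, not any particular engine’s scheme.

```python
# Sketch of n-gram fingerprinting for near-duplicate detection.
import hashlib
import re

def fingerprints(text, n=3, modulus=4):
    words = re.findall(r"[a-z0-9]+", text.lower())                          # step 1
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]  # step 2
    hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16) for g in ngrams} # step 4
    # Steps 3 and 5: keep a sample of the hashes (here, those divisible by
    # the modulus) as the document's stored fingerprint set.
    return {h for h in hashes if h % modulus == 0}

def similarity(doc_a, doc_b):
    """Step 6: compare documents by the overlap of their fingerprints."""
    a, b = fingerprints(doc_a), fingerprints(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(similarity("the quick brown fox jumps over the lazy dog",
                 "a quick brown fox jumped over the lazy dog"))
```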
A paper by four Google employees found the following statistics across their index of the web.
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Most common trigram in English: “all rights reserved”
Unusual patterns of n-grams can also be used to detect low-quality or spam documents.
Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed, looking for markup such as <a></a>, to determine which parts of the text are plain text, which are links, and so on.
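As a rough illustration, the sketch below uses Python’s built-in HTML parser to emit tokens and record whether each one appeared inside a link; a production tokenizer handles far more markup and malformed HTML.

```python
# Minimal tokenizer sketch: split an HTML document into word tokens while
# recording which tokens appeared inside <a> links.
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []        # (token, "text" or "link")
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        kind = "link" if self.in_link else "text"
        for word in data.split():
            self.tokens.append((word.strip(".,!?").lower(), kind))

parser = Tokenizer()
parser.feed('<p>Search engines <a href="/crawl">crawl the web</a> constantly.</p>')
print(parser.tokens)
```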
Sections of a document that are pure content are identified, in an attempt to ignore “boilerplate” content such as navigation menus. A simple way is to look for sections with few HTML tags; more complicated methods consider the visual layout of the page.
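A sketch of that tag-density idea; the rule for splitting the page into blocks and the 0.7 threshold are arbitrary illustrative choices, not a production heuristic.

```python
# Crude boilerplate filter: keep only blocks where text heavily outweighs markup.
import re

def main_content(html, threshold=0.7):
    blocks = re.split(r"</(?:div|p|section|li)>", html)
    kept = []
    for block in blocks:
        tags = len(re.findall(r"<[^>]+>", block))
        text = re.sub(r"<[^>]+>", " ", block)
        words = len(text.split())
        if words and words / (words + tags) >= threshold:
            kept.append(" ".join(text.split()))
    return kept

html = ('<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>Search engines try to separate real article text from menus '
        'and other boilerplate before indexing it.</p>')
print(main_content(html))   # the navigation div is dropped, the paragraph kept
```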
Common words such as “the” and “and” are removed to increase the efficiency of the search engine, at the cost of a slight loss in accuracy. In general, the more unusual a word, the better it is at determining whether a document is relevant.
Stemming reduces words to their stem; for example, “computer” and “computing” both become “comput”. Stemming typically improves relevance by around 10% in English, and by up to 50% in Arabic. The classic algorithm is the Porter stemmer, which works through a series of rules such as “replace sses with ss, so stresses -> stress”.
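A toy suffix-stripping stemmer in the spirit of the Porter rule quoted above. The rule list is a tiny illustrative subset, so it will not reproduce the full Porter output in every case.

```python
# Toy suffix-stripping stemmer (illustrative subset of Porter-style rules).
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # Only strip if a reasonable stem (3+ characters) would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

print([stem(w) for w in ["stresses", "computing", "computers", "ponies"]])
# ['stress', 'comput', 'computer', 'poni']
```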
Trying to determine the meaning of text is very difficult in general, but certain words and phrases can give clues. For example, the pattern “x has worked at y” is useful when building an index of employees.
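As a rough illustration, a single regular expression can pull (person, employer) pairs out of that pattern. Real systems use much richer natural language processing; the pattern and sample sentence here are invented examples.

```python
# Illustrative relation extraction from the "X has worked at Y" pattern.
import re

PATTERN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) has worked at ((?:[A-Z][A-Za-z&.]+ ?)+)"
)

text = "Alice Smith has worked at Example Corp since 2005."
for person, employer in PATTERN.findall(text):
    print(person, "->", employer.strip())   # Alice Smith -> Example Corp
```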
Document statistics such as word counts are stored for use in ranking algorithms. An inverted index is created to allow fast full-text searches. The index is distributed across multiple data centres around the globe.
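A minimal sketch of an inverted index, mapping each term to the documents and positions where it occurs, with per-document lengths kept for ranking. The document IDs and tokens are invented examples.

```python
# Minimal inverted index sketch.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> list of tokens (already stemmed/filtered)."""
    index = defaultdict(list)            # term -> [(doc_id, position), ...]
    doc_lengths = {}
    for doc_id, tokens in docs.items():
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            index[term].append((doc_id, position))
    return index, doc_lengths

index, lengths = build_index({
    "d1": ["search", "engine", "crawl", "web"],
    "d2": ["web", "spam", "detect"],
})
print(index["web"])   # [('d1', 3), ('d2', 0)]
```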
The user is provided with an interface in which to enter their query. The query is then transformed using techniques similar to those applied to documents, such as stemming, together with spell checking and expansion to synonymous queries. After the document set has been ranked, the top results are displayed together with snippets showing how they were matched.
A scoring function calculates scores for documents. Some parts of the scoring can be performed at query time, others at document processing time.
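As an illustration of that split, the toy TF-IDF scorer below counts term frequencies per document (which could be done at document-processing time) and combines them with inverse document frequency at query time. The formula is a textbook example, not any engine’s actual scoring function.

```python
# Toy TF-IDF scoring sketch.
import math
from collections import Counter

def score(query_terms, docs):
    """docs: dict of doc_id -> list of tokens. Returns docs ranked by score."""
    n = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}   # index time
    df = Counter(term for counts in tf.values() for term in counts)
    scores = Counter()
    for term in query_terms:                                            # query time
        if df[term] == 0:
            continue
        idf = math.log(1 + n / df[term])        # rarer terms weigh more
        for doc_id, counts in tf.items():
            if counts[term]:
                scores[doc_id] += (counts[term] / len(docs[doc_id])) * idf
    return scores.most_common()

print(score(["web", "spam"], {"d1": ["search", "engine", "web"],
                              "d2": ["web", "spam", "detect"]}))
```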
Users’ queries and their actions are logged in detail to improve results. For example, if a user clicks on a result and then quickly performs the same search again, it is likely that they clicked a poor result.
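A sketch of how such a signal might be mined from a query log. The log format, user IDs and the 30-second window are assumptions for illustration only.

```python
# Count clicks that were likely poor results: the same user repeats the
# same query within a short window after clicking something.
from collections import Counter

def poor_clicks(log, window=30):
    """log: list of (timestamp, user, query, clicked_url) sorted by time."""
    bad = Counter()
    for (t1, user1, q1, url1), (t2, user2, q2, _) in zip(log, log[1:]):
        if user1 == user2 and q1 == q2 and url1 and (t2 - t1) <= window:
            bad[url1] += 1
    return bad

log = [(0, "u1", "python gil", "siteA.com"),
       (12, "u1", "python gil", "siteB.com"),
       (300, "u2", "best pizza", "siteC.com")]
print(poor_clicks(log))   # siteA.com was clicked, then the same search ran 12s later
```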