Spam in Poisoned World Cup Results
After staying up until four in the morning to watch a terrible draw between America and England, I didn’t feel like repeating the experience when England’s next match against Algeria was scheduled for the same time.
So, waking fresh the next day, I searched on Google to find the result and this is what I saw:

Whilst the result was again disappointing, it’s great to see Google doing a good job of realising what my intention was, and displaying the England-Algeria result in a lovely little box.
What they didn’t do a good job of, though, was keeping the spam out of this popular search term.
Directly below FIFA.com are two suspicious-looking results.
In fact, out of the top 20 results, there are 3 domains pushing spammy content which share the following properties:
URLs of the format ….php?x=y
Nonsensical text interlaced with the names of countries playing in the World Cup and betting keywords
Educational domains
All pages have now been removed (perhaps after prompting from a company such as Google)
There were links on each page to two other pages on each site
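The URL-level properties in this list can be turned into a rough filter. The sketch below is only an illustration: the function name `looks_suspicious`, the example URLs, and the rule that "educational" means a hostname ending in `.edu` or containing `.edu.` or `.ac.` are all my own assumptions, and a real filter would need far more signals than this.

```python
from urllib.parse import urlparse, parse_qsl


def looks_suspicious(url: str) -> bool:
    """Hypothetical heuristic: educational host + .php page + one query parameter."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()

    # Assumption: treat .edu, .edu.xx and .ac.xx hosts as educational domains.
    is_educational = host.endswith(".edu") or ".edu." in host or ".ac." in host

    # The spam pages used a PHP script with a single key=value pair.
    is_php = parts.path.lower().endswith(".php")
    has_single_param = len(parse_qsl(parts.query, keep_blank_values=True)) == 1

    return is_educational and is_php and has_single_param


if __name__ == "__main__":
    for candidate in (
        "http://www.example.edu/page.php?x=y",
        "http://www.example.com/page.php?x=y",
        "http://www.example.edu/index.html",
    ):
        print(candidate, looks_suspicious(candidate))
```

A rule this crude would also flag plenty of legitimate university pages, so at best it picks out candidates for a closer look.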
It seems clear that these sites were hacked by the same person, given that all pages have now been removed and that the sites hosting the documents appear to be trusted educational institutions.
I couldn’t locate any outgoing links or hijack code in the HTML source, however, so perhaps the spammers were waiting before taking advantage of their traffic.
Also, as the pages had been taken down, I couldn’t see whether they were randomly generated on the fly or stayed the same on each access.
(Say, by using a hash of the URL as the seed for the random joining together of text. Such pages are harder to detect as spam. For example, Microsoft only realised that 22% of their German index was spam from one individual when they noticed that large numbers of pages were completely changing every day.)
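To make the hash-as-seed idea concrete, here is a minimal sketch of how a page could look random yet come out identical on every request. I never saw the spammers' code, so everything here is assumed: the word list, the function name `generate_page`, and the choice of SHA-256.

```python
import hashlib
import random

# Hypothetical word pool; the real pages mixed country names with betting terms.
WORD_POOL = [
    "England", "Algeria", "Cameroon", "Slovenia",
    "betting", "odds", "bookmaker", "wager",
    "the", "result", "tonight", "match", "best", "online", "live",
]


def generate_page(url: str, n_words: int = 30) -> str:
    """Return pseudo-random text that depends only on the URL."""
    # Hash the URL and use the first 8 bytes of the digest as the RNG seed.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # A private Random instance keeps the output reproducible per URL.
    rng = random.Random(seed)
    return " ".join(rng.choice(WORD_POOL) for _ in range(n_words))


if __name__ == "__main__":
    first = generate_page("http://www.example.edu/page.php?x=1")
    second = generate_page("http://www.example.edu/page.php?x=1")
    print(first == second)  # same URL, same text on every access
```

Because the text is a pure function of the URL, a crawler that revisits the page sees no change, which removes the "everything changed overnight" signal mentioned above.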
I was surprised to see how well these pages ranked, but there are some useful things to be learnt from this:
Detecting duplicate content is hard (a naive pairwise comparison of all texts in an index needs at least O(n²) comparisons, which makes full duplicate-content detection impractical at web scale. Shortcuts are taken, such as “fingerprinting” the positions and frequency of unusual words, but there are important trade-offs)
As is determining if text makes semantic sense (until Google build an AI..)
A large number of interlinked pages suddenly appearing on a well-trusted site can rank well, even if they are completely unrelated to the site’s typical content and are about a spammy topic such as “betting”.
Spammers are increasingly targeting hot topics, due to the huge number of queries and the lack of competition. For example, StopTheHacker.com has a good write-up on spammers poisoning Google Trends topics, and Symantec are reporting a swathe of World Cup email spam.
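As a rough illustration of the “fingerprinting” shortcut mentioned in the first point, the sketch below groups documents by a hash of their rarest words and how often each occurs, instead of comparing every pair. It is a simplification of my own and not how any search engine actually does it: it ignores word positions, and the names `fingerprint` and `group_near_duplicates` and the default `k=3` are made up for the example.

```python
import hashlib
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def fingerprint(text: str, doc_freq: Counter, k: int = 3) -> str:
    """Hash the k rarest words in the text together with their counts."""
    counts = Counter(tokenize(text))
    # Rarest first (lowest document frequency); ties broken alphabetically.
    rare = sorted(counts, key=lambda w: (doc_freq[w], w))[:k]
    key = "|".join(f"{w}:{counts[w]}" for w in rare)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def group_near_duplicates(docs: dict[str, str], k: int = 3) -> list[list[str]]:
    """Bucket documents by fingerprint; no pairwise comparison is needed."""
    # First pass: count how many documents each word appears in.
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))

    # Second pass: documents sharing a fingerprint land in the same bucket.
    buckets = defaultdict(list)
    for name, text in docs.items():
        buckets[fingerprint(text, doc_freq, k)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]


if __name__ == "__main__":
    pages = {
        "a": "england algeria betting odds result",
        "b": "betting odds england algeria result",  # same words, shuffled
        "c": "weather forecast london rain today",
    }
    print(group_near_duplicates(pages))
```

The trade-off shows up quickly: swap a single rare word in one of the shuffled pages and its fingerprint changes, so the pair is no longer grouped. That is exactly the kind of variation randomly assembled spam text can exploit.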
Similar searches on Google, such as “England Cameroon”, seem to return fewer of these spammy results, but still some. This is surprising, as this match doesn’t exist in this World Cup.
I didn’t see these pages in the top 30 results for “england algeria” on either Bing or Yahoo. However, Bing and Yahoo are most likely protected not by superior spam filters but by the lower importance they place on freshness (have a look at Google’s “Query Deserves Freshness” initiative) and by their slower indexing (I’m currently running an experiment where I registered 25 web 2.0 sites on an unknown keyword; about 6 hours on, Google has indexed 12 of them, Yahoo and Bing 0).

