In this post, I’ll show how cybercriminals used hacked high-profile sites to drive search traffic to online stores that sell pirated copies of popular software and, presumably, steal credit card details.
I’ve been watching this sort of search spam for more than a year now. After this post in Google’s Webmaster Help forum, I decided to take a closer look at this problem.
Millions of interlinked spam pages are hosted on hacked high-profile websites, which makes them rank well on Google and occupy top positions in search results for the keywords targeted by spammers.
Hacked sites include:
For example, if you search for “Cheap Vista for Students” on Google, you’ll see something like this:
Almost 20 million results. Impressive, isn’t it? And although Google wouldn’t show more than 350 results (too little unique content), 99% of them were spam.
As you can see, the first page of results contains links mainly to reputable .edu domains. However, if I click on these links, an online store that sells pirated software opens.
This redirect is not caused by a trojan on my computer (I’m on Linux). The HTTP headers reveal a server-side 301 redirect from the .edu site to soft4pcs .com:
HTTP/1.1 301 Moved Permanently
Date: Mon, 28 Sep 2009 11:15:54 GMT
Server: Apache/2.2.3 (Ubuntu) PHP/5.2.1
Cache-Control: no-cache, must-revalidate
Location: http://soft4pcs .com/shop/item/47/?cpn=wmtu_resnet_mtu_edu_soft2
Keep-Alive: timeout=15, max=100
Content-Type: text/html; charset=UTF-8
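If you want to repeat this check without a browser, here is a minimal Python sketch (not the tool I used for this post). It sends a single request, does not follow redirects, and reports the status code and the `Location` header. The function names and the default User-Agent string are my own placeholders.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def describe_redirect(status, headers):
    """Return (is_redirect, target) for a status code and a list of (name, value) headers."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers}
    if status in REDIRECT_CODES:
        return True, normalized.get("location")
    return False, None


def fetch_without_following(url, user_agent="Mozilla/5.0", timeout=10):
    """Send one GET request and return (status, headers) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheaders()
    finally:
        conn.close()
```

Calling `describe_redirect(*fetch_without_following(url))` on a suspicious search result shows whether the server itself answers with a redirect, which rules out anything happening on the visitor’s machine.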
Alternative names of the same site: soft4windows .com, download-journal .com, oem-box .com – let’s call them SOFT4
Signs that this is not a legal site:
If you decide to buy something in this store, you’ll be redirected to a “secure” order form on bill4soft . com/order/shop.
The payment site is actually the same SOFT4 (you can see it if you open the home page on bill4soft .com). It’s just an alternative domain name with a “verified” security certificate. However, as you can see, the fact that the certificate is verified doesn’t add any trust to it: the only information it provides about the owner is the domain name.
I’m not sure if you get an opportunity to download the software if you pay (on the FAQ page they mention that “sometimes” their email with download links can be “mistakenly” blocked by ISPs and deleted as spam), but I’m sure that the cybercriminals will find a “creative” way to use your credit card number and all the personal details you’ve just provided to them.
Well, now you can see why they hack sites and spam Google: it’s a profitable (though illegal) business.
Now let’s talk about how they game Google and webmasters.
Many high-ranked websites have been hacked to host spammy intermediary pages. Pages on established, trusted domains rank better than similar pages on unknown and lesser-known sites.
Where possible, hackers used legitimate web pages as templates for their spammy pages. They just replaced normal content of web pages with spammy keywords and links, preserving the markup. This way search engines shouldn’t be alerted since the pages don’t look alien.
Hackers usually create a few hundred to a few thousand such spammy pages that target specific keywords. To increase their ranking and make them all discoverable by search engine bots, these pages are interlinked.
The spammy pages are only for Google. I can see them only if I switch my browser’s User Agent to Googlebot (I used the User Agent Switcher Firefox plugin). Otherwise, I get a standard “404 – Page not found” error. This helps to hide the hack from webmasters, who might think that Google mistakenly indexed non-existent pages on their website. (You can hear such speculations on Google’s Webmaster Help forum quite often.) This “black hat” technique is called cloaking.
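The User Agent switch is easy to script. The sketch below is only an illustration in Python: it requests the same URL once with a browser User-Agent and once with a Googlebot User-Agent, then compares the results. The function names and the browser string are my own choices, and a script that also checks the visitor’s IP address would not be fooled by this.

```python
import urllib.error
import urllib.request

# Example User-Agent strings; any ordinary browser string should do for the first one.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_as(url, user_agent, timeout=10):
    """Return (status, body) for url requested with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; keep their status and body.
        return error.code, error.read()


def compare_responses(browser_result, bot_result):
    """Compare two (status, body) pairs and describe how they differ."""
    browser_status, browser_body = browser_result
    bot_status, bot_body = bot_result
    if browser_status != bot_status:
        return "status differs"
    if browser_body != bot_body:
        # Weak signal on its own: dynamic pages can differ between any two requests.
        return "content differs"
    return "same"
```

For a cloaked URL, `compare_responses(fetch_as(url, BROWSER_UA), fetch_as(url, GOOGLEBOT_UA))` would be expected to report a difference. Note that `urlopen` follows redirects, so a redirected browser request shows up as different content rather than as a 301.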
As far as I can understand, the hackers need to hide their illicit pages from site owners (404 error), show spammy pages to Googlebot, and redirect the rest of the visitors to the SOFT4 site.
Driving traffic to the SOFT4 site is the real purpose of these spam pages. They should rank well on Google (they really do), and when users click on the search result links, instead of unintelligible spam pages they get redirected to real rogue online stores.
I have yet to figure out what triggers the redirect instead of the 404 error. On the same machine, I consistently get redirects from Firefox 3 under Linux, and consistently get 404 errors for the same URLs when I open them in Firefox 3.5 under WinXP.
It looks like they don’t like Firefox 3.5 for some reason. I tried a few different User Agents (Firefox 3 on Ubuntu, Firefox 3 on WinXP, IE7 on Vista, IE 7 on XP, Netscape 4.8 on Vista, Opera 9.25 on Vista) and they all redirected to the soft4pcs site. However, when I switched my browser’s User Agent to “Firefox 3.5.3 on XP”, I consistently got 404 errors. Moreover, the error pages set a cookie (e.g. site_domain_edu_soft2_visit=ban) that expires in a year, so that even if I switched the User Agent back to any of the “allowed” values, I would still get the 404 error page.
At first I thought all logic was in .htaccess files that contained conditional rewrite rules based on visitors’ User-Agents and requested URLs. This looked sensible. As you might have noticed, all spammy URLs have the same pattern:
It is easy to craft a rewrite rule with a regular expression that would route such requests to the actual spammy pages.
However, when I found a few sites powered by IIS, I had to dismiss the .htaccess hypothesis.
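To keep these observations straight, I find it helpful to write them down as a toy decision function. This is a Python model of what I saw from the outside, not the attackers’ code: the function name, the order of the checks, and the `_visit` cookie suffix test are my assumptions, and the real trigger may be something else entirely.

```python
def observed_response(user_agent, cookies):
    """Toy model of the observed behavior; not the actual cloaking script."""
    # Assumption: the Googlebot check comes first (I did not test Googlebot with a ban cookie).
    if "Googlebot" in user_agent:
        return "spam page"
    # A "ban" cookie from an earlier visit keeps producing 404s for any browser.
    if any(name.endswith("_visit") and value == "ban" for name, value in cookies.items()):
        return "404"
    # Firefox 3.5.x was the only browser User-Agent in my tests that got banned.
    if "Firefox/3.5" in user_agent:
        return "404 + ban cookie"
    # Every other User-Agent I tried was redirected to the store.
    return "301 redirect"
```

Writing it this way makes the open question visible: the branch for Firefox 3.5 fits my tests, but I cannot tell from the outside which property of the request the script really looks at.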
The only common denominator for all the hacked sites is PHP. Their HTTP headers reveal support of PHP. So most likely all the cloaking logic is inside a PHP script. This explains why 404 error pages contained hacker-defined cookies. Moreover, I’ve found a few broken scripts that reported PHP errors (they failed to include some files).
With PHP, hackers can keep their spammy pages obfuscated so that when webmasters scan their servers for keywords like “vista”, “viagra”, etc., they won’t find anything. The obfuscated pages can be decoded on the fly by the script. (A new message in the aforementioned forum thread confirmed this hypothesis: they use base64 encoding.)
The same error messages revealed possible places where hackers can hide their files. In that case the site was WordPress-powered, and the files were located in a special WordPress directory used for file uploads: wp-content/uploads. This directory usually has 777 permissions to make it possible to upload files (e.g. images, documents) directly from the WordPress web interface. Here are the paths to illicit files:
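A plain keyword search misses base64-encoded content, but a scanner can decode suspicious chunks before searching. Here is a rough Python sketch of that idea; the 40-character threshold and the function name are arbitrary choices of mine, and it will not catch content that is split up, compressed, or encoded more than once.

```python
import base64
import binascii
import re

# Runs of 40+ base64 characters, optionally padded. The threshold is arbitrary.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")


def find_hidden_keywords(text, keywords):
    """Return the set of keywords found inside base64-decoded chunks of text."""
    found = set()
    for match in BASE64_RUN.finditer(text):
        chunk = match.group(0).rstrip("=")
        # Restore padding so the chunk length is a multiple of four.
        chunk += "=" * (-len(chunk) % 4)
        try:
            decoded = base64.b64decode(chunk, validate=True)
        except (binascii.Error, ValueError):
            continue  # Not valid base64 after all; skip it.
        lowered = decoded.decode("latin-1").lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                found.add(keyword)
    return found
```

Run over the text of every PHP file on a server, this would flag files whose encoded payload mentions the spam keywords even though the raw file does not.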
Paths reported in the forum for another compromised WordPress installation:
/wp-link.php – (this file is not from a standard WP package)
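Since upload directories are a likely hiding place, a quick audit can list PHP files and world-writable directories under them. The Python sketch below is only a starting point: the function name is mine, and a hit is a reason to look closer, not proof of a hack (some plugins legitimately write PHP files into upload directories).

```python
import os
import stat


def audit_upload_dir(root):
    """Return (php_files, world_writable_dirs) found under root."""
    php_files = []
    writable_dirs = []
    for dirpath, dirnames, filenames in os.walk(root):
        # S_IWOTH is the "writable by others" bit, set on 777 directories.
        if os.stat(dirpath).st_mode & stat.S_IWOTH:
            writable_dirs.append(dirpath)
        for name in filenames:
            # Uploads are normally images and documents, so scripts stand out.
            if name.lower().endswith((".php", ".phtml")):
                php_files.append(os.path.join(dirpath, name))
    return sorted(php_files), sorted(writable_dirs)
```

On a WordPress site, you would point it at the wp-content/uploads directory and review every PHP file it reports.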
Having analyzed content on many compromised sites, I think that most of them have been hacked through vulnerabilities in the web software used on the sites. Outdated versions of blogs, wikis, CMSs, etc. can be found on almost every hacked site. And the “leader” is Moodle (an open-source, community-based tool for learning). It is very popular for educational sites, and it is very popular with all sorts of hackers and spammers. Just try this Google search to see what I mean. As you can see, there are many known vulnerabilities, and even a slightly outdated version or an improperly maintained installation is a backdoor for hackers.
Vulnerabilities in third-party scripts are not the only possible attack vector. Stolen FTP credentials and brute-force attacks shouldn’t be discounted.
Well, now that you see how cloaked spammy pages work, let’s talk about how hackers managed to make those pages rank well on Google. It’s not enough just to place them on a high authority domain. High ranking is almost impossible without external inbound links from legitimate web pages (preferably with high PR) from other sites. Links from indexed legitimate web pages are also needed to have Google discover the spammy pages.
Here comes another type of illicit content on compromised web sites: legitimate web pages with loads of cloaked spam links injected. As in the previous case, the links are only injected when the page is requested by Googlebot and there is no trace of them if it’s a normal web browser.
The Unmask Parasites tool can also be used to reveal such links.
Although the pages are not marked as suspicious here (after all these are normal links to legitimate web sites), link texts like “download windows xp professional cr-rom” can easily unmask them as alien.
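The same comparison can be done by hand: fetch a page once as a browser and once as Googlebot, then list the links that appear only in the Googlebot version. The Python sketch below shows the idea using only the standard library. It is not how Unmask Parasites works internally; the class and function names are mine.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def bot_only_links(browser_html, bot_html):
    """Return links present in the Googlebot version but not in the browser version."""
    return extract_links(bot_html) - extract_links(browser_html)
```

On a clean page the result should be empty or close to it; a long list of links to unrelated sites is a strong hint that something is injecting them for search engines only.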
To make the spam links less visible to “too curious” webmasters who might also want to check how their web pages look with Googlebot user-agent, hackers enclose the link block with the following two scripts.
These scripts make all spam links display inside a small 150×50 block. This block is not completely hidden, to avoid undesirable suspicion from Googlebot. (The screenshot of UNESCO’s IESALC page was made with scripts disabled to make the spam links prominent.)
The cloaked spam links on legitimate web pages promote spammy pages on other compromised websites and sometimes on their own site. This cross-promotion from high-ranked legitimate pages makes spammy pages rank well on Google and dominate the first pages of search results for relevant keywords.
Here is a basic scheme of this spam campaign.
I hope that posts like this can improve the situation with spam, so I appeal to all parties who can change things for the better.
To IT departments:
To Google:
I’m sure you are aware of the problem and you know how to detect cloaking. So why are those spammy pages still included in your search results? Please delist the spam sections of hacked legitimate sites. This won’t affect legitimate content and at the same time will make this sort of spamming useless. Even if you have someone manually identify and block spam sections of compromised websites used in this “pirated software” campaign, it should take just one day to complete the task. All those cloaked spam pages are hosted on just about a hundred sites and all illicit URLs have the same pattern. Correct me if I’m wrong. And don’t forget to inform webmasters so that they know their sites are hacked.
Update: I’ve reported the “Cheap Vista for Students” search as spam here.
This post reflects the situation with search spam as I see it from my computer. It is enough to make some conclusions, but without access to the compromised sites the picture is not complete. If you happen to administer one of the compromised sites, or just know more about this issue, please share your information here.
Any other comments and corrections are welcome as well.