Data Driven Security took the time to analyze the raw data that I published in my recent post on Sucuri blog about how I used Bitly data to understand the scale of the Darkleech infection.
In their article, they have a few questions about data formats, meaning of certain fields and some inconsistencies, so I’ll try to answer their questions here and explain how I worked with the data.
So I needed to get information about all the links of the “grantdad” bitly account.
I checked the API and somehow missed the “link_history” API request (it was the first time I worked with the bitly API), so I decided to screenscrape the web pages where bitly listed the links creaded by grantdad. 1,000 pages with 10 links on each. Since pages didn’t contain all the information I needed I only collected the short links so that I could use them later in API calls to get more detailed information about each of them.
As you can see I was limited with the 10,000 links that bitly made available via their web interface. Not sure if I could get more links via that link_history API. Right now it returns “ACCOUNT_SUSPENDED” and when it was not suspended, API calls for known links beyond those 10,000 produced various errors.
When I compiled my list of 10,000 links I used the following API calls to get information about each of the links:
API:link/referring_domains — for each link, it returned a list of domains referring traffic to this link (in our case sites, containing the iframes) along with the number of clicks from each domain. This helped me compile this list of iframe loads per domain: http://pastebin.com/HYaY2yMb. Then I tried to resolve each of the domain names and created this list of infected domains per IP address: http://pastebin.com/Gxr51Nc1.
I also used this API call to get number of iframe loads for each bitly link.
API:link/info – this gave me timestamps of the links and their long URLs (the rest information was not interesting for my research). Unfortunately, this particular API call is poorly documented so I can only guess what this timestamp mean. It is actually called “indexed“. But I guess it’s the time when the link was created. It is definiteley not the time of the first click because there were no registered clicks for many of the links. As a result, I compiled these datasets:
which contain tab separated values of “bitly link id”, “timestamp”, “# of clicks/iframe loads”, and “long URL”.
At that point, I already had the number of iframe loads from the previous step (“link/referring_domains“). Then, for readability, I converted the numeric timestamp using this Python function datetime.datetime.fromtimestamp(). However, you can notice that the second dataset (for January 28th) has a different date format. Instead of “2014-01-25 19:10:07” it uses “Jan 28 23:10:19“. Why? Because of pastebin.com. Because it doesn’t allow unregistered users to post more than 500Kb of data (I deliberately post such data as a guest). Removing the year from the date allowed me to save 4 bytes on each row and fit the dataset in 500Kb.
Actually this 500Kb limit is the reason why I have separate pastebins for each date and only specify bitly ids instead of the full bitly links.
And finally, API:link/countries – for each link, it returned a list of countries referring traffic to this link along with the number of clicks from those countries. This helped me compile this list of iframe loads per country http://pastebin.com/SZJMw3vx.
When I wrote my blogpost on February 5th, I noticed that there were new links available on the Bitly.com grantdad acount page. I began to browse pages of his links and figured that most of the links were the same (Jan 25 and 28) but around 1,700 of them were new (late February 4th and the beginning of February 5th). I immediately repeated the same procedure that included screenscraping, and 3 API calls for each new bitly link. After that I created this new dataset http://pastebin.com/YecHzQ1W and updated the rest datasets (domains, countries, etc).
If you check the total numbers in these two datasets (domains) http://pastebin.com/HYaY2yMb and (countries) http://pastebin.com/SZJMw3vx you’ll see the different in total number of iframe loads: 87152 and 87269. I don’t know why. I used the same links just different API calls (referring_domains and countries) and the totals are supposed to be the same. I didn’t save the “per link” data so can’t tell exactly whether it was my error or the bitly API produces slightly inconsistent results. Anyway, the difference is neglectable for my calculations and doesn’t affect the estimations (given that I tried to underestimate when in doubt ).
I didn’t check for duplicates that guys from Data Driven security found in my datasets. They must have to do with my screenscraper and the way that Bitly.com displays links in user accounts. The total number of scraped links matched the number of links that Bitly reported for the user. The extra 121 clicks that these duplicates are responsible for are so close to the 117 difference that I mentioned above so I wonder if these numbers somehow connected? Anyway, this shouldn’t affect the accuracy of the estimates either.
For the map, I used the Geochart from Google visualization. Since the difference in numbers of clicks for the top 3 countries and the lower half of the list was 3 orders of magnitude, I had to re-scale the data so that we could still see the difference between countries with 50 iframe loads and 2 iframe loads.. I used a square of a natural logarithm for that
Math.pow( Math.round( Math.log(i)), 2)
This gave me a nice distribution from 0 to 100 and the below map.
After checking the link creation per minute-of-an-hour distribution, Bob Rudis wondered:
This is either the world’s most inconsistent (or crafty) cron-job or grantdad like to press ↑ + ENTER alot.
I guess, it was neither cron-job nor manual link creation. I think the [bitly] links were created on-demand. So if someone loads a page and Darkleech needs to inject an iframe then it (or rather the server it works with) generates a new bitly link. Given that this malware may lurk on a server, there may be time periods when no malicious code is being injection into any web pages across all the infected servers. At the same time, the bitly links with zero clicks may refer to page loads by various bots (including our own) when the malicious code is injected but the iframe is never loaded. On the other hand, the volume of bitly links with 0 clicks (~35%) suggests that there might really more complex link generation mechanism than a simple “on-demand” approach.
Anyway, that’s great when people double check our data and try to find new interesting patterns there. Make sure to let me know (here or on the Sucuri blog) if you find more patterns and inconsistencies, or have any comments on how the Darkleech works.