Sunday, January 03, 2010

Content Delivery Networks

My company has a very specific requirement: we need to get our application to any desktop in the world in less than three minutes. There are business drivers for this that I shall not go into; basically it is so that potential customers don’t get bored waiting for our application to install and run. We are currently failing to do that for all users, and we suspect we are losing customers because we fall at the first fence.

The Problem is Discovered

Our installer is about 50MB, which is not huge, but we have been seeing an enormous variation in deployment times to various parts of the world. Currently we use a UK-based hosting service with high symmetric bandwidth, but routine log analysis revealed that the install times for some users exceeded 10 minutes, and many installs did not complete. A quick web search revealed that this is a well-known problem, so well known, in fact, that there are many commercial solutions that come under the generic title of Content Delivery Networks (CDNs). The big players are companies like Akamai and Limelight, but I am allergic to companies that won’t tell you the price, and I suspect our needs are too modest to be worth their while addressing. There is, however, a new class of companies like GoGrid emerging, and there are established hosting players like Amazon (with CloudFront) and Rackspace (using Limelight’s CDN) who are offering CDNs. The new kid on the block is Microsoft, who beta-launched the Azure CDN solution just as my investigations began.

CDN, like all hosting, is a highly commodified product. There are certainly modest differences in terms of things like upload flexibility (Azure stinks), clever torrent links (Amazon S3 rocks), and general UI friendliness, but there were no showstoppers. The only really important metrics are speed, reliability, and cost. Cost was easy: everyone who didn’t make it clear on their website in the first two minutes was discarded (are you starting to understand our business drivers now?), and the remaining companies were all so cheap that it wasn’t worth worrying about. This is because we are talking about a very small amount of data (50MB x 100 installs per month = 5GB), and the pricing is never more than about 25 cents per GB. These businesses are built for large streaming media and Flash media files, not for tiny desktop installers like ours.
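To make the cost argument concrete, here is a quick back-of-envelope calculation in Python. The 50MB installer, 100 installs per month, and 25 cents per GB ceiling come from the paragraph above; the function name and the choice to treat 1GB as 1000MB are my own assumptions for illustration.

```python
def monthly_cdn_cost(installer_mb, installs_per_month, price_per_gb):
    """Return (gigabytes transferred, cost in dollars) for one month of installs."""
    # Assumption: 1 GB = 1000 MB, which is close enough for an estimate.
    gb_transferred = installer_mb * installs_per_month / 1000
    return gb_transferred, gb_transferred * price_per_gb


gb, cost = monthly_cdn_cost(installer_mb=50, installs_per_month=100, price_per_gb=0.25)
print(f"{gb:.1f} GB per month, at most ${cost:.2f}")
```

At these volumes the bill is a little over a dollar a month, which is why price dropped out of the decision.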

Reliability next: we are not particularly concerned about reliability given that we are statistically unlikely to lose enough business in the difference between four nines and five nines to make it worth basing a decision on. Everybody can do four nines.
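A rough sketch of why the difference between four and five nines does not matter at our scale. The 100 installs per month is the figure used earlier; the assumption that installs are spread evenly over time is mine and is only there to keep the arithmetic simple.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def expected_failed_installs(availability, installs_per_month):
    """Installs per month expected to hit an outage, assuming installs
    are spread evenly over time (an assumption, not a measurement)."""
    return (1 - availability) * installs_per_month


for label, availability in [("four nines", 0.9999), ("five nines", 0.99999)]:
    print(
        label,
        round(downtime_minutes_per_year(availability), 1), "min/year,",
        round(expected_failed_installs(availability, 100), 3), "failed installs/month",
    )
```

Four nines works out to under an hour of downtime a year, and at 100 installs a month that is about one lost install every hundred months.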

So that left speed, which comes in two flavors: latency and bandwidth. Latency is critical for that snappy website that puts your shop window in front of the customer in less than a few seconds (which is sometimes all you have). Incidentally, I didn’t come across any CDN webhosts, particularly ones that support ASP.NET, but you have to imagine it is coming from Azure. In our case, bandwidth was going to dominate so that is what we needed to know about.

During my research, I came across Ryan Kearney’s comparison of CDN providers. He gives a great round-up of the price and features of many of the providers, as well as some latency statistics for a handful of international locations. He was kind enough to host a file for my test rig on his Rackspace account, which was much appreciated.

So there are plenty of CDN providers, but very little information available to allow you to compare them. For instance, India and China are two very important markets for us, but what is the bandwidth to them from each of the providers? Clearly we needed to do some measurements.

The Game is Afoot

How do you measure the bandwidth of a host to every country in the world? Well, there are many companies that offer website monitoring and will alert you if your website goes down. Some of these have international monitoring capabilities, and some of them have page download time statistics. However, to get an accurate picture of download speeds you need a fairly sizable file so that the transfer time dominates other factors such as DNS resolution or server latency. Only one web monitoring service actually downloaded the whole file, allowing us to make an accurate estimate of bandwidth. They are WebSitePulse, and I could not have done this analysis without them. They have the most monitoring stations in the world, the most detailed statistics, and a 30-day free trial, which I used for this investigation. I highly recommend them to anyone looking for sophisticated, international website monitoring.

We created a test file called Test1MB.zip, which was a zip file that was truncated to exactly 1MB. A zip file is largely incompressible and the extension stops most servers from trying (actually few offer HTTP compression, which is a serious omission but beyond the scope of this post). This was mounted on multiple hosts and WebSitePulse was configured to download the files periodically. The WebSitePulse trial limits you to 20 monitor stations at a time (and excludes Auckland and Melbourne), and I didn’t have access to all of the hosts from the beginning, so the statistics are not done to laboratory standards. However, the statistical picture that emerges is reliable enough to allow business decisions to be made.
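For anyone who wants to take a similar measurement from their own machine, here is a minimal sketch in Python. It is not how WebSitePulse works internally (I have no knowledge of that); the function names are made up for this post, and it uses only the standard library.

```python
import time
import urllib.request


def throughput_mbit_per_s(size_bytes, seconds):
    """Average throughput in megabits per second for one download."""
    return size_bytes * 8 / seconds / 1_000_000


def timed_download(url, timeout=60):
    """Download the whole body of `url`; return (bytes read, seconds taken).

    The timing includes DNS lookup and connection setup, which is why a
    reasonably large file is needed for the transfer itself to dominate.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        size = len(response.read())
    return size, time.perf_counter() - start


# Worked example with no network access: a 1MB file (taken here as
# 1,048,576 bytes) that arrives in 2.0 seconds.
print(f"{throughput_mbit_per_s(1_048_576, 2.0):.2f} Mbit/s")
```

To time a real host you would call timed_download with the address of the test file on that host and feed the result into throughput_mbit_per_s. A single sample is noisy, so repeat it over days, as the monitoring service did.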

The Runners and Riders

Host               CDN Capable  Notes
RapidSwitch        No           Our current host and representative of good quality hosting in the UK.
Azure CDN          Yes          Still in beta, and we literally started using it the day it opened, so there were teething problems.
Rackspace          Yes          Huge player in hosting and cloud computing.
Amazon CloudFront  Yes          CDN at the front, Amazon’s S3 at the back. Nominally still in beta, but frankly charging for something means you must be judged as a commercial product.
Amazon S3          No           Our S3 hosting is in the US, so this is the standard candle for US-based cloud hosting.
GoGrid CDN         Yes          A high number of international points-of-presence, and more on the way.

Very few of the CDN companies offer free trials for some reason, but I think all are pay-as-you-go, which costs pennies for what we want. It took a bit of back-and-forth to get my GoGrid account set up, but their Twitter guy was great at fixing the problem once I made him aware of it. This meant that there are slightly fewer results for GoGrid. The whole trial ran for the best part of a month with roughly 15-minute poll times for every host. I had to change things around a bit as I went along to stay within the T&Cs of the WebSitePulse trial; you get $1000 to play with in total.

The following locations were monitored: Amsterdam, Bangalore, Beijing (2 monitors), Boston, Brisbane, Buenos Aires, Chicago, Dusseldorf, Guangzhou, Hong Kong, Houston, London, Los Angeles, Miami, Montreal, Mumbai, Munich, New York, Paris, San Francisco, Sao Paulo, Seattle, Shanghai, Singapore, Stockholm, Sydney (2 monitors), Tokyo, Toronto, Trumbull, Vancouver, Washington

The Results

The summary of the results is shown below:

Host               Uptime   Average 1MB DL Time (s)
GoGrid             100.00%  2.03
Rackspace CDN      100.00%  2.70
Amazon CloudFront  100.00%  4.46
Azure CDN           99.52%  4.67
Amazon S3          100.00%  5.04
RapidSwitch         99.98%  7.43

Here are the detailed results for all of the monitoring stations and hosts sorted into average download time order:

City            GoGrid  Rackspace  Amazon CloudFront  Azure CDN  Amazon S3  RapidSwitch  Average
New York          0.12       0.19               0.24       1.00       0.50         1.39     0.57
Boston            0.17       0.24               0.42       1.06       0.54         1.31     0.62
Trumbull          0.16       0.39               0.55       1.38       0.50         1.36     0.72
Washington        0.21       0.36               0.98       1.04       0.34         1.80     0.79
Houston           0.24       0.34               0.30       0.73       1.20         2.20     0.84
Paris             0.23       0.31               0.40       2.39       1.78         0.24     0.89
Dusseldorf        0.20       0.29               0.18       3.08       1.92         0.27     0.99
Amsterdam         0.16       0.15               0.47       2.43       2.70         0.24     1.03
Chicago           0.05       0.19               1.95       1.02       1.60         1.48     1.05
San Francisco     0.30       0.30               0.22       1.58       1.83         2.31     1.09
London            0.15       0.37               0.40       3.68       1.89         0.18     1.11
Vancouver         0.15       0.41               0.23       1.57       1.64         2.69     1.12
Toronto           0.40       0.91               0.48       2.58       1.99         1.67     1.34
Seattle           0.16       0.31               0.21       2.15       1.64         3.66     1.36
Munich            0.63       0.29               0.31       4.17       3.06         0.67     1.52
Miami             0.35       4.06               0.70       1.72       0.83         2.25     1.65
Stockholm         0.71       0.20               0.82       4.48       4.82         0.67     1.95
Los Angeles       0.29       0.39               0.34       2.99       3.81         6.47     2.38
Sao Paulo         2.53       2.77               2.70       2.79       3.49         3.50     2.96
Brisbane          0.45       0.42               3.40       4.26       5.96         6.00     3.42
Tokyo             2.00       1.01               1.40       3.04       4.35         8.84     3.44
Sydney            1.17       1.25               3.19       4.15       5.81         5.78     3.56
Bangalore         3.32       0.76               1.75       4.61       8.33         2.66     3.57
Sydney 2          0.18       4.63               3.74       5.07       7.41         5.29     4.39
Montreal          1.01       1.57               1.46      12.44       2.14         8.40     4.51
Mumbai            4.23       2.96               2.00       4.87      10.40         3.70     4.69
Buenos Aires      5.80       7.03               6.41       7.60       6.25         5.33     6.40
Singapore         3.52       1.27               3.24       9.66       8.67        13.62     6.66
Hong Kong         1.24       1.76               1.24       7.65       9.28        26.92     8.02
Beijing 2         5.72       8.23              11.62       7.95      11.20        17.53    10.38
Guangzhou         5.43      10.99              19.48       8.23      10.30        10.91    10.89
Beijing           8.50      14.33              23.92      13.37      16.29        71.55    24.66
Shanghai         17.32      20.56              52.27      19.37      23.83        24.16    26.25
Average           2.03       2.70               4.46       4.67       5.04         7.43     4.39


Here are the raw stats if you would like to do any further analysis of your own.

Conclusions

Clearly GoGrid and Rackspace are the best providers of the hosts tested. GoGrid has the best average performance and is the fastest host at most of the monitoring stations.

Asia is very badly served by all the hosts tested. Obviously there are dedicated hosting services for Asia, but the whole point of a CDN is that it is global. I expect partnerships are being drafted as I type.

Amazon CloudFront only slightly outperforms plain S3 on average (4.46 seconds versus 5.04 seconds), although its download times are much better for some cities.

Montreal did much worse than I expected given that Canada is so well connected to the US.

Amazon CloudFront and Azure CDN perform about equally well on average, although the uptime of Azure looks bad. Actually, the Azure uptime was only really bad for the first few days; after that it was very good, so it is probably not a fair measure.

Did We Win?

Our original aim was to move 50MB in less than three minutes. Therefore our target time for 1MB is 180 / 50 = 3.6 seconds. Even with the fastest CDN host, we are still failing to meet this target for several cities. For Shanghai, we are a factor of five off. And of course this is before we get from where the monitoring stations are (which is probably a well connected hub) out to users at the network edge.
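The same sum in Python, applied to the slowest GoGrid stations from the detailed results. Scaling the 1MB time linearly up to 50MB is a simplifying assumption on my part, and the variable and function names are purely illustrative.

```python
INSTALLER_MB = 50
TARGET_SECONDS = 3 * 60
TARGET_PER_MB = TARGET_SECONDS / INSTALLER_MB  # 3.6 seconds per MB

# Average 1MB download times (seconds) for GoGrid, copied from the
# detailed results for the slowest stations.
gogrid_seconds_per_mb = {
    "Bangalore": 3.32,
    "Singapore": 3.52,
    "Mumbai": 4.23,
    "Guangzhou": 5.43,
    "Beijing 2": 5.72,
    "Buenos Aires": 5.80,
    "Beijing": 8.50,
    "Shanghai": 17.32,
}


def estimated_install_minutes(seconds_per_mb, installer_mb=INSTALLER_MB):
    """Naive estimate: scale the 1MB time linearly to the full installer."""
    return seconds_per_mb * installer_mb / 60


# Stations where even the fastest host misses the target.
misses = {city: t for city, t in gogrid_seconds_per_mb.items() if t > TARGET_PER_MB}
for city, t in misses.items():
    print(f"{city}: {estimated_install_minutes(t):.1f} min, "
          f"{t / TARGET_PER_MB:.1f}x over target")
```

Bangalore and Singapore scrape under the 3.6 second mark; the other six stations in the list do not.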

So big-iron can help us make significant improvements for very little effort and cost, but the war goes on. I might tell you how we finally win in a future post. Hint: we make the installer smaller.

Google AdSense Fraud

As I mentioned in my last post, we stopped advertising through Google’s Content Network very soon after we started experimenting with AdSense when we thought we had detected significant fraud. Having not spent much money by this time, the amounts involved were relatively small (less than £100), but the fraction was high. We considered at least 1/3 of our clicks to be fraudulent: deliberately, criminally fraudulent.

Proud of our forensic IT skills, we rushed to Do-No-Evil Google to report our discovery. Our report contained the bare minimum of facts; we were so convinced it screamed fraud we did not bother to go into much more detail than the websites and the Click Through Rates (CTRs). After a week we had a polite and considered response that attempted to persuade us that this was not fraud. Clearly more evidence was required, so we put together a verbose description of the main points that alerted us, and sent it off in expectation of an apology if not a seat on the board for our fraud-busting smarts.

Sadly they refused / failed to be convinced and eventually we had to agree to disagree. Obviously we will continue to use Google for advertising – what choice is there for an online business? But we firmly believe they need to put their house in order. If 1/3 of their revenue is fraudulent they will lose consumer confidence and possibly face sanctions for complicity.

Below I have included the email chain. The name of the operative has been removed as the problem is systemic not personal. The key point, and one I should have put front-and-center (in CAPS perhaps?), is under Red Flag 2 – Set 2. Namely that a badly named, unlinked, and parked-domain website achieved ten times the CTR of Google’s own homepage.
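For the record, this is how the ten-fold figure falls out of the numbers quoted in the email. Reading "four people" as four clicks on the Set 2 sites is my interpretation of that sentence, and the email compares against Google's sponsored search results; the helper name is just for illustration.

```python
def ctr(clicks, impressions):
    """Click-through rate as a fraction of impressions."""
    return clicks / impressions


# Figures quoted in the email below. Treating "four people" as four
# clicks on the Set 2 sites is an interpretation, not a logged value.
parked_ctr = ctr(clicks=4, impressions=254)
search_ctr = ctr(clicks=3, impressions=2051)

print(f"Set 2 parked domains: {parked_ctr:.2%}")
print(f"Google sponsored search: {search_ctr:.2%}")
print(f"Ratio: {parked_ctr / search_ctr:.1f}x")
```

With so few clicks on either side the ratio is indicative rather than statistically robust, which is why I would like to hear from others with more data.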

I would very much like to hear from anyone with similar experiences, or from anyone (at Google or otherwise) who disagrees with any of the points we made. Our transaction volume is small and our market is niche, so I am aware we need more data to make a statistically significant conclusion.

Initial Inquiry

[From Rupert to Google on 5/11/09]

Nearly *one third* of all content based CPC placements were obvious click fraud.

Many of the sites have exactly the same content and have a CTR of 100% (or greater in one case!).

I find it incredible that the market leader cannot spot something like this automatically. The fraudsters are not even trying to disguise it.

[Information about the site URLs and the fraud period]

Google’s Response

Hello Rupert,

We have received your request for an invalid clicks investigation. Thank you for your patience while we reviewed your account. I apologize for our delayed response. I understand you are concerned about the quality of clicks you have accrued from certain sites in our content network.

We reviewed your account and can confirm clicks from these sites. However, we found that these clicks are valid, and there is no activity that suggests you have been charged for invalid clicks. The clicks charged fit a pattern of normal user behaviour. As part of our review, the team looked through dozens of data points--including IP addresses, IP blocks, geographic concentrations, network activity, browser patterns, click timings, and any proprietary signals. However, none of those suggest an automated attack, nor collusion from unethical users. The clicks accrued reflect normal user traffic.

Many of the sites that you listed are parked domain sites. A parked domain site is an undeveloped web page belonging to a domain name registrar or domain name holder. Our AdSense for domains programme places targeted AdWords ads on parked domain sites that are part of the Google Network.

Users are brought to parked domain sites when they enter the URL of an undeveloped web page in a browser's address bar.

We've found that AdWords ads displayed on parked domain sites receive clicks from well-qualified leads within the advertisers' markets. In general, we've noticed that the return on investment gained on these pages is equal to or better than that gained on other pages in the search and content networks. However, if you aren't satisfied with the value of the traffic, you can prevent your ads from showing on parked domain sites by using the Site and Category Exclusion tool. Learn how at https://adwords.google.com/support/bin/answer.py?answer=86695&hl=en_GB.

I hope that this information helps address your concern. Please let me assure you that your security is a top priority for Google, and we will continue to monitor all clicks on your ads to prevent abuse. Let us know if you have further questions or if we can be of any more assistance. For more information about steps we take to combat invalid click activity, please visit https://adwords.google.com/support/bin/answer.py?answer=6114&hl=en_US.

Sincerely,

<Google Employee>

The Ad Traffic Quality Team

Rupert’s More Detailed Description

Hi <Google Employee>,

Thank you for getting back to me. I'm afraid I am still doubtful of the validity of these sites. Allow me to illustrate my concerns with some examples:

Red Flag 1

There are 10 sites in the list with a 100% CTR (radiolluvia.com even has 200%):

Domain            Clicks  Impressions  CTR
radiolluvia.com   2       1            200.00%
net-ebooks.com    1       1            100.00%
umtsfree.net      1       1            100.00%
littleabout.com   1       1            100.00%
mtncareer.com     1       1            100.00%
jonefm.com        1       1            100.00%
radiobendele.com  1       1            100.00%
iphalloween.info  1       1            100.00%
pdfee.com         1       1            100.00%
rf-online.com     1       1            100.00%

None of them contain any relevant content (which would be fine - I understand AdSense can never be a science), but most of them consist only of AdSense links. Are you really suggesting that users happen upon a parked domain by typing the URL above into the browser and happen to be in the market for radio planning software (which is what we make)?

We don’t make a mass-market product so I would never expect a high CTR on content networks even when targeting radio engineers – only a minority of them are even actively seeking our type of tool.

Red Flag 2

There are at least three sets of pages that contain practically the same content and come from the same IP subrange.

Set 1: gsmsandwich.com, jonefm.com, keonong.com, mtncareer.com, rf-indo.com, smsgupsup.com. They all look like this:

[Screenshot: Set 1]

Set 2: umtsfree.net, xlgprs.net, xlgprs.com, ir-hot.com. They all look like this:

[Screenshot: Set 2]

I find it unlikely that in the space of one week, four people were using these sites as some sort of search or index portal (have they not heard of Google? :)) and either searched to or browsed to our advert and found it relevant enough to click on. This was after only 254 impressions. Contrast this with Google’s own sponsored search results, which yielded only three hits for 2,051 impressions during the same period. If you really believe these statistics are true, surely you should buy this company immediately because they are 10 times better at advert placement than you are. Perhaps you should consider a smiling coed on the homepage?

Set 3: gamezerm.com, and radiobendele.com both have the same IP and look the same:

[Screenshot: Set 3]

Can it be a coincidence that they have high CTR (50% and 100% respectively)?

In general, all of the sites in these sets have dubious registration details, often using the same registration anonymity service.

Red Flag 3

The click-through path is curious for the links from these sites. For instance, the umtsfree.net site links go through five 302 redirects before landing at the intended target. Here is the chain of URLs:

http://umtsfree.net/forward302.aspx?epc=eWpDPeDkCn7%2fPAWjJDHHizukuSQO4Z0sDU8KfdC9FQ%2b5yWjWwcxv5hXcA5nQpS0OqiEn07sYHTHFe%2fX9vFjDwUcSW01%2bS4WPEL0m7%2fX3z100tRxVe1Mg2zzaXK862vPp7hIJBvAoVV9DPRmnuG%2fkV0w5tbowrxB4AbcTtxa0Bsr%2fCztN7vTUOE0hGYneCC9V5jEY3PRhY5SAeWBCuCp7NzUBODKuSrYrmWbY4g3PHs9mBH08pqUSaY75VuOBggtVC6D5WjIQEZuFNJS10GQ6Bu%2f0JpRTb3xpAWZf4bPOguFyT3zwx6udcQe031GVCTob%2bAk5n3HzuAg2AOTMKncWxG%2bPl6vLUW4DWYQil2ZmY2ILRGYWgOHHAfIlNM1AHowYkUvb%2bBrYHbEQgD6PTID5%2fuaj3OsxHAwLVlhrL3uuu1S3zI4g9mOUab2fnM8yr%2brQmzu2a6UmflkA8s6PaAElxBNg9kZnshsusDIleugD02G6c%2bCTybRaQ0D1IYTOcfyeJLFDejgK2GqObGWs9Nm6J32886U0STHIAz75%2f%2b2snbAtAJQT48cwhAH%2fNQ%2fpaJiHkSON1cIxd7oFroekPJ8iyDhbYZ3VP1TJ0Z7HoHj6HKjeDaemj6LFb1Le0uwGKeLe%2bKc6LxdhYBtjD%2bXnGi0LkIajkmbqWe73rvyLNoRhtd1SsqgDT7wxMUBmPIaisppXnfp%2b8%2ftjiL8R9LU6QFh%2f8%2b4aTQGsrMOTyj55aZ4hx8l3UJck6utVoeVax7%2bOQACwyLYoyyvI4ml9DVsz%2f9Mh4WnmfFdVgPEqIxmJ%2bwnNIlKzX31GePRuLmgLHdeItkfnMPnUWIA8FB485RmGfrEwbF07d7v5JLcYuL4V62CKyW8dl3m89EvsbxcbOTOQN85CQsa5fKdgUZaZ2j1kFgi7Oj9J3MJ10oCQ5OjkZnoBWXZPFsfGxQTxg%2fGxh9k18jep%2b5sesYVB1jRYuej3ptGZBGivoDvEkFR%2fpxCbqB494irMYSWLmmx8c%2frOVZYeIe3XV9P7cBJ4da%2bcrLgJneN2nhKCOX0BDZsw%2bR1L93vZ6LgjgvFgolOFTVpkeF12ecpQWJg5jzm0AnoUhGdj%2fXzFJoJbgaxLnvBsFGql9%2f%2bYyJqMr8URZuovttYKmDemHYS0

http://rc12.overture.com/d/sr/?xargs=15KPjg141SnJamwr%2DocLXBROWAylwaxca58cluD5l4GtZf5iMxXOV4aaTCm8dxTOVxv1PdzPSW%5FqYSL%5FT5kPOJGweKQVWJGuXpjdLJxYw6Nq2jUNEbsYRzy%2DLvmIZGOX0E2laEOd%2D5mO7acZdRD05mjddAwByR%2D%5Flqw8yzxu4IQevVig0sskqFc5Z17tQp9bnAXOx7TLome97vhXfFfZwQ%2D%2DxDke%2DgSygTLyyj4WYa9VeHJi58obDIYo0L3ZbKzoLLOKeswIYJfRXG%2DYe62VuOrU6t8txuN2zT3r4MzgFZJP%5F%2DIlWJ3Ulvvv%2DbgfDfP4074wP1CfzqVHz3dxM5PXU3E5OufGXnbWw99E%5FOfpRQIMSv2xOO

http://clickserve.uk.dartsearch.net/link/click?lid=43000000042332928&ds_s_kwgid=58000000000470369&ds_e_adid=8770229031&ds_e_matchtype=standard&ds_e_kwdid=86608522531&ds_e_kwgid=5777302919&ds_url_v=2

http://ad-emea.doubleclick.net/clk;160208746;22377034;b;u=ds&sv1=42332928&sv2=2009111264&sv3=84754;%3fhttp://www.marshallward.co.uk/?aff=yahoo?&affsrc=acquisition&cm_mmc=yahoo-_-Generic-_-Generic-Catalogue-Keywords---Broad-_-catalogs

http://fls.doubleclick.net/act;sit=530730;spot=1529997;~dc_rdr=?http%3A//www.marshallward.co.uk/%3Faff%3Dyahoo%3F%26affsrc%3Dacquisition%26cm_mmc%3Dyahoo-_-Generic-_-Generic-Catalogue-Keywords---Broad-_-catalogs

http://www.marshallward.co.uk/?aff=yahoo?&affsrc=acquisition&cm_mmc=yahoo-_-Generic-_-Generic-Catalogue-Keywords---Broad-_-catalogs

This would allow the site operator to monitor clicks, which is a legitimate thing to do. It is also something you would need if you wanted to monitor and reward agents clicking the links for you. I appreciate that this in itself is not a “smoking gun”.

Conclusion

I suggest that at least the sites in these sets are fraudulent. The fact that they evaded the data mining checks you mention suggests to me that they were generated by a network of geographically distributed agents. These might be humans, encouraged by a share of the AdSense revenue or they might be an autonomous bot-net with a smart pattern of click behaviour (chaotic perhaps). Frankly the click-pattern doesn’t seem that smart to me, so I think there is another type of fraudster out there that is much smarter and rarely hits the same advertiser twice, this would prevent detection by the advertiser, and would only be detectable by Google themselves.

Let me be clear that I have no reason to doubt the validity of sponsored search or Gmail content adverts and I continue to use these networks, but I have serious doubts about the public AdSense network. I also do not argue that it is still good value for money (even if 1/3 of it is click-fraud), but nobody likes to be ripped off and it is in both our interests to fix it.

What do we expect? We are not looking for refund on these clicks (frankly I have wasted more money typing this email), instead we would like Google to consider my arguments, and if they agree, to improve their detection process. Excluding these sites is a poor option as I will have to spend time every day weeding out spammers from our content network placements. I appreciate that there will always be a number of frauds that are impossible to detect algorithmically, and that you are locked into an arms race with the fraudsters, but there seems to be more you could be doing to improve automatic detection.

Best regards,

Rupert Rawnsley.

Google’s Response

Hello Rupert,

Thank you for your reply and for providing us with the additional information. We appreciate your patience as we work to resolve this issue.

I can confirm the information you have mentioned. Many of the sites in question seem to have the same templates and show only AdSense ads.

However, the clicks that you accrued from these sites are valid. As mentioned in our previous email, these sites are a part of the domain park network. The sites in question do not have any specific content, but are simply "parked" for interested users to purchase the site from the domain hosting company. Also, domain parked sites can be former functioning websites whose domain name contracts have expired. Since these sites are largely created for temporary purposes, the template used may be the same across several websites. This is the reason you may find the same images or the same layout across several of these sites.

Once an interested user purchases the site or renews the domain name registration, the site is automatically removed from the domain parked network by the hosting company. Parked domain sites offer users ads that are relevant to the text they entered. In addition, some parked domain sites include a search box, which allows users to further refine their search. We've found that AdWords ads displayed on parked domain sites receive clicks from well-qualified leads within the advertisers' markets.

In general, we've noticed that the return on investment gained on these pages is equal to or better than that gained on other pages in the search and content networks.

We are sorry to learn that you are disappointed with the quality of clicks you accrued from these sites. Please be assured that the clicks are valid.

We strongly suggest you consider using the site-and-category exclusion tool to prevent your ads from showing on the domain park network.

Thank you for your patience and understanding.

Sincerely,

<Google Employee>

The Ad Traffic Quality Team