
breaking the web: nytimes HTTP redirects

I’m writing a small script that will pick up interesting links from my twitter feed. Of course, URLs on twitter are shortened using multiple url-shorteners, so I use the relatively low-cost HTTP HEAD requests to follow the redirects and end up with the final URL.
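The idea, roughly, looks like this (a Python sketch using the requests library; the names here are just illustrative, not my actual code):

```python
# Rough sketch: expand a shortened URL by following redirects with HEAD
# requests, with a cap so a misbehaving site can't keep us busy forever.
# Requires the third-party "requests" library; names are illustrative.
from urllib.parse import urljoin

import requests

REDIRECT_CODES = (301, 302, 303, 307, 308)

def expand_url(url, max_redirects=10):
    for _ in range(max_redirects):
        resp = requests.head(url, allow_redirects=False, timeout=10)
        location = resp.headers.get("Location")
        if resp.status_code in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL
            url = urljoin(url, location)
        else:
            return url  # no further redirect: this is the final URL
    raise RuntimeError("gave up after %d redirects: %s" % (max_redirects, url))
```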

It should work. And it actually does, in most cases.

Enter NYTimes.com. Just try to follow the redirects of something like http://t.co/u1FuLkrY without a browser (e.g. using HEAD, curl or wget) and you will see what I mean. I counted 12 (twelve!) HTTP requests before getting to the actual content (I used “curl -I -L http://t.co/u1FuLkrY”). I even tried http://longurl.org/ with no luck. (BTW I have the impression that you will end up with the actual URL only if you have cookies enabled, but I haven’t checked that.)
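If you want to watch the chain hop by hop without a browser, something along these lines does the trick (same assumptions as the sketch above: Python, the requests library, illustrative names):

```python
# Print every hop in the redirect chain, roughly what "curl -I -L" shows.
from urllib.parse import urljoin

import requests

def trace_redirects(url, max_redirects=20):
    for _ in range(max_redirects):
        resp = requests.head(url, allow_redirects=False, timeout=10)
        print(resp.status_code, url)
        location = resp.headers.get("Location")
        if resp.status_code not in (301, 302, 303, 307, 308) or not location:
            break  # no redirect (or no Location header): stop here
        url = urljoin(url, location)

trace_redirects("http://t.co/u1FuLkrY")
```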

Ten or more HTTP redirects (or even 5) is way too high a cost in resources (connections, CPU time, idle time) for any service to pay just to expand a single link. This is the kind of practice that actually breaks the web.

Back to my script: I’m considering adding a notification that links to NYTimes.com will not show up, because expanding them is practically impossible.