breaking the web: nytimes HTTP redirects
I’m writing a small script that will pick up interesting links from my twitter feed. Of course, URLs on twitter are shortened using multiple url-shorteners. So I use relatively low-cost HTTP HEAD requests to follow the redirects and end up with the final URL.
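For what it’s worth, the hop-by-hop expansion logic is roughly this (a minimal Python sketch, not my actual script; the expand_url name and the max_hops cutoff are just illustrative):

```python
import requests
from urllib.parse import urljoin

def expand_url(url, max_hops=10):
    """Follow redirects hop by hop with HEAD requests only,
    returning the final URL, or None if we give up."""
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False, timeout=5)
        location = resp.headers.get("Location")
        if resp.status_code in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url
    return None  # too many redirects for my taste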
It should work. And it actually does, in most cases.
Enter NYTimes.com. Just try to follow the redirects of something like http://t.co/u1FuLkrY without a browser (e.g. using HEAD requests, curl, or wget) and you will see what I mean. I counted 12 (twelve!) HTTP requests before getting to the actual content (I used “curl -I -L http://t.co/u1FuLkrY”). I even tried http://longurl.org/ with no luck. (BTW I have the impression that you will end up with the actual URL only if you have cookies enabled, but I haven’t checked that.)
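If you want to reproduce the count (or test the cookie theory) without curl, something like this should do, again assuming the Python requests library; a Session keeps cookies between hops, and the t.co link is the one from above:

```python
import requests

# GET the shortened link while keeping cookies across hops (a Session
# stores them), then report how many redirects were followed and where
# we finally landed.
with requests.Session() as session:
    resp = session.get("http://t.co/u1FuLkrY", timeout=10)
    print(len(resp.history), "redirects before reaching", resp.url)
```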
Ten or more HTTP redirects (or even five) is way too high a cost in resources (connections, CPU time, idle time) for any service to pay just to expand a single link. This is the kind of practice that actually breaks the web.
Back to my script: I’m considering adding a notification that links to NYTimes.com will not show up, because it’s practically impossible to expand them.