The Great CS Brain Teaser of 2009

“The Rise of URL Shorteners”

A common gripe among CS students is that the problems they encounter in class aren’t relevant to the skills they’ll actually need to succeed. While many programs have done an admirable job keeping up with the rise of rapid prototyping, SaaS, and cloud computing, undergraduate-level curricula typically still leave much to be desired in the way of modern programming skills. As Jeff Atwood likes to say, there’s no substitute for learning on the battlefield.

It recently occurred to me that there is a fantastic CS problem for today’s students pulled directly from the recent discussion regarding the meteoric rise of URL shortening services.

Here’s the problem:

Let’s assume we’re studying link propagation in a small group of dense, real-time networks. For a variety of reasons, URL shortening services have exploded in popularity, so that a majority of the links being shared in these networks are obfuscated behind an address such as http://bit.ly/l3PPJ.
The naive solution is to map long URLs to all of the short URLs associated with them by using the APIs offered by the most popular URL shortening services.
But even when these APIs are available, there’s a complication. The default behavior of some of these services is to create a new short URL every time a request for a given long URL is sent, even when the requests are only seconds apart and come from the same IP address. This means that for some URL shortening services, there could be dozens or hundreds of short URLs for a single link.
What would be the best way for a developer to confront this challenge to studying link propagation?

As you may have guessed, this isn’t such a hypothetical scenario. I’m currently working on building intelligence around link propagation. My naive approach has resulted in building an open source meta-shortening service on App Engine (see my last blog post for more info), but it’s quickly become clear that there are inconsistencies beyond the obvious obfuscation issues that have been discussed.

Two solutions come to mind. The first is for the default behavior of a service to be returning one short URL that will always be mapped to a given long URL. An argument can always be used to request a special, unique URL for analytics purposes.
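To make the first suggestion concrete, here is a minimal in-memory sketch of a shortener whose default is idempotent. Everything in it is hypothetical: the `Shortener` class, the `unique` argument, and the counter-plus-base-62 scheme for generating codes are illustrative assumptions, not how bit.ly or any other real service works.

```python
import itertools
import string

# 62-character alphabet for short codes (an assumption, not any service's real scheme).
ALPHABET = string.digits + string.ascii_letters


def encode_base62(n):
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class Shortener:
    """Hypothetical shortener: one default code per long URL, unique codes on request."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._default = {}   # long URL -> its single default short code
        self._targets = {}   # short code -> long URL

    def shorten(self, long_url, unique=False):
        # Default behavior: repeated requests for the same long URL get the same code.
        if not unique and long_url in self._default:
            return self._default[long_url]
        code = encode_base62(next(self._counter))
        self._targets[code] = long_url
        if not unique:
            self._default[long_url] = code
        return code

    def expand(self, code):
        """Return the long URL a short code points at."""
        return self._targets[code]
```

Calling `shorten` twice with the same long URL returns the same code, while `shorten(url, unique=True)` always mints a fresh one that still expands to the same target.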

The second solution is for all of these services to offer more complete information, such as a reverse-lookup for a given long URL, ideally including URLs built with slightly different GET params.
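A reverse lookup that tolerates "slightly different GET params" needs some notion of when two long URLs count as the same link. The sketch below is one possible approach, not something any service is known to offer: it lowercases the scheme and host, sorts the query parameters, drops the fragment, and discards a set of ignorable parameters. The `canonicalize` function and the `IGNORED_PARAMS` list are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters treated as noise when comparing links (an illustrative assumption).
IGNORED_PARAMS = frozenset(
    ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"]
)


def canonicalize(url, ignored_params=IGNORED_PARAMS):
    """Reduce a long URL to a comparable form for reverse lookups."""
    parts = urlsplit(url)
    # Keep only meaningful parameters, in a stable (sorted) order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    )
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

Two long URLs that canonicalize to the same string can then share one entry in a reverse-lookup index. Which parameters are safe to ignore depends on the site, so the list above is only a starting point.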

And regardless of these suggestions, the most important outcome is that conventions can be agreed upon, so that even the most basic research won’t be hampered by a byzantine system of compatibility woes.

As for now, I’m just going to fetch each and every URL to check for its location and status headers. It may be wasteful and inelegant, but at least it works.
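The brute-force approach can be sketched roughly as follows, using only the Python standard library. The function names `resolve_once` and `group_by_target` are made up for this example, and the code assumes each shortener answers a HEAD request with a redirect status and a `Location` header, which not every service is guaranteed to do.

```python
import http.client
from collections import defaultdict
from urllib.parse import urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def resolve_once(short_url, timeout=10):
    """Send one HEAD request and return (status, Location) without following redirects."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def group_by_target(short_urls, resolver=resolve_once):
    """Map each redirect target to the set of short URLs that point at it."""
    groups = defaultdict(set)
    for short_url in short_urls:
        status, location = resolver(short_url)
        # Skip anything that is not a redirect or lacks a Location header.
        if status in REDIRECT_STATUSES and location:
            groups[location].add(short_url)
    return dict(groups)
```

The `resolver` argument exists so the grouping logic can be exercised with a stub instead of live network calls. It costs one request per short URL, which is exactly the waste described above.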
