Finally, a use for humanity!

I recently talked to an engineer friend at Facebook who was thinking out loud about the biggest problem facing the company - in his eyes, at least - which is the formidable growth of the site’s analytical processing requirements.

There are plenty of well-documented ways for websites to save cycles, such as reducing requests, adding preload and postload components, and caching, caching, caching.

But we can be sure Facebook is already employing all of these. They’re not taking on $100 million in debt just so they can serve more AJAX. They’re doing it so that, as the size and complexity of the social graph grows, they can confront the ballooning cost and complexity of analytical processing procedures, which have increased by an order of magnitude since this time last year.

In addition to the necessary but costly brute force approach of throwing hardware at the problem, they’ve also got some brilliant people working to establish more efficient processing techniques. The most exciting of these is the use of scalable hash ripple joins to establish confidence intervals.

Confidence intervals can be incredibly useful for saving cycles. For instance, calculating a user’s “eigenvector number”, or level of influence, to 99.1% accuracy takes half as long as calculating it to 99.9% accuracy. By establishing confidence intervals, it becomes trivial to rely on ‘good enough’ results.
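To make the ‘good enough’ idea concrete, here is a minimal Python sketch. It is not the hash ripple join technique mentioned above, just a plain sampling loop that stops as soon as its confidence interval is narrow enough. The function name, the parameters, and the normal approximation (z = 2.576 for roughly 99% confidence) are all my own assumptions for illustration.

```python
import math
import random


def estimate_mean(sample_fn, half_width, z=2.576, min_samples=30,
                  max_samples=1_000_000):
    """Sample until the confidence interval is within +/- half_width.

    Hypothetical helper: sample_fn returns one observation per call, and
    z=2.576 assumes a normal approximation at roughly 99% confidence.
    Returns (estimated_mean, samples_used).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    while n < max_samples:
        x = sample_fn()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            variance = m2 / (n - 1)
            std_err = math.sqrt(variance / n)
            if z * std_err <= half_width:
                break  # interval is tight enough, stop paying for samples
    return mean, n


if __name__ == "__main__":
    rng = random.Random(0)
    noisy_score = lambda: rng.gauss(10.0, 2.0)  # stand-in for an expensive measurement
    print(estimate_mean(noisy_score, half_width=0.5))  # looser answer, fewer samples
    print(estimate_mean(noisy_score, half_width=0.1))  # tighter answer, many more samples
```

The point of the sketch is the trade-off: the interval narrows with the square root of the sample count, so each extra bit of precision costs disproportionately more work than the last.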


This type of ingenuity will be necessary to implement in an increasing number of scenarios, especially as more startups base their business models around processing large amounts of data. I was recently talking to the founder of an exciting startup, working on a product related to data modeling. However, while he showed me the demo, it occurred to me that if the product is half as successful as it could conceivably be, they’re going to be looking at some serious processing challenges.

Whoever is designing their architecture should have a poster of the Twitter logo suspended above their desk. It would serve as an important reminder that 503 errors aren’t just disastrous for users. Even with initial success, such a problem just asks to be exploited by the competition.

Yes, cloud data storage is supposed to solve precisely these types of problems. It will definitely prevent scaling problems, and to some extent, it’ll save expenses. But there’s another solution, one that is infinitely scalable and virtually cost-free.




I particularly like the part of this Google tech talk where human computation enthusiast Luis von Ahn mentions the major flaw in the Matrix plot. Why are machines using humans for power, when it would be much more useful to employ humans as mechanical turks? In that scenario, the Matrix could just be a useful way to gather vast amounts of data for use by machine overlords. Because it’s not too far off from reality, it would have made for a much better Planet of the Apes-style ending to the series. I can just imagine Neo shaking his fist, yelling “damn you, Google! Damn you all!”

Human computation is especially useful in cases of on-the-fly processing where a few possible results can be calculated with high confidence. These usually take the form of an auto-suggest mechanism. Semantic search engines such as Powerset take natural language searches and convert them into data queries. Or try to, at least. That last 0.02% of accuracy is expensive, and not necessarily all that accurate. It makes much more sense to return possible query values with their natural language equivalents, and prompt the user to make a selection.
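Here is a rough Python sketch of that hand-off. It has nothing to do with how Powerset actually works. The Interpretation class, the thresholds, and the sample queries are all made up for illustration. The machine answers directly when it is confident, and otherwise hands a short list of plain-English options to the human.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    """One candidate reading of a natural language search (hypothetical type)."""
    query: dict        # the structured query the machine would run
    paraphrase: str    # the same query restated in natural language
    confidence: float  # the machine's own estimate, between 0 and 1


def options_for_user(candidates, auto_threshold=0.95, max_options=3):
    """Return one result if the machine is sure, else a short list to pick from.

    auto_threshold and max_options are illustrative defaults, not tuned values.
    """
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= auto_threshold:
        return ranked[:1]        # confident enough: skip the human step
    return ranked[:max_options]  # otherwise let the user finish the job


if __name__ == "__main__":
    candidates = [
        Interpretation({"type": "film", "director": "Ridley Scott"},
                       "Films directed by Ridley Scott", 0.55),
        Interpretation({"type": "film", "actor": "Ridley Scott"},
                       "Films starring Ridley Scott", 0.30),
        Interpretation({"type": "book", "subject": "Ridley Scott"},
                       "Books about Ridley Scott", 0.10),
    ]
    for number, option in enumerate(options_for_user(candidates), start=1):
        print(number, option.paraphrase)
```

The selection the user makes is the cheap part of the computation. The machine only has to be good enough to get the right answer onto a short list.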


Human computation provides one success story after another. Luis von Ahn’s own reCaptcha successfully processes 30 million images a day. A more recent example is the protein folding game that may earn a Nobel Prize for its high score. The game evolved from earlier projects that resembled the SETI@home method of outsourcing cycles to computers, but its creators found much more success by relying on the fuzzy logic known as human intuition.

And a much more recognizable, ubiquitous success story of human computation is that of tagging - the collective human intelligence approach to analyzing literal, expressive data, such as photographs.


Aside from saving cycles, there’s another benefit of building human computation into on-the-fly processing procedures. It eases the learning curve developers may face when forming dynamic queries. After all, no platform developer wants to read a freaking book to get their grubby hands on some associative arrays.

Being used to the Facebook Platform, where it was virtually impossible to screw anything up using query language conventions, I was surprised at how often my dynamic query attempts against Freebase returned errors. That’s because each domain in Freebase has its own schema, which means that dynamic query templates themselves need to be dynamically formed after retrieving the schema being used.

Freebase is quickly getting more accessible to developers, as libraries - like auto suggest - begin to emerge. And the release of Acre will allow a whole new population of developers to get involved without learning the details of the Freebase architecture. But a great extension to their API would be not just to return errors, but to return possible results that could be compiled into The user would be glad to participate, as it provides them with a meaningful role. Which, last time I checked, is at the top of the list over at Stuff Human People Like.
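As a rough illustration of returning possibilities instead of bare errors, here is a small Python sketch. It does not use the Freebase API at all. The schema dictionary and the check_query function are hypothetical stand-ins, and the fuzzy matching comes from difflib in the Python standard library.

```python
import difflib

# Hypothetical schema lookup: type id -> property names valid for that type.
# A real client would have to retrieve this from the service first.
SCHEMA = {
    "/film/film": ["name", "directed_by", "initial_release_date", "starring"],
    "/music/artist": ["name", "genre", "album", "origin"],
}


def check_query(type_id, requested_properties):
    """Validate a query against SCHEMA, suggesting fixes rather than just failing.

    Returns {"ok": bool, "suggestions": {bad_name: [close matches]}}.
    """
    known = SCHEMA.get(type_id)
    if known is None:
        close = difflib.get_close_matches(type_id, list(SCHEMA), n=3)
        return {"ok": False, "suggestions": {type_id: close}}
    suggestions = {}
    for prop in requested_properties:
        if prop not in known:
            # Offer the nearest valid names so a human can pick the right one
            suggestions[prop] = difflib.get_close_matches(prop, known, n=3)
    return {"ok": not suggestions, "suggestions": suggestions}


if __name__ == "__main__":
    print(check_query("/film/film", ["name", "directer_by"]))
```

A response shaped like this could feed an auto-suggest list directly, which hands the last step of the query back to the person who knows what they meant.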

I await more good examples of giving humans meaningful roles in the processing chain, whether it’s to win a Nobel Prize, or simply to select a search option. Let’s not be afraid to give them the extra work. From what I hear, they could use it.

This was posted 3 years ago.