PullQuotes: a game for indexing news quotes of public interest

#Some early statistics
Blue = correct responses. Red = incorrect. AP quotes are generally more “characteristic” of their speaker, and New York Times quotes from John McCain are more difficult to identify. Also, this game obviously needs more difficulty levels.
-> Play PullQuotes Preview Release
#Introduction
I first whetted my appetite for Freebase platform development with a couple of simple toy apps. Taught or Not tests knowledge of the historical influence graph, and Shot or Not tests knowledge of how historical figures have died.
Both toy applications were simple to make because I was querying existing ontologies. Here’s an example query from the “Shot or Not” application:
mjt.freebase.MqlRead({
  type: "/people/deceased_person",
  name: null,
  id: "{{ subject.id }}",
  cause_of_death: [],
  limit: 1,
  key: [{
    namespace: "/wikipedia/en_id",
    value: null
  }]
})
I downloaded the set of deceased people to a local datastore, but the id value can also be set to null to get the full list sent in return, or it can be taken from another query or data source.
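The two variants just described (one known id versus a null id for the full list) can be sketched as a small helper. This is only an illustration: `buildDeceasedQuery` is a hypothetical name, not part of mjt or Freebase, and the array wrapping reflects my understanding that MQL expects a query that may match many objects to be written inside `[ ]`.

```javascript
// Hypothetical helper (not part of mjt or Freebase): builds the MQL query
// object used above, either for one known person or for the full list.
function buildDeceasedQuery(subjectId) {
  const query = {
    type: "/people/deceased_person",
    name: null,
    id: subjectId,
    cause_of_death: [],
    key: [{ namespace: "/wikipedia/en_id", value: null }]
  };
  if (subjectId === null) {
    // Assumption: a query that can match many objects is wrapped in an array.
    return [query];
  }
  // A specific id should match a single object, so cap the result at one.
  query.limit = 1;
  return query;
}
```

The returned object could then be handed to `mjt.freebase.MqlRead(...)` in place of the literal query shown earlier.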
This is clearly easy stuff, as I built my toy apps around the available data on Freebase that readily lends itself to applications. It’s no coincidence that the developers section of the site is full of apps that take advantage of the well-stocked ontologies, like movies, music, and geography.
If you want to get a more precise feel for what data makes for good queries, you can play around with the Freebase query editor, recently revamped with some useful new features. Better yet, try going to Powerset and asking it about something. Ask how many people live in Paris. Or when Elton John was born:

You’ll quickly find that numerical data lends itself well to queries, as do the aforementioned domains. New domains are sprouting up all the time, driven by both top-down approaches like crawling and bottom-up approaches like the popular microformats hCard and hCalendar.
#A Literal Problem
But quotations are trickier. For instance, it wouldn’t be hard to scrape newspaper sites for quotations spoken by people running for public office, and then upload those quotations with links to their speakers, which would be useful for a variety of sunlight applications.
Even better than simply linking to the speaker would be linking to relevant topics as well, so that you could query for a list of all quotes by people running for public office that are about healthcare, or gay rights, or major league baseball.
But try running the following Al Gore quote through a semantic tagging service:
“A catastrophic storm hit Bangladesh. The year before, the strongest cyclone in more than 50 years hit China … We’re seeing the consequences that scientists have long predicted might be associated with continual global warming.”
You get back country entities for China and Bangladesh, and a natural disaster entity for “strongest cyclone”. But what about global warming? Or the environmental movement? Or tropical storms?
For the near future, humans will still be far better than any machine service at making leaps of judgement about what a human thought is really about. Machines are great with keywords, and humans are great with ideas.
This is my hypothesis, and I’d like to back it up with some quantitative data. That’s why I’m asking people playing the game to mark up quotes with Freebase types. I can then compare those to what I get back from automated services. The human computation approach of this experiment is inspired largely by the work of Luis von Ahn, who led the development of the reCAPTCHA project and most recently launched some fun tagging games at www.gwap.com.
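One rough way to make that comparison quantitative is to measure the overlap between the two tag sets for each quote. The sketch below is an assumption about how it could be done, not how the game does it: `compareTags` is a hypothetical function, tags are treated as plain strings rather than Freebase types, and Jaccard similarity is just one possible overlap score.

```javascript
// Hypothetical comparison of player-supplied tags with tags returned by an
// automated tagging service. Tags are compared case-insensitively.
function compareTags(humanTags, machineTags) {
  const norm = (tag) => tag.trim().toLowerCase();
  const human = new Set(humanTags.map(norm));
  const machine = new Set(machineTags.map(norm));

  const shared = [...human].filter((tag) => machine.has(tag));
  const humanOnly = [...human].filter((tag) => !machine.has(tag));
  const machineOnly = [...machine].filter((tag) => !human.has(tag));

  // Jaccard similarity: size of the intersection over size of the union.
  const unionSize = human.size + machine.size - shared.length;
  const jaccard = unionSize === 0 ? 0 : shared.length / unionSize;

  return { shared, humanOnly, machineOnly, jaccard };
}
```

For the Al Gore quote, calling it with the machine entities mentioned above and a player's tags such as "global warming" would put the topics the service missed into `humanOnly`.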
#Instructions
PullQuotes is simple:
- Quotes are loaded, stripped of any identifying information.
- Choose whether you think it’s McCain or Obama. (Apparently, Hillary is still in the race, but taking out gendered personal pronouns made a lot of the quotes more difficult to read.)
- You’ll then be prompted to optionally add topical tags to the quote, and the link will activate so that you can check out the originating article. You can always see how you’re doing from the scores tab.
- Google login required to track your score. No, I can’t read your e-mail.
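The first step, stripping identifying information, could look something like the sketch below. This is a guess at the approach rather than the game's actual code: `redactQuote` and the `[candidate]` placeholder are hypothetical, and it only hides names, not the gendered pronouns mentioned above.

```javascript
// Hypothetical redaction step: replace each speaker name in a quote with a
// neutral placeholder before showing it to the player.
function redactQuote(text, names, placeholder = "[candidate]") {
  return names.reduce((out, name) => {
    // Escape regex metacharacters so names are matched literally.
    const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Match whole words only, ignoring case.
    const pattern = new RegExp("\\b" + escaped + "\\b", "gi");
    return out.replace(pattern, placeholder);
  }, text);
}
```

Calling `redactQuote(quoteText, ["McCain", "Obama"])` would hide both names while leaving the rest of the quote untouched.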
After I see how well it’s working, and how well it scales on Google App Engine, I’ll flesh out the data inputs, and perhaps add features like a leaderboard. The McCain/Obama choice is fun for a couple of rounds, but it’s really just a “Hello, world”-type implementation.
I’m also getting some great data back. Although the sample size is limited to the people I’ve personally invited to try out the app, I’m already seeing some interesting results, like the inexplicable difference in how well people identify quotes from the New York Times vs. the AP.
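A per-source comparison like that one comes down to tallying correct guesses by source. Here is a minimal sketch, assuming each recorded guess is an object with hypothetical `source` and `correct` fields; the real app's data model may differ.

```javascript
// Hypothetical tally: fraction of correct guesses for each news source.
// Each response is assumed to look like { source: "AP", correct: true }.
function accuracyBySource(responses) {
  const totals = new Map();
  for (const { source, correct } of responses) {
    if (!totals.has(source)) {
      totals.set(source, { correct: 0, total: 0 });
    }
    const tally = totals.get(source);
    tally.total += 1;
    if (correct) tally.correct += 1;
  }
  const accuracy = {};
  for (const [source, tally] of totals) {
    accuracy[source] = tally.correct / tally.total;
  }
  return accuracy;
}
```

The result maps each source name to a number between 0 and 1, which is enough to draw a blue/red chart per source.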
As more people play with the toy, I may post other interesting statistics I find. Comments or suggestions are welcome, and if you tag up quotes, know that it’s going toward a good, useful cause.
