Your Data Science Portfolio: Be An Open Data Curator

In the drive towards the semantic web, mailing lists are ripe, low-hanging fruit. They are full of wisdom that is totally inaccessible to the casual user. To unlock this wealth of knowledge for our apps, we need it in a format like the Stack Exchange data dump.

This data dump format is to Stack Exchange what JSON is to JavaScript: an exchange format that spawns growth in ecosystems far beyond the context it was born in. Virtually every language now supports JSON, and in the future every web service will be able to consume these data dumps - just like our Wikipedia widgets do today.

                JSON : JavaScript
stackexchange.com.7z : Stack Exchange

The Stack Exchange data dump format isn't exactly appropriate for a mailing list, but it's a great starting point - both for mailing lists 3.0 and your data science career. Bringing the gems of wisdom from the old archives into this machine-readable format is called data curation. Basically, it means someone has to read the mailing list, categorize and tag each message, and export it all into a format like the ones found on archive.org. It's an exercise in ETL with data cleaning. In buzzword speak:

Mailing Lists 3.0 = Web 2.5 APIs + data_dump.7z + crowdsourcing

Data curation isn't glamorous and hardly anybody will notice or care about the hard work you've done. It's tedious work to clean, format and verify data against a published schema. It's also a basic competency that employers expect: if you can't do it, they don't need you. The longer you babysit an archive, the better it looks in your portfolio. These formats are a work in progress, so you will have to actively maintain an archive even if the mailing list is dead. With thousands of mailing lists in the public domain, there's a LOT of data to be curated.

Given that the new digital "generation C" are so called for their habits of Creation, Curation, Connection and Community, your open data curation project is merely keeping up with the times to extend your shelf-life and delay the day the kids replace you.

Instructions for Extract Clean Transform Load (ECTL):

STEP 0: Choose your target
Look for a small, quiet mailing list used by specialists in a profession that complements your CV. Steer clear of for-profit services like Craigslist and Google that may feel threatened by data aggregators.

STEP 1: Extract
Write a spider to crawl the archive and get the title, message, timestamp, and sender of each post. Read my previous post for resources to learn how.
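
If you've never written one, a minimal sketch in Python might look like the following. The archive URL is made up, and the tag lookups assume a Mailman/pipermail-style archive (subject in h1, sender in b, date in i, body in pre), so adjust the selectors for whatever your list actually serves:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    ARCHIVE_URL = "https://example.org/pipermail/somelist/"  # hypothetical archive root

    def scrape_month(month_url):
        """Yield a dict per post from one monthly index page."""
        index = BeautifulSoup(requests.get(month_url).text, "html.parser")
        for link in index.find_all("a"):
            href = link.get("href", "")
            if not href.endswith(".html"):
                continue
            page = BeautifulSoup(requests.get(urljoin(month_url, href)).text,
                                 "html.parser")
            yield {
                # pipermail-style markup; verify these tags on your archive
                "title": page.h1.get_text(strip=True),
                "sender": page.b.get_text(strip=True),
                "timestamp": page.i.get_text(strip=True),
                "message": page.pre.get_text(),
            }

    posts = list(scrape_month(ARCHIVE_URL + "2014-January/"))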

STEP 2: Store
Use SQL, Mongo, or whatever you are most comfortable with. I chose Elasticsearch for text search so I could do some basic word counts and figure out which tags to use. A Bitnami image had me up and running in no time.
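
As a rough sketch of the Elasticsearch route (assuming a local node and the pre-8.x Python client; the index name and aggregation field are my own guesses, not anything the dump format requires):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a local node, e.g. the Bitnami image

    # Index every scraped post; "mailinglist" is a made-up index name.
    for i, post in enumerate(posts):
        es.index(index="mailinglist", id=i, body=post)

    # A terms aggregation over the body gives rough word counts, handy for
    # picking an initial tag vocabulary. Recent Elasticsearch versions need
    # fielddata enabled on the "message" text field for this to work.
    counts = es.search(index="mailinglist", body={
        "size": 0,
        "aggs": {"top_words": {"terms": {"field": "message", "size": 50}}},
    })
    for bucket in counts["aggregations"]["top_words"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])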

STEP 3: Clean
Tag and categorize every message in the archive. This tedious stage is why you should target a very small list. Scripts can automate some of the work, but each entry needs to be verified manually. We can't surf to web 3.0 on dirty data.
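
The automatable part can be as simple as a rule-based first pass like this sketch; the keyword-to-tag map is invented (seed yours from the word counts above), and every suggestion still gets that manual review:

    # Hypothetical keyword-to-tag rules; build yours from your own word counts.
    TAG_RULES = {
        "install": "installation",
        "segfault": "crash",
        "deprecat": "deprecation",
    }

    def suggest_tags(message):
        """First-pass tags from substring rules; a human still reviews each post."""
        text = message.lower()
        return sorted({tag for needle, tag in TAG_RULES.items() if needle in text})

    print(suggest_tags("My build segfaults right after the install step"))
    # ['crash', 'installation']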

STEP 4: Transform
Write a script to query your database and write the results out in the proper format. Download a sample archive from archive.org and study the format. The closest thing to an official specification of the schema is here.
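
As a sketch of what the transform step can look like, here is the Posts.xml half of a dump written with the standard library. The attribute names follow the public Stack Exchange dumps, but verify them against the sample archive you downloaded; cleaned_posts is a stand-in for whatever your database query returns:

    import xml.etree.ElementTree as ET

    # cleaned_posts stands in for the tagged records queried from your store.
    root = ET.Element("posts")
    for i, post in enumerate(cleaned_posts, start=1):
        ET.SubElement(root, "row", {
            "Id": str(i),
            "PostTypeId": "1",                  # 1 = question in the SE schema
            "CreationDate": post["timestamp"],  # the dumps use ISO 8601 dates
            "Title": post["title"],
            "Body": post["message"],
            "Tags": "".join("<%s>" % t for t in post["tags"]),
        })
    # ElementTree escapes attribute values, so the <tag> brackets come out
    # as &lt;tag&gt;, just as they appear in the real dumps.
    ET.ElementTree(root).write("Posts.xml", encoding="utf-8", xml_declaration=True)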

To be worth your salt, you need to be able to figure this one out for yourself. Transforming data between APIs is the single most valuable Data Science skill that draws the line between "real" data wranglers and script kiddies. Someone asked for an XSD in the comments and was told "That's what separates the men from the boys. If one can't work out what the data types are in the dump with the below description, then maybe one is not worthy and should stick to the happy world of copy-paste JavaScript."

STEP 5: Load (Verify)
Import the data dump into a fresh Askbot, Stack Exchange Data Explorer, or Bitnami OSQA. Surf the Q&A site a bit. Anonymize the names of the users for the sake of privacy.
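
One way to anonymize, sketched below, is a salted hash per sender: everyone gets a stable pseudonym, and only whoever holds the salt can map pseudonyms back to real names.

    import hashlib

    def pseudonym(sender, salt="replace-with-a-private-salt"):
        """Stable pseudonym per sender; without the salt, names can't be brute-forced."""
        digest = hashlib.sha256((salt + sender).encode("utf-8")).hexdigest()
        return "user_" + digest[:8]

    for post in cleaned_posts:
        post["sender"] = pseudonym(post["sender"])

That salt is effectively the user identity key you can offer the list admin in the next step.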

STEP 6: Publish
Save the data dump on your cloud storage account. Send a message to the mailing list with a link to the archive and your intended maintenance schedule. Offer the user identity keys to the admin if s/he wants them. Community members may want to connect author attribution to social media accounts. Push your scripts to GitHub.

STEP 7: Add "Open Data Curator" to your CV and be patient.
"Data curation" is still a new concept to a lot of people. Enjoy the short-lived feeling of being special before recruiters expect it from all of their candidates.

Next month's post is on visualization. Coincidentally, famo.us has just ended its long tease and opened to the public to disrupt our UX mind space.