Sunday, October 21, 2012

Collecting data: Twitpic Scraper with Matlab


For my Machine Learning course this semester, the project I will be working on is Context-Based Object Recognition using Twitter. We would like to use the information associated with tweeted images (author gender, hashtags, caption, comments, etc.) to try to improve recognition. These extra features will be used as a prior, providing context for the classification pipeline.

The first step is to create our dataset. I wrote a Matlab script that scrapes the desired features from Twitpic by querying for the keyword "pet". Here's a first result:
Original twitstream:

And the collected data in Matlab (the title of each subplot is the associated caption, which can be long enough to overlap its neighbors):
The data is stored in a structure that currently has 3 fields: {image, caption, hashtag}. Here's a more detailed view where we can see the extracted #hashtag:

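To make the structure concrete, here is a minimal sketch of how one record with those three fields could be filled in, with the hashtags pulled out of the caption by a regular expression. The sample caption and the variable names are made up for illustration, and this is not necessarily how the linked script does it.

```matlab
% Sample caption (made up for illustration).
caption = 'Meet my new #pet, a very sleepy #cat';

% Capture the run of word characters that follows each '#'.
tokens = regexp(caption, '#(\w+)', 'tokens');

% regexp returns a cell array of 1x1 cells; flatten it into a
% plain cell array of strings.
hashtags = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

% One record with the three fields described above. The image is left
% empty here. The hashtag cell array is wrapped in braces so that
% struct() stores it as a single field value rather than building a
% struct array with one element per hashtag.
entry = struct('image', [], 'caption', caption, 'hashtag', {hashtags});

disp(entry.hashtag);
```

Storing the hashtags as a cell array keeps a variable number of tags per image in a single field.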
Here's a link to the script I wrote: Twitpic Matlab scraper. Note that it uses the Twitpic API to get an XML response. We are aiming to collect a dataset of around 1000 instances.
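As a rough illustration of the XML step, the sketch below parses a small XML string and collects the caption of each image. The XML layout and the element names (images, image, short_id, message) are assumptions made up for this example, not the documented Twitpic response format, and the string stands in for what the API call would return.

```matlab
% Made-up XML standing in for an API response (element names are assumed).
xmlText = ['<images>' ...
           '<image><short_id>abc123</short_id>' ...
           '<message>My #pet dog</message></image>' ...
           '<image><short_id>def456</short_id>' ...
           '<message>Sleepy #cat</message></image>' ...
           '</images>'];

% xmlread normally takes a file name; wrapping the string in a Java
% InputSource lets it parse text that is already in memory.
source = org.xml.sax.InputSource(java.io.StringReader(xmlText));
doc = xmlread(source);

% Collect the text of every <message> element.
nodes = doc.getElementsByTagName('message');
captions = cell(1, nodes.getLength());
for k = 1:nodes.getLength()
    % DOM node lists are zero-based, hence k-1.
    captions{k} = char(nodes.item(k - 1).getTextContent());
end

disp(captions);
```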

One thing I have yet to address: many tweets use characters outside the US-ASCII set (for a "pet" query, a lot of the captions were in Japanese), so I will need to modify the script slightly to handle those tweets differently.
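One possible way to separate those captions, sketched here as an assumption rather than what the script currently does, is to check whether every character code falls in the 7-bit ASCII range. The sample captions are made up.

```matlab
% Two made-up captions. char(29483) is the CJK character for "cat"
% (U+732B), built from its code point to avoid encoding issues in the
% source file.
captions = {'My #pet dog', [char(29483) ' #pet']};

% A caption is pure US-ASCII if all of its character codes are below 128.
isAscii = cellfun(@(c) all(double(c) < 128), captions);

asciiCaptions = captions(isAscii);    % safe for the existing pipeline
otherCaptions = captions(~isAscii);   % need separate handling

fprintf('%d ASCII caption(s), %d non-ASCII caption(s)\n', ...
        numel(asciiCaptions), numel(otherCaptions));
```

Whether the non-ASCII captions should be dropped, translated, or kept as-is depends on how the text features end up being used.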


Edit: The older version of the script would crash if the visited profile had been created recently (and was missing information such as post history) or if the post contained a video instead of an image.

Here is a more robust version of the Twitpic Matlab scraper. The user can now specify how many pages to scrape, and which tag to look for.
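For readers curious about the general idea, here is a minimal sketch of one way to keep a scraping loop from crashing on incomplete posts: wrap the per-item work in try/catch and skip anything that fails. The items below are made-up stand-ins for scraped posts (the second one has no image field, as a video post might), and this is not necessarily how the linked script is written.

```matlab
% Made-up scraped items; the second one is missing its image field.
items = {struct('caption', 'My #pet dog', 'image', zeros(2)), ...
         struct('caption', 'A video post'), ...
         struct('caption', 'Sleepy #cat', 'image', ones(2))};

% Start with an empty struct array that has the expected fields.
dataset = struct('image', {}, 'caption', {});

for k = 1:numel(items)
    try
        item = items{k};
        % Accessing a missing field throws an error, which is caught below.
        entry = struct('image', item.image, 'caption', item.caption);
        dataset(end + 1) = entry;
    catch err
        % Report the problem and move on to the next item.
        fprintf('Skipping item %d: %s\n', k, err.message);
    end
end

fprintf('Kept %d of %d items\n', numel(dataset), numel(items));
```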