Saturday, December 15, 2012
Here's a really good explanation, through concrete examples, of Hidden Markov Models.
[Credits: Professor Moore, Carnegie Mellon SCS].
Saturday, December 8, 2012
New university webpage
Crunch mode over. I just created my personal webpage on Carnegie Mellon's ECE servers.
Links to a project paper on automating Horizontal Gaze Nystagmus (part of the field sobriety tests performed by law enforcement in the US) and the final paper on using Twitter to predict users' political affiliations are on there.
My girlfriend thinks I really should learn HTML5 - That might be my Christmas break project.
Sunday, November 18, 2012
Twitter API & Ruby
Last Monday was the day the mid-semester report was due for our Machine Learning project. That means we went into full crunch mode the week before, and it also means we made a whole bunch of changes to our original idea.
First change: the new project is not so heavily computer vision oriented. We want to classify Twitter users by their political affiliation. This has direct relevance in the context of the 2012 presidential elections. Can we predict a particular user's vote?
Dataset
We collected Twitter user IDs through the Twitter API in Ruby. One of the major issues we had to work around was obtaining ground truth, which unfortunately is not provided in Twitter profiles.
To address this labeling issue, we used some domain knowledge to infer each user's political affiliation, which ensures that we have a label for each of our users. We queried users who follow President Obama. Since following him alone only suggests some sympathy for Democratic views, we also required that a user follow several accounts from the following list: Joe Biden, Stephen Colbert, Jon Stewart, Bill Clinton, Hillary Clinton, Al Gore...
Conversely, a 'Romney' label is applied to users who follow Paul Ryan, Sarah Palin, the NRA, Bill O'Reilly, Rush Limbaugh, Glenn Beck...
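The labeling rule above can be sketched roughly as follows. This is only an illustration, not the code we ran: the `label_user` helper, the `min_extra` threshold (how many accounts count as "several") and the use of display names instead of account IDs are all assumptions, and the followed accounts are passed in as a plain array rather than fetched from the API.

```ruby
require 'set'

# Accounts named in the post; display names are used here for readability.
OBAMA_LIST  = Set['Joe Biden', 'Stephen Colbert', 'Jon Stewart',
                  'Bill Clinton', 'Hillary Clinton', 'Al Gore']
ROMNEY_LIST = Set['Paul Ryan', 'Sarah Palin', 'NRA',
                  "Bill O'Reilly", 'Rush Limbaugh', 'Glenn Beck']

# Returns :obama, :romney, or nil when neither rule fires or both do.
# min_extra is an assumed value for "several".
def label_user(followed, min_extra: 2)
  followed = followed.to_set
  obama  = followed.include?('Barack Obama') &&
           (followed & OBAMA_LIST).size >= min_extra
  romney = (followed & ROMNEY_LIST).size >= min_extra
  return nil if obama == romney # no evidence, or contradictory evidence
  obama ? :obama : :romney
end

p label_user(['Barack Obama', 'Joe Biden', 'Al Gore']) # => :obama
```

Returning nil for ambiguous users keeps them out of the labeled set instead of forcing a guess.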
Algorithm
Once the user IDs were collected and labeled, we fetched their tweets, age, location, relationships (followers/followees)...
The idea is to apply a bag of words approach to each user - hence, the resulting histogram is part of each user's feature vector.
What about the bins? We created a long list of "strong" words (in regex format) - such as "Gun Control", "Obamacare", "Occupy Movement". We believe these words polarize the tweets, thus capturing valuable information to cluster users.
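Here is a minimal sketch of how such a histogram could be built. The three patterns are just the examples quoted above (the real list is much longer), and `STRONG_WORDS` and `word_histogram` are names made up for this illustration.

```ruby
# Illustrative bins only: one regex per "strong" word.
STRONG_WORDS = {
  'gun control'     => /gun\s+control/i,
  'obamacare'       => /obamacare/i,
  'occupy movement' => /occupy\s+movement/i
}.freeze

# For each bin, counts how many times its pattern occurs across a user's tweets.
# The result is one histogram (array of counts) per user.
def word_histogram(tweets)
  STRONG_WORDS.map do |_name, pattern|
    tweets.sum { |text| text.scan(pattern).size }
  end
end

tweets = ['Gun control now!', 'Repeal Obamacare. obamacare is bad.']
p word_histogram(tweets) # => [1, 2, 0]
```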
The next step is to run Lloyd's algorithm (aka k-means) on the instances in the dataset. Because we know that it is heavily dependent on the initialization, we use a method that gives more coherent and more intuitive initializations. Each value of k is associated with an objective function we are trying to minimize, which reflects the intra-cluster variance. We sweep over a range of k, storing the objective function for each, which allows us to plot the objective function vs k.
This plot provides us with some insight as to which value of k we should use. When each instance has a feature vector of length 150+, it is impossible to visualize the data and determine the number of clusters visually.
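The sweep over k can be sketched in plain Ruby as below. This is a toy version under stated assumptions: the centers are initialized naively from the first k points (not the initialization method mentioned above), and `sq_dist`, `nearest`, `kmeans` and `sweep` are hypothetical helper names.

```ruby
# Squared Euclidean distance between two equal-length vectors.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Index of the center closest to the point.
def nearest(point, centers)
  centers.each_index.min_by { |j| sq_dist(point, centers[j]) }
end

# Lloyd's algorithm. Returns [centers, objective], where the objective is the
# sum of squared distances from each point to its assigned center.
def kmeans(points, k, iterations: 50)
  centers = points.first(k).map(&:dup) # naive initialization (assumption)
  iterations.times do
    groups = points.group_by { |p| nearest(p, centers) }
    updated = centers.each_with_index.map do |c, j|
      members = groups[j]
      next c if members.nil? # keep the center of an empty cluster unchanged
      members.transpose.map { |col| col.sum.to_f / members.size }
    end
    break if updated == centers # assignments have stabilized
    centers = updated
  end
  objective = points.sum { |p| sq_dist(p, centers[nearest(p, centers)]) }
  [centers, objective]
end

# Runs k-means for every k in ks and records the final objective.
def sweep(points, ks)
  ks.map { |k| [k, kmeans(points, k).last] }
end

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
sweep(points, 1..3).each { |k, obj| puts "k=#{k} objective=#{obj}" }
```

Plotting the recorded pairs gives the objective vs k curve; a sharp bend in that curve is the usual hint for a reasonable k.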
One benefit of using k-means is that we don't need to carry the dataset for classification. Once the cluster centers have been determined, that is all we need to classify a new instance (unlabeled user).
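As a rough sketch, classifying a new user then comes down to a nearest-center lookup. The `classify` helper and the example centers are assumptions made for illustration; mapping a cluster index back to a political label is a separate step that is not shown here.

```ruby
# Returns the index of the cluster center closest (in squared Euclidean
# distance) to the feature vector of a new, unlabeled user.
def classify(features, centers)
  centers.each_index.min_by do |j|
    features.zip(centers[j]).sum { |x, c| (x - c)**2 }
  end
end

centers = [[0.0, 0.5], [10.0, 10.5]] # made-up centers from a previous k-means run
p classify([9, 9], centers) # => 1
```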
The last step is to perform dimensionality reduction to express the data with only the most relevant features. PCA and LSI (from topic modeling) are paths we will explore at that point.
One issue we've had to deal with is getting around the restriction Twitter sets on the number of queries. A single authorization token (on Twitter for developers) will provide a maximum of 500 queries/hour. To get past this, we created a bunch of authorization tokens, which we cycle through.
Here's what a single token authentication looks like, followed by a user timeline query:
require 'twitter'
require 'json'
require 'pp'
require_relative 'Token_Nico.rb'

# Credentials from Twitter for developers (masked here)
YOUR_CONSUMER_KEY       = "mnnH8LoWM7#########"
YOUR_CONSUMER_SECRET    = "FTYr8xgRdTyMEACPEO9Jfxl##################"
YOUR_OAUTH_TOKEN        = "23894652-NeIDQ4JeHMofJxldF#######################"
YOUR_OAUTH_TOKEN_SECRET = "UgvOuWpaTTnhKKLpiHz9##################"

@client = Twitter::Client.new(
  :consumer_key       => YOUR_CONSUMER_KEY,
  :consumer_secret    => YOUR_CONSUMER_SECRET,
  :oauth_token        => YOUR_OAUTH_TOKEN,
  :oauth_token_secret => YOUR_OAUTH_TOKEN_SECRET
)

# get timeline from id (id holds the ID of the user being queried)
tw_data = @client.user_timeline(id.to_i, :count => 200, :exclude_replies => false, :include_entities => true)
'Token_Nico.rb' contains a class definition as follows. Token is an array of token arrays. The idea is to loop through the authentication keys until either we get the user timeline, or we determine that the timeline is protected (profile set to private).
class ClassToken
  # Token (not shown) is the array of token arrays described above

  # Build a new client using the i-th set of credentials
  def self.re_configure(i)
    pp "reconfigure i=#{i}"
    pp Token[i][:consumer_key]
    @client = Twitter::Client.new(
      :consumer_key       => Token[i][:consumer_key],
      :consumer_secret    => Token[i][:consumer_secret],
      :oauth_token        => Token[i][:oauth_token],
      :oauth_token_secret => Token[i][:oauth_token_secret]
    )
    return @client
  end

  # Number of available tokens
  def self.howmany?
    return Token.count
  end
end
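The loop over tokens described above might look something like this. It is a sketch under assumptions: `RateLimited` and `ProtectedTimeline` are hypothetical stand-ins for whatever errors the API client actually raises, and the actual query is passed in as a block so the rotation logic can be shown (and tried) without a live connection.

```ruby
# Hypothetical error types standing in for the client's own exceptions.
class RateLimited < StandardError; end
class ProtectedTimeline < StandardError; end

# Tries each token index in turn. The block receives the index and is expected
# to reconfigure the client with that token and return the timeline.
# Returns the timeline, :protected for a private profile, or nil when every
# token has hit its rate limit.
def fetch_with_rotation(token_count)
  token_count.times do |i|
    begin
      return yield(i)
    rescue RateLimited
      next # this token is exhausted, move on to the next one
    rescue ProtectedTimeline
      return :protected
    end
  end
  nil
end

# Stubbed usage: the first two tokens are rate limited, the third one works.
result = fetch_with_rotation(3) do |i|
  raise RateLimited if i < 2
  "timeline fetched with token #{i}"
end
puts result
```

With the class above, the block would presumably call ClassToken.re_configure(i), run the user timeline query, and translate the client's own errors into these two cases.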
Sunday, October 21, 2012
Collecting data: Twitpic Scraper with Matlab
For my Machine Learning course this semester, the project I will be working on is Context-Based Object Recognition using Twitter. We would like to use the information associated with tweeted images (author gender, hashtags, caption, comments, etc.) to try to improve recognition. Those extra features will be used as a prior, providing a context for the classification pipeline.
The first step is to create our dataset. I wrote a Matlab script that scrapes the desired features from twitpic with a query for the keyword "pet". Here's a first result:
Original twitstream:
and collected data in Matlab (title of each subplot is the associated caption, which can be too long and overlap):
The data is stored in a structure that currently has 3 fields: {image, caption, hashtag}. Here's a more detailed view where we can see the extracted #hashtag:
Here's a link to the script I wrote:
One thing I have yet to address: it seems a lot of tweets use character sets other than US-ASCII (for a "pet" query, a lot of the captions were in Japanese). So I will need to modify the script slightly to treat those tweets differently.
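To illustrate the idea (in Ruby, as used elsewhere on this blog, rather than Matlab), here is a small sketch of pulling hashtags out of a caption and flagging captions that are not pure US-ASCII. The helper names are made up, and this is not how the Matlab script does it.

```ruby
# Extracts hashtags from a caption. \p{L} and \p{N} match letters and digits
# in any script, so non-Latin hashtags are captured as well.
def hashtags(caption)
  caption.scan(/#([\p{L}\p{N}_]+)/).flatten
end

# True when the caption contains any character outside US-ASCII.
def non_ascii?(caption)
  !caption.ascii_only?
end

p hashtags('My #cat and #dog_2') # => ["cat", "dog_2"]
p non_ascii?('My #cat')          # => false
```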
Edit: An older version of the script would crash if the visited profile was recently created (missing information such as post history) or if there was a video instead of an image.
Here is a more robust version of the Twitpic Matlab scraper. The user can now specify how many pages to scrape, and which tag to look for.
Tuesday, September 18, 2012
Update -
It's been some time since my last post. Many things have happened in the meantime:
- I interned at ecoATM, in San Diego -
A full blog is dedicated to the work I did there on their computer vision algorithm, but unfortunately it's confidential. A few "customers" have adopted "adversarial behaviors" and managed to extract money from the kiosks, so ecoATM won't let me talk about their recognition system. The work I did addressed these issues, though.
- I graduated from UCSD!
- I moved to Pittsburgh for grad school!
Taking all the graduate Computer Vision courses at UCSD my senior year played out nicely - I can now focus on ML here at Carnegie Mellon.
The university has been great so far - the curriculum features so many interesting classes that I feel like taking all of them. There are tons of social, technical, and recruiting events happening every day. I don't think I've cooked anything in the last 2 weeks - I've been eating for free by going to a few talks. The job fair was insane: so many tech companies and startups showed up, all eager to recruit CMU students. I had an interview this very morning with Microsoft. I'd love to do something in Computer Vision with the Microsoft Research team! Maybe work on the next Photosynth?
Fingers crossed.
Thursday, February 23, 2012
Display notes with MIDI
When jamming with new people who don't have a very sharp musical intuition, explaining what you are playing can really bring the spontaneity of the moment down.
Using MIDI messages and a simple serial communication system between my keyboard's MIDI out, an Arduino, and Processing, we can display notes being struck on the computer screen in real time, for everyone to see.
Here's what it looks like:
How it works:
MIDI messages are sent to the microcontroller over serial. The 3 message bytes are interpreted and a corresponding character is printed over serial, for the Processing program to read. The problem is that the Arduino can't read incoming bytes from the keyboard and print over the same serial channel. So we need SoftwareSerial to create a second serial connection over one of the digital pins (I am using pin 2).
Note: We need to set the baud rate of the MIDI serial connection to 31250 for MIDI communication.
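The interpretation of the 3 message bytes can be sketched like this. It is written in Ruby as a stand-in for the Arduino logic, not a copy of it: `decode_note_on` is a hypothetical helper, and the octave numbering assumes MIDI note 60 is C4 (some gear calls it C3).

```ruby
NOTE_NAMES = %w[C C# D D# E F F# G G# A A# B].freeze

# Decodes a 3-byte MIDI channel message (status, note number, velocity).
# Returns the note name for a note-on with non-zero velocity, otherwise nil.
def decode_note_on(status, note, velocity)
  return nil unless (status & 0xF0) == 0x90 # 0x9n = note-on on channel n
  return nil if velocity.zero?              # velocity 0 is treated as note-off
  octave = note / 12 - 1                    # assumes MIDI note 60 = C4
  "#{NOTE_NAMES[note % 12]}#{octave}"
end

p decode_note_on(0x90, 60, 100) # => "C4"
p decode_note_on(0x80, 60, 0)   # => nil
```

The third byte is already available here, so scaling the displayed text by velocity would only need the return value to carry it along.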
Here is the setup:
Files:
- Arduino Code
- Processing Code
Possible improvements:
- Use the 3rd MIDI byte to extract velocity (how hard the key is struck) and change the size of the text accordingly. This will emphasize keys that are struck harder.
- Polyphony: recognize and display chords.
- Display only the bass line, since that carries the most information for improvisations.