Sunday, November 18, 2012

Twitter API & Ruby

     Last Monday, the mid-semester report was due for our Machine Learning project. That meant full crunch mode the week before, and a whole bunch of changes from our original idea.
First change: the project is no longer so heavily computer-vision oriented. Instead, we want to classify Twitter users by their political affiliation, which has direct relevance in the context of the 2012 presidential election. Can we predict a particular user's vote?

Dataset
     We collected Twitter user IDs through the Twitter API in Ruby. One of the major issues we had to work around was establishing a ground truth, which unfortunately is not provided in Twitter profiles.
To address this labeling issue, we designed an approach based on domain knowledge that infers a political affiliation, and thus a label, for each of our users. We queried users who follow President Obama; since that alone only weakly reflects Democratic convictions, we also required that each user follow several accounts from the following list: Joe Biden, Stephen Colbert, Jon Stewart, Bill Clinton, Hillary Clinton, Al Gore...
Conversely, a 'Romney' label is applied to users who follow several of: Paul Ryan, Sarah Palin, the NRA, Bill O'Reilly, Rush Limbaugh, Glenn Beck...
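The heuristic can be sketched in Ruby as follows. The anchor handles and the two-match threshold below are illustrative placeholders, not our exact lists:

```ruby
# Illustrative anchor accounts (our real lists are longer)
DEM_ANCHORS = ["JoeBiden", "StephenAtHome", "billclinton", "algore"]
GOP_ANCHORS = ["PaulRyanVP", "SarahPalinUSA", "NRA", "rushlimbaugh"]

# followees: array of screen names the user follows
def infer_label(followees, min_matches = 2)
  if followees.include?("BarackObama") &&
     (followees & DEM_ANCHORS).size >= min_matches
    :obama
  elsif (followees & GOP_ANCHORS).size >= min_matches
    :romney
  else
    nil  # ground truth unknown -- skip this user
  end
end
```

Users matching neither rule are simply dropped from the dataset, since we cannot label them.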

Algorithm
     Once the user IDs were collected and labeled, we fetched their tweets, age, location, relationships (followers/followees)...
The idea is to apply a bag-of-words approach to each user: the resulting histogram becomes part of that user's feature vector.
What about the bins? We created a long list of "strong" words (in regex format) - such as "Gun Control", "Obamacare", "Occupy Movement". We believe these words polarize the tweets, thus capturing valuable information to cluster users.
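A minimal sketch of this feature extraction: one bin per "strong" regex, counting how many of a user's tweets match it. The three patterns below are just examples from the list:

```ruby
# One bin per polarizing pattern (our real list has many more entries)
STRONG_WORDS = [
  /gun\s+control/i,
  /obamacare/i,
  /occupy\s+(wall\s+street|movement)/i
]

# tweets: array of tweet texts; returns one count per bin
def word_histogram(tweets)
  STRONG_WORDS.map do |re|
    tweets.count { |t| t =~ re }
  end
end
```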
The next step is to run Lloyd's algorithm (a.k.a. k-means) on the instances in the dataset. Because k-means is heavily dependent on its initialization, we use a seeding method that produces more coherent and intuitive starting points. Each value of k is associated with an objective function we try to minimize, which reflects the intra-cluster variance. We sweep over a range of k, storing the objective value for each, which lets us plot the objective function against k.
This plot gives us insight into which value of k to use: when each instance has a feature vector of length 150+, it is impossible to visualize the data and pick the number of clusters by eye.
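A bare-bones sketch of Lloyd's algorithm with the sweep over k, in Ruby. The initialization here is plain random sampling, not our seeding method, and the data layout (arrays of floats) is assumed for illustration:

```ruby
# Squared Euclidean distance between two feature vectors
def dist2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Lloyd's algorithm: returns [centers, objective] for a given k
def kmeans(points, k, iters = 50)
  centers = points.sample(k)  # naive random init (not our seeding method)
  iters.times do
    # assign each point to its nearest center
    clusters = points.group_by { |p| (0...k).min_by { |i| dist2(p, centers[i]) } }
    # move each center to the mean of its cluster
    centers = (0...k).map do |i|
      pts = clusters[i] || [centers[i]]  # keep an empty cluster's old center
      pts.transpose.map { |col| col.sum.to_f / col.size }
    end
  end
  # objective: total intra-cluster squared distance
  objective = points.sum { |p| centers.map { |c| dist2(p, c) }.min }
  [centers, objective]
end

# Sweep over k, collecting the data for the objective-vs-k plot:
# (1..10).each { |k| puts "#{k}: #{kmeans(data, k).last}" }
```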
One benefit of using k-means is that we don't need to carry the dataset around for classification: once the cluster centers have been determined, they are all we need to classify a new instance (an unlabeled user).
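Classifying a new instance then amounts to finding the nearest center, for example:

```ruby
# Return the index of the cluster center nearest to feature_vec
def nearest_cluster(feature_vec, centers)
  centers.each_with_index.min_by { |c, _i|
    c.zip(feature_vec).sum { |a, b| (a - b)**2 }
  }.last
end
```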
The last step is to perform dimensionality reduction, to express the data with only the most relevant features. PCA and LSI (from topic modeling) are paths we will explore at that point.

One issue we've had to deal with is the restriction on the number of queries set by Twitter: a single authorization token (from Twitter for developers) allows a maximum of 500 queries/hour. To get past this, we created a bunch of authorization tokens, which we cycle through.

Here's what a single token authentication looks like, followed by a user timeline query:
require 'twitter'
require 'json'
require 'pp'
require_relative 'Token_Nico.rb'

# Credentials from the Twitter developer dashboard (redacted)
YOUR_CONSUMER_KEY       = "mnnH8LoWM7#########"
YOUR_CONSUMER_SECRET    = "FTYr8xgRdTyMEACPEO9Jfxl##################"
YOUR_OAUTH_TOKEN        = "23894652-NeIDQ4JeHMofJxldF#######################"
YOUR_OAUTH_TOKEN_SECRET = "UgvOuWpaTTnhKKLpiHz9##################"

@client = Twitter::Client.new(
  :consumer_key       => YOUR_CONSUMER_KEY,
  :consumer_secret    => YOUR_CONSUMER_SECRET,
  :oauth_token        => YOUR_OAUTH_TOKEN,
  :oauth_token_secret => YOUR_OAUTH_TOKEN_SECRET
)

# Fetch up to 200 tweets from a user's timeline, given their numeric id
tw_data = @client.user_timeline(id.to_i,
                                :count            => 200,
                                :exclude_replies  => false,
                                :include_entities => true)

'Token_Nico.rb' contains a class definition as follows:

class ClassToken
  # Build a new client using the i-th set of credentials
  def self.re_configure(i)
    pp "reconfigure i=#{i}"
    pp Token[i][:consumer_key]
    @client = Twitter::Client.new(
      :consumer_key       => Token[i][:consumer_key],
      :consumer_secret    => Token[i][:consumer_secret],
      :oauth_token        => Token[i][:oauth_token],
      :oauth_token_secret => Token[i][:oauth_token_secret]
    )
    return @client
  end

  # Number of tokens available for cycling
  def self.howmany?
    return Token.count
  end
end
 Token is an array of token hashes, one per account. The idea is to loop through the authentication keys until either we get the user timeline, or we determine that the timeline is protected (profile set to private).
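The control flow of that loop can be sketched as follows. Here `fetch_timeline` is a callable standing in for the real `@client.user_timeline` call (which would go through `ClassToken.re_configure`), and the two exception classes are hypothetical stand-ins for the errors the Twitter gem raises:

```ruby
# Hypothetical error types for the sketch
class RateLimited       < StandardError; end
class ProtectedTimeline < StandardError; end

# Try each token in turn until a fetch succeeds or we give up
def fetch_with_rotation(user_id, n_tokens, fetch_timeline)
  n_tokens.times do |i|
    begin
      # real code: @client = ClassToken.re_configure(i)
      return fetch_timeline.call(user_id, i)
    rescue RateLimited
      next               # token i exhausted: rotate to the next one
    rescue ProtectedTimeline
      return nil         # private profile: skip this user
    end
  end
  nil                    # every token was rate limited
end
```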