Here's a really good explanation, through concrete examples, of Hidden Markov Models.
[Credits: Professor Moore, Carnegie Mellon SCS].
Saturday, December 15, 2012
Saturday, December 8, 2012
New university webpage
Crunch mode over. I just created my personal webpage on Carnegie Mellon's ECE servers.
Links to a project paper on automating Horizontal Gaze Nystagmus (part of the field sobriety tests performed by law enforcement in the US) and the final paper on using Twitter to predict users' political affiliations are on there.
My girlfriend thinks I really should learn HTML5 - that might be my Christmas break project.
Sunday, November 18, 2012
Twitter API & Ruby
Last Monday was the day the mid-semester report was due for our Machine Learning project. That meant full crunch mode the week before - and a whole bunch of changes from our original idea.
First change, the new project is not so heavily computer vision oriented: we want to classify Twitter users on their political affiliation. This has direct relevance in the context of the 2012 presidential elections. Can we predict a particular user's vote?
Dataset
We collected Twitter user IDs through the Twitter API in Ruby. One of the major issues we had to work around was getting ground truth labels - which, unfortunately, are not provided in Twitter profiles.
To address this labeling issue, we devised an approach based on domain knowledge that lets us infer a political affiliation, and thus a label, for each of our users. We queried users who follow President Obama; since that alone only weakly reflects Democratic convictions, we also required that a user follow several accounts from the following list: Joe Biden, Stephen Colbert, Jon Stewart, Bill Clinton, Hillary Clinton, Al Gore...
Conversely, a 'Romney' label is applied to users who follow Paul Ryan, Sarah Palin, the NRA, Bill O'Reilly, Rush Limbaugh, Glenn Beck...
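For illustration, the labeling rule amounts to a simple predicate. A minimal Python sketch (the handles and the `min_hits` threshold here are illustrative assumptions, not our actual lists):

```python
# Hypothetical sketch of the label-inference rule: a user gets a label only
# if they follow the side's figurehead AND several accounts from that side's list.
DEM_ANCHOR = "BarackObama"
DEM_LIST = {"JoeBiden", "StephenAtHome", "TheDailyShow", "billclinton", "algore"}
GOP_ANCHOR = "MittRomney"
GOP_LIST = {"PaulRyanVP", "SarahPalinUSA", "NRA", "oreillyfactor", "glennbeck"}

def infer_label(followed, min_hits=2):
    """Return 'Obama', 'Romney', or None based on who the user follows."""
    followed = set(followed)
    if DEM_ANCHOR in followed and len(followed & DEM_LIST) >= min_hits:
        return "Obama"
    if GOP_ANCHOR in followed and len(followed & GOP_LIST) >= min_hits:
        return "Romney"
    return None  # ambiguous: drop the user from the dataset
```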
Algorithm
Once the user IDs were collected and labeled, we fetched their tweets, age, location, relationships (followers/followees)...
The idea is to apply a bag-of-words approach to each user; the resulting histogram forms part of that user's feature vector.
What about the bins? We created a long list of "strong" words (in regex format) - such as "Gun Control", "Obamacare", "Occupy Movement". We believe these words polarize the tweets, thus capturing valuable information to cluster users.
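A sketch of the histogram construction (Python for illustration; the three patterns are a tiny subset standing in for the real list):

```python
import re

# Illustrative subset of the "strong word" regex list (the real list is much longer).
STRONG_WORDS = [r"gun\s+control", r"obamacare", r"occupy\s+\w+"]
PATTERNS = [re.compile(p, re.IGNORECASE) for p in STRONG_WORDS]

def word_histogram(tweets):
    """Count regex matches over all of a user's tweets: one bin per strong word."""
    hist = [0] * len(PATTERNS)
    for tweet in tweets:
        for i, pat in enumerate(PATTERNS):
            hist[i] += len(pat.findall(tweet))
    return hist
```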
The next step is to run Lloyd's algorithm (a.k.a. k-means) on the instances in the dataset. Because we know it is heavily dependent on initialization, we use a method that produces more coherent and more intuitive initializations. Each value of k is associated with an objective function we are trying to minimize, which reflects the intra-cluster variance. We sweep over a range of k, storing the objective function for each, which allows us to plot the objective function vs. k.
This plot provides us with some insight as to which optimal value for k we should use. When each instance has a feature vector of length 150+, it is impossible to visualize the data and determine the number of clusters visually.
One benefit of using k-means is that we don't need to carry the dataset for classification. Once the cluster centers have been determined, that is all we need to classify a new instance (unlabeled user).
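To make the pipeline concrete, here is a minimal numpy sketch (not our actual implementation - a fixed random initialization stands in for the smarter initialization mentioned above):

```python
import numpy as np

def lloyds(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (k-means). Returns the cluster centers and
    the objective: the sum of squared distances to the assigned centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():          # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d.min(axis=1).sum()

def classify(x, centers):
    """A new (unlabeled) instance needs only the centers, not the dataset."""
    return int(((centers - x) ** 2).sum(axis=1).argmin())
```

Sweeping `k` and plotting the returned objective gives the elbow curve described above.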
The last step is to perform dimensionality reduction to express the data with only the most relevant features. PCA and LSI (in topic models) are paths we will explore at that point.
One issue we've had to deal with is the restriction on the number of queries set by Twitter. A single authorization token (from Twitter for developers) allows a maximum of 500 queries/hour. To get past this, we created a pool of authorization tokens, which we cycle through.
Here's what a single token authentication looks like, followed by a user timeline query:
require 'twitter'
require 'json'
require 'pp'
require_relative 'Token_Nico.rb'
YOUR_CONSUMER_KEY= "mnnH8LoWM7#########"
YOUR_CONSUMER_SECRET ="FTYr8xgRdTyMEACPEO9Jfxl##################"
YOUR_OAUTH_TOKEN = "23894652-NeIDQ4JeHMofJxldF#######################"
YOUR_OAUTH_TOKEN_SECRET= "UgvOuWpaTTnhKKLpiHz9##################"
@client = Twitter::Client.new(
:consumer_key => YOUR_CONSUMER_KEY,
:consumer_secret => YOUR_CONSUMER_SECRET,
:oauth_token => YOUR_OAUTH_TOKEN,
:oauth_token_secret => YOUR_OAUTH_TOKEN_SECRET
)
# get timeline from id
tw_data = @client.user_timeline(id.to_i, :count => 200, :exclude_replies => false, :include_entities => true)
'Token_Nico.rb' contains a class definition as follows:
Token is an array of token arrays. The idea is to loop through the authentication keys until either we get the user timeline, or we determine that the timeline is protected (profile set to private).

class ClassToken
def self.re_configure(i)
pp "reconfigure i=#{i}"
pp Token[i][:consumer_key]
@client = Twitter::Client.new(
:consumer_key => Token[i][:consumer_key],
:consumer_secret => Token[i][:consumer_secret],
:oauth_token =>Token[i][:oauth_token],
:oauth_token_secret => Token[i][:oauth_token_secret]
)
return @client
end
def self.howmany?
return Token.count
end
end
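The cycling logic itself is independent of the Twitter gem; here is a hedged Python sketch of the idea, where `RateLimited` and the `fetch` callback are stand-ins for the real client's error and query call:

```python
class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by a real Twitter client."""

def fetch_with_rotation(tokens, fetch):
    """Try each auth token in turn until one is under its query limit.

    `fetch(token)` should return data or raise RateLimited; we give up
    only after every token in the pool has been exhausted."""
    for token in tokens:
        try:
            return fetch(token)
        except RateLimited:
            continue  # this token hit its 500 queries/hour cap; rotate
    raise RuntimeError("all tokens are rate-limited")
```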
Sunday, October 21, 2012
Collecting data: Twitpic Scraper with Matlab
For my Machine Learning course this semester, the project I will be working on is Context-Based Object Recognition using Twitter. We would like to use the information associated with tweeted images (author gender, hashtags, caption, comments, etc.) to try to improve recognition. Those extra features will be used as a prior, providing a context for the classification pipeline.
The first step is to create our dataset. I wrote a Matlab script that scrapes the desired features from Twitpic with a query for the keyword "pet". Here's a first result:
Original twitstream:
and collected data in Matlab (title of each subplot is the associated caption, which can be too long and overlap):
The data is stored in a structure that currently has 3 fields: {image, caption, hashtag}. Here's a more detailed view where we can see the extracted #hashtag:
Here's a link to the script I wrote:
One thing I have yet to address: a lot of tweets use non-US-ASCII character sets (for a "pet" query, many of the captions were in Japanese), so I will need to modify the script slightly to treat those tweets differently.
Edit: Older version of the script would crash if the visited profile was recently created (missing information such as post history) or if there was a video instead of an image.
Here is a more robust version of the Twitpic Matlab scraper. The user can now specify how many pages to scrape, and which tag to look for.
Tuesday, September 18, 2012
Update -
It's been some time since my last post. Many things have happened in the meantime:
- I interned at ecoATM, in San Diego!
A full blog is dedicated to the work I did there on their computer vision algorithm, but unfortunately it's confidential. A few "customers" adopted "adversarial behaviors" and managed to extract money from the kiosks, so ecoATM won't let me talk about their recognition system. The work I did addressed these issues, though.
- I graduated from UCSD!
- I moved to Pittsburgh for grad school!
Taking all the graduate Computer Vision courses at UCSD my senior year played out nicely - I can now focus on ML here at Carnegie Mellon.
The university has been great so far - the curriculum features so many interesting classes, I feel like taking all of them. There are tons of social, technical, and recruiting events happening every day. I don't think I've cooked anything in the last 2 weeks - I've been eating for free by going to a few talks. The job fair was insane: so many tech companies and startups showed up, all eager to recruit CMU students. I had an interview this very morning with Microsoft. I'd love to do something in Computer Vision with the Microsoft Research team! Maybe work on the next Photosynth?
Fingers crossed.
Thursday, February 23, 2012
Display notes with MIDI
When jamming with new people who don't have a very sharp musical intuition, explaining what you are playing can really bring the spontaneity of the moment down.
Using MIDI messages and a simple serial communication system between my keyboard's MIDI out, an Arduino, and Processing, we can display notes being struck on the computer screen in real time, for everyone to see.
Here's what it looks like:
How it works:
MIDI messages are sent to the microcontroller over serial. The 3 message bytes are interpreted, and a corresponding character is printed over serial for the Processing program to read. The problem is that the Arduino can't read incoming bytes from the keyboard and print over the same serial channel, so we need SoftwareSerial to create a second serial connection over one of the digital pins (I am using pin 2).
Note: we need to set the microcontroller's baud rate to 31250 for MIDI communication.
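For reference, mapping the note byte (the second MIDI byte) to a printable note name is simple modular arithmetic; here is a Python sketch of that mapping, using the common convention that MIDI note 60 is C4 (middle C):

```python
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(note):
    """Convert a MIDI note number (0-127) to a name like 'C4' or 'A#3'."""
    octave = note // 12 - 1   # convention: note 60 -> octave 4
    return NAMES[note % 12] + str(octave)
```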
Here is the setup:
Files:
- Arduino Code
- Processing Code
Possible improvements:
- Use the 3rd MIDI byte to extract velocity (how hard the key is struck) and scale the text size accordingly, emphasizing keys that are struck harder.
- Polyphony: recognize and display chords.
- Display only the bass line, since that carries the most information for improvisations.
Monday, November 7, 2011
Linear Predictive Coding - Speech Compression
In this post, we will explore an approach to implementing audio compression for speech signals.
In order to speed up data transmission, the speech signal will be quantized to a certain bit rate, then sent over the channel and reconstructed. Everything here was done in MATLAB.
First step: we read a speech signal in. The signal is stored as a column vector y.
Quantization
The first task is to quantize our vector y by creating permissible levels for the magnitude of each element of y. Looking at y, we see that the magnitude varies extensively over the length of the signal:
Figure 1. Original Speech Signal y
Therefore, choosing a global quantization interval q for the entire signal will be detrimental to part of y. The resolution would be either too coarse for small variations, or too fine for the big spikes.
To perform a more effective quantization, we need to do it locally.
We segment y into blocks of length 160, which is small enough compared to the length of any speech signal to be considered ‘local’ (for comparison, 1 second of speech corresponds to roughly 9000 points).
For each block, we determine the quantization interval as follows:
1. Compute the mean of the block.
2. Compute the variance of the block.
3. Threshold "outliers" that fall outside the range mean ± α*var, where α lets the user adjust which outliers to cut.
4. The quantization interval is q = (max − min) / L, where L = 2^r is the number of levels for bit rate r.
5. The block is then quantized: yq = round(y/q)*q.
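The five steps above can be sketched as follows (numpy here rather than the MATLAB used in the project; `alpha` and `r` as defined above):

```python
import numpy as np

def quantize_block(block, alpha, r):
    """Steps 1-5: clip outliers beyond mean +/- alpha*var, then round
    the block onto L = 2**r uniform levels of width q."""
    m, v = block.mean(), block.var()
    clipped = np.clip(block, m - alpha * v, m + alpha * v)   # step 3
    q = (clipped.max() - clipped.min()) / 2 ** r             # step 4
    if q == 0:
        return clipped           # constant block: nothing to quantize
    return np.round(clipped / q) * q                         # step 5
```

Applied block by block over segments of length 160, this reproduces the local quantization described above.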
Below is a plot of the 50th block of y, showing its mean and the range mean ± α*var.
Figure 2. Mean and mean ± α*var of the 50th block of y
Below is the quantized result of the same block:
Figure 3. Quantized 50th block of y
The larger α is, the more of the original signal is preserved. The larger the bit rate r, the finer the quantization intervals are.
Therefore, choosing large α and r reduces distortion in the quantized signal.
In our case, taking α=3 and r=8 gave a reasonable audible result, and yielded an MSE between y and yq of 1.5743.
When the MSE approaches 4, the quantized signal is fairly hard to comprehend, and sounds snowy and muffled.
One should keep in mind however that the MSE is a mathematical measure of the error between y and yq, and might not accurately reflect the human perception system.
The LPC model
The LPC model we use relies on the idea that an element y(i) can be approximated, or predicted, by a linear combination of the 10 previous samples:

y(i) ≈ a(1)*y(i-1) + ... + a(10)*y(i-10)

This sum can be rewritten in matrix form as A*x, using a Toeplitz matrix A filled with the appropriate coefficients y(i). Note that samples with negative indices are set to 0.
We want to fit our model by finding the coefficients a(k) that minimize the least-squares problem. Hence we project y onto the space spanned by the columns of A and solve the resulting normal equations for x, which contains the optimal coefficients a(k).
Because the space is real, the adjoint A* is simply the transpose of A:

A^T A x = A^T y

Solving for x in MATLAB, locally for each block, gives the optimal coefficients a(k). Thus, Ax is the best approximation of y that lies in the range of A; it's the closest we can get to our original signal by linear combinations of previous y(i)'s.
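For illustration, here is a numpy sketch of the per-block fit (order 10, with `np.linalg.lstsq` standing in for the MATLAB solve; not the original code):

```python
import numpy as np

def lpc_fit(y, order=10):
    """Least-squares LPC fit. Row n of the Toeplitz-structured matrix A
    holds y(n-1), ..., y(n-order), with negative indices taken as 0, so
    A @ x is the linear prediction and e = y - A @ x is the residual."""
    n = len(y)
    A = np.zeros((n, order))
    for k in range(1, order + 1):
        A[k:, k - 1] = y[:n - k]
    x, *_ = np.linalg.lstsq(A, y, rcond=None)  # solves A^T A x = A^T y
    return x, y - A @ x
```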
Error
Observe that since Ax lives in the space spanned by the columns of A, and y does not, we have an error vector e = y - Ax.
We can compute the error for each block, then concatenate to get a vector of the same length as y. We then verify that we get the same exact error vector e by implementing the FIR equation directly:
Indeed, the MSE between these two residual vectors is negligible: 1.2730e-034. This acts as a sanity check.
Since everything is in order, we can use the equation

y(n) = a(1)*y(n-1) + ... + a(10)*y(n-10) + e(n)

to retrieve our original signal, since we have the optimal coefficients a(k) and the error e. Listening to the resulting y_reconstruct seems to confirm our reasoning: it sounds exactly like the original signal. The MSE between y and y_reconstruct is the final check: we get
MSE(y , y_reconstruct) = 5.3181e-037
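The reconstruction itself is a direct recursion on the same equation; a small numpy sketch (again, an illustration rather than the original MATLAB):

```python
import numpy as np

def ar_reconstruct(a, e):
    """Rebuild y(n) = a(1)*y(n-1) + ... + a(order)*y(n-order) + e(n),
    sample by sample, taking y at negative indices to be 0."""
    order, n = len(a), len(e)
    y = np.zeros(n)
    for i in range(n):
        y[i] = e[i]
        for k in range(1, order + 1):
            if i - k >= 0:
                y[i] += a[k - 1] * y[i - k]
    return y
```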
Quantizing the residuals
Next, our goal is to reconstruct y this time using quantized error e. We see that the distortion introduced by quantization will prevent us from reconstructing the original y exactly.
We call the reconstructed signal y_hat.
Using

y_hat(n) = a(1)*y_hat(n-1) + ... + a(10)*y_hat(n-10) + eq(n)

where eq(n) is the quantized residuals.
We find that the quality of y_hat is perceptually very close to the original y. The MSE is found to be MSE3 = 0.0872 for α = 2 and r = 8. The speech is clear and audible.
The relationship between error and bit rate value r is plotted in the following figure:
Figure 4.a) MSE(y,yq) vs r. b) MSE(y, y_hat)
We see that the residual error decreases as r increases, which conforms to our earlier reasoning. Qualitatively, the reduced distortion as r increases is noticeable: the perceived quality of speech is appreciably better. When the number of quantization levels is high, the speech signal is not forced onto a coarse set of values, and the effects of rounding are less significant.
Figure 4 tells us that increasing r affects the MSE relatively weakly when r > 5: the MSE remains roughly constant beyond that point.
Quantizing the residuals and the coefficients
Until now, only the residuals were quantized, which allowed for rather successful reconstruction of the original speech signal.
Now we want to also quantize the coefficients x = { a(k) }. Using the same process, we decide on a factor α = 2 and truncate the outliers. Given the order of the process, quantizing x yields a new error vector e, which needs to be quantized again. Then, using the AR filter described previously, we reconstruct a signal y_recon. We note that the error is much greater, because distortions are introduced at two stages of the process: when quantizing the coefficients x = {a(k)} and when quantizing the resulting e vector.
Figure 5. MSE(y,yq) & MSE(y,y_recon) vs r for α = 1, 1.5, 2.
The reason behind my choice of α = 2 is the following: in a speech recording, the large variations in amplitude are the most significant, because they represent the formants that the human perception system decodes. Therefore, we want a quantization set by the large coefficients, which will indeed cause several small amplitudes to be rounded to 0. In that sense, it acts as a filter on the signal y. I believe this is why low values of α fail to reconstruct an acceptable signal.
Decreasing the bit rate r yielded a segmented sound, similar to a voice heard through a fan. This is considerably reduced for r ≥ 4, a result corroborated by the graphs of MSE(y,y_recon) vs r.
It should be noted that α is particular to the context of this project: the choice of its value should differ if this process is to be applied to other types of signal. Speech has characteristic spikes which do not occur in other signals, hence that conditions our choice of α for cutting out the outliers.
I did not encounter any issues with AR filter stability. However, because AR filters are recursive, the feedback can generate instability.
The resulting reconstructed signal is definitely clear and audible, although it sounds different from the original file.