In this post, we will explore an approach to implementing audio compression for speech signal.
In order to speed up data transmission, the speech signal will be quantized to a certain bit rate, then sent over the channel and reconstructed. Everything here was done in MATLAB.
In order to speed up data transmission, the speech signal will be quantized to a certain bit rate, then sent over the channel and reconstructed. Everything here was done in MATLAB.
First step: we read a speech signal in. The signal is stored as a column vector y.
Quantization
The first task is to quantize our vector y by creating permissible levels for the magnitude of each element of y. Looking at y, we see that the magnitude varies extensively over the length of the signal:
Figure 1. Original Speech Signal y
Therefore, choosing a global quantization interval q for the entire signal will be detrimental to part of y. The resolution would be either too coarse for small variations, or too fine for the big spikes.
To perform a more effective quantization, we need to do it locally.
We segment y into blocks of length 160, which is small enough compared to the length of any speech signal to be considered ‘local’ (for comparison, 1 second of speech corresponds to roughly 9000 points).
To perform a more effective quantization, we need to do it locally.
We segment y into blocks of length 160, which is small enough compared to the length of any speech signal to be considered ‘local’ (for comparison, 1 second of speech corresponds to roughly 9000 points).
For each block, we determine the quantization interval as follows:
1. Determine mean of the block
2. Determine variance of the block
3. Threshold “outliers” that are outside the range Mean+-α*var where α allows the user to adjust which outliers to cut.
4. The quantization interval is the max –min divided by bit rate L=2^r.
5. The block is then quantized yq= round(y/q)*q
Below is a plot of the 50th block of y, with the mean and the range defined by mean and α*variance.
Figure 2. Mean, Mean+- α of 50th block of y
Below is the quantized result of the same block:
Figure 3. Quantized 50th block of y
The larger α is, the more of the original signal are preserved. The larger the bit rate r, the finer the quantization intervals are. |
The larger α is, the more of the original signal are preserved.
The larger the bit rate r, the finer the quantization intervals are.
The larger the bit rate r, the finer the quantization intervals are.
Therefore, choosing large α and r reduces distortion in the quantized signal.
In our case, taking α=3 and r=8 gave a reasonable audible result, and yielded an MSE between y and yq of 1.5743.
When the MSE approaches 4, the quantized signal is fairly hard to comprehend, and sounds snowy and muffled.
In our case, taking α=3 and r=8 gave a reasonable audible result, and yielded an MSE between y and yq of 1.5743.
When the MSE approaches 4, the quantized signal is fairly hard to comprehend, and sounds snowy and muffled.
One should keep in mind however that the MSE is a mathematical measure of the error between y and yq, and might not accurately reflect the human perception system.
The LPC model
The LPC model we use relies on the idea that an element y(i) can be approximated or predicted by a linear combination of the 10 previous y(i)’s. We see from our FIR equation that the sum can be rewritten in matrix format by using a toeplitz matrix A, with the appropriate coefficients y(i). Note that the negative index of y(i) are set to 0.
We want to test our model by finding the coefficients a(k) that minimize the least square problem. Hence we project y onto the space spanned by the columns of A and find the inverse to get x which contains the optimal coefficients a(k).
We want to test our model by finding the coefficients a(k) that minimize the least square problem. Hence we project y onto the space spanned by the columns of A and find the inverse to get x which contains the optimal coefficients a(k).
e = y – A*x ┴ R(A)
Because the space is real, the Adjoint A* is the transpose of A:
ATAx=ATy
Solving for x in matlab, we get the x which contains the optimal coefficients a(k).This is done locally for each block. Thus, Ax is the best approximation of y that is in the range of A; it’s the closest we can get to our original signal by linear combinations of previous y(i)’s.
Error
Observe that since Ax is lives in the space spanned by the columns of A, and y does not, we have an error vector e = y – A*x.
We can compute the error for each block, then concatenate to get a vector of the same length as y. We then verify that we get the same exact error vector e by implementing the FIR equation directly:
Indeed, the MSE between these two residuals vectors is negligible: 1.2730e-034. This is acts as a sanity check.
We can compute the error for each block, then concatenate to get a vector of the same length as y. We then verify that we get the same exact error vector e by implementing the FIR equation directly:
Indeed, the MSE between these two residuals vectors is negligible: 1.2730e-034. This is acts as a sanity check.
Since everything is in order, we can use the following equation:
in order to retrieve our original signal, since we have the optimal coefficients a(k) and the error e. Listening to the resulting y_reconstruct seems to confirm our reasoning: it sounds exactly like the original signal. The MSE between y and y_reconstruct is the final check: we get
MSE(y , y_reconstruct) = 5.3181e-037
Quantizing the residuals
Next, our goal is to reconstruct y this time using quantized error e. We see that the distortion introduced by quantization will prevent us from reconstructing the original y exactly.
We call the reconstructed signal y_hat.
Using
Using
where eq(n) is the quantized residuals.
We find that the quality of y_hat is perceptually very close to the original y. The MSE is found to be MSE3 = 0.0872 for α = 2 and r = 8. The speech is clear and audible.
The relationship between error and bit rate value r is plotted in the following figure:
Figure 4.a) MSE(y,yq) vs r. b) MSE(y, y_hat)
We see that the residual error decreases as r increases, which conforms to our earlier reasoning. Qualitatively, the reduced distortion as r increases is noticeable. The perceived quality of speech is appreciably better. When the number of quantization level is high, the speech signal is not forced to take certain values, and the effects of rounding off are less significant.
Figure 4. tells us that increasing r affects the MSE relatively weakly when r > 5. The MSEs remain relatively constant for any value of r >5.
Quantizing the residuals and the coefficients
Until now, only the residuals were quantized, with allowed for rather successful reconstruction of the original speech signal.
Now we want to also quantize the coefficients x = { a(k) }. Using the same process, we decide on a factor α = 2 and truncate the outliers. Given the order of process, quantizing x will yield a new error vector e, which needs to be quantized again. Then, using the AR filter described previously, we reconstruct a signal y_recon. We note that the error is much greater, because distortion are introduced at 2 stages of the process: when quantizing the coefficients x = {a(k)} and when quantizing the resulting e vector.
Figure 5. MSE(y,yq) & MSE(y,y_recon) vs r for α = 1, 1.5, 2.
The reason that motivated my choice of α = 2 is the following: in a speech recording, the large variations in amplitude are the most significant because they represent the formants that the human perception system can decode. Therefore, we want a quantization set by the large coefficients, which will indeed result in several small amplitudes to be rounded to 0. In that sense, it acts as a filter on the signal y. I believe this is what causes the low values of α to fail reconstructing an acceptable signal.
Decreasing the bit rate r yielded a segmented sound, similar to a voice heard through a fan. This is considerably reduced for r ≥ 4. This result is correlated by the graphs of MSE(y,y_recon) vs r.
It should be noted that α is particular to the context of this project: the choice of its value should differ if this process is to be applied to other types of signal. Speech has characteristic spikes which do not occur in other signals, hence that conditions our choice of α for cutting out the outliers.
I did not encounter any issues with AR filter stability. However, because AR filters are recursive, the feedback can generate instability.
The resulting reconstructed signal is definitely clear and audible, although it sounds different from the original file