The First Teeny Tiny Steps on Sentiment Analysis

Having spent (and struggled) a great deal of time studying movie review data and exploring the field of Natural Language Processing (NLP), I encountered some typical machine learning problems: imbalanced training data and high-dimensional, sparse features. In this post I will use the classifier most widely applied in text classification to address the imbalanced training problem, and leave feature selection/reduction/transformation to following posts.

For text classification, the most frequently used classifier is Naive Bayes, which applies Bayesian inference to maximize the posterior probability. However, since the priors are often treated as uniform and can be discarded as a constant during estimation, the maximum a posteriori (MAP) problem simplifies to a maximum likelihood problem: how likely is it that the observed data were generated by the assumed model? (The converse view is a log-likelihood ratio test, which asks how strongly a model is supported by the acquired samples.)
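To make this concrete, here is the usual one-line derivation (my own notation, not taken from any particular paper): for a document $d$ and candidate class $c$,

$$\hat{c} = \arg\max_{c} P(c \mid d) = \arg\max_{c} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c} P(d \mid c)\,P(c),$$

and once the prior $P(c)$ is taken as uniform, this reduces to $\hat{c} = \arg\max_{c} P(d \mid c)$, i.e. picking the class under which the observed document is most likely.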

Naive Bayes is favored in the text classification community for the following reasons:

  1. Simplicity: Parameters are estimated directly from the training samples. Thanks to the independence assumption among features (addressed below), only a small portion of samples is required for training. For example, Multinomial Naive Bayes in text classification estimates event probabilities from word counts. In practice, a pseudo-count $\alpha$ is often added to smooth under-sampled bins with zero counts. Generally, the add-one (Laplace) smoothing scheme works well for scarce samples, while a smaller $\alpha$ tends toward over-fitting. (A small code sketch follows this list.)
  2. Efficiency: Thanks to its simplicity, it is easy and fast to train.
  3. Accuracy: It works well for most practical applications such as spam classification, and some studies show it can out-perform more sophisticated training algorithms (Rennie, 2003).
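As a minimal sketch of points 1 and 2 (a toy example of my own, not the pipeline behind the figures below), scikit-learn's MultinomialNB exposes the pseudo-count directly as its alpha parameter:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny toy corpus; the real data are the Rotten Tomatoes phrases discussed below.
texts = ["a touching and funny film", "a dull and lifeless film"]
labels = [4, 0]  # positive, negative on the 0-4 scale

# Bag-of-words counts feed the multinomial event model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# alpha=1.0 is the add-one (Laplace) smoother; a smaller alpha smooths less
# and risks over-fitting when samples are scarce.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["a funny film"])))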

However, Naive Bayes also pays a price for its naivety. It makes an assumption that is unrealistic for most real-life data but still, almost magically, works well: no dependencies among features are modeled once we condition on the class label. In text classification, this means all words are treated as independent of each other, forming the theoretically unlikely but practically workable Bag of Words model. The magic behind this, as pointed out in (Zhang, 2004), is that opposing dependencies can cancel out each other's influence, so the features behave as if they were independent when summed up as a whole.
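In symbols (again my own sketch), the conditional independence assumption lets the document likelihood factor over its words:

$$P(d \mid c) = P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c),$$

so each $P(w_i \mid c)$ can be estimated from simple smoothed per-class word counts, which is exactly what keeps training cheap.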

The Rotten Tomatoes movie review data downloaded from the Kaggle website were already broken down into fine-grained phrases by the Stanford parser and labeled on a 5-level sentiment scale. The sentiment scale, ranging from 0 to 4, represents negative, somewhat negative, neutral, somewhat positive, and positive. There are in total 103,606 phrases from 8,544 full sentences. The lengths of the full sentences vary, and as a result the number of phrases per full sentence also differs considerably. Each phrase has a sentiment score assigned to it, but most phrases are too short to carry a meaningful sentiment when taken out of context, and that might be the source of the imbalanced training problem. A quick look at the sentiment class distribution in the training data shows that most phrases are neutral (sentiment class 2), followed by the two mildly negative and mildly positive classes. Negative (sentiment class 0) and positive (sentiment class 4) together constitute only about 10% of the training data, in contrast to the largest, neutral class, which makes up close to 50%. The training examples are clearly imbalanced.
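For reference, the class distribution can be tallied in a few lines of pandas; the file and column names below follow the Kaggle competition's train.tsv layout (PhraseId, SentenceId, Phrase, Sentiment), so adjust them if your copy differs:

import pandas as pd

# The Kaggle training file is a tab-separated table of phrases.
train = pd.read_csv('train.tsv', sep='\t')

# Absolute counts and proportions of each sentiment class (0-4).
print(train['Sentiment'].value_counts().sort_index())
print(train['Sentiment'].value_counts(normalize=True).sort_index())

# Number of distinct full sentences.
print(train['SentenceId'].nunique())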

The following figure shows the sentiment class distribution under two cross-validation strategies provided by the scikit-learn package: KFold (kfold in the figure) and StratifiedKFold (skfold in the figure). KFold is a regular cross-validator which randomly splits the training sample without considering its class distribution. StratifiedKFold employs a stratified sampling strategy which ensures the class distribution in each fold stays close to that of the whole sample. Regular KFold ignores the fact that small classes might not be selected at all in certain folds when the data are divided arbitrarily, while StratifiedKFold mitigates this problem by sampling from each class. As demonstrated in the following simulation, the problem for regular KFold becomes more pronounced when the number of folds, k, gets larger or the sampling in KFold happens to be biased. The steps of the toy simulation are:

  1. Draw 1,000 samples from a Binomial distribution by simulating a biased coin with probability 0.98 for heads and 0.02 for tails.
  2. Run KFold and StratifiedKFold on the samples and compare the class distribution across folds (the snippet below uses 10 folds).

In our movie review training example, however, there is little difference between the two cross-validation strategies, so changing the cross-validation strategy alone is unlikely to improve the overall prediction accuracy much during training.

import numpy as np
from numpy.random import binomial
from sklearn.cross_validation import KFold, StratifiedKFold  # pre-0.18 scikit-learn API

n, p, m = 1000, 0.98, 1   # n trials, success probability p, and m simulation runs
N = binomial(n, p, m)     # number of "heads" in each simulation run
y = np.zeros((m, n), dtype=int)
num_cv = 10
for i in range(m):
    y[i, 0:N[i]] = 1      # label the first N[i] samples as the majority class
    toykfolder = KFold(n, n_folds=num_cv, shuffle=True)
    toyskfolder = StratifiedKFold(y[i], n_folds=num_cv, shuffle=True)
    # prepare_data and cv_strategy are defined elsewhere in the original notebook;
    # prepare_data tallies the class distribution of each fold for both strategies.
    sent_dist = prepare_data(y[i], toykfolder, toyskfolder, cv_strategy, num_cv)
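prepare_data itself is not shown here; as a rough sketch of what such a helper could do (my own guess, not the author's implementation), it only needs to tally the class make-up of every fold:

# A hypothetical stand-in for prepare_data: for each fold of a given
# cross-validator, record the fraction of majority-class (label 1) samples.
def fold_class_fractions(labels, folder):
    fractions = []
    for _, test_idx in folder:  # pre-0.18 folders yield (train, test) index pairs
        fractions.append(labels[test_idx].mean())
    return fractions

print(fold_class_fractions(y[0], toykfolder))   # varies noticeably from fold to fold
print(fold_class_fractions(y[0], toyskfolder))  # stays close to 0.98 in every fold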

But would it be a good idea to randomly split phrases without considering that they are originally parts of full sentences? Let's examine the sentiment class distribution conditioned on each full sentence:

The figure below shows the pooled distribution of phrase sentiment labels given the sentiment label of the full sentence. Unsurprisingly, the dominating phrase label is the one identical to the label of the full sentence, and the degree of this dominance varies as the sentiment label of the full sentence moves toward the positive or negative extreme. A Pearson $\chi^2$ test confirms that this grouping is significantly non-random. This is what we would expect to see; otherwise we would have difficulty deducing the sentiment of a full sentence from the phrases composing it. But how well can the composition of phrase sentiment labels alone predict the sentiment label of the full sentence?

	Pearson's Chi-squared test

data:  freq_tbl
X-squared = 3813.2, df = 16, p-value < 2.2e-16
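The test above was run in R on the author's freq_tbl contingency table. An equivalent check in Python, reusing the train DataFrame from the earlier snippet and assuming the full sentence is the longest phrase within each SentenceId, could look like this:

import pandas as pd
from scipy.stats import chi2_contingency

# Sentence-level label: taken here as the label of the longest phrase per sentence.
longest_idx = train.groupby('SentenceId')['Phrase'].apply(lambda s: s.str.len().idxmax())
sentence_label = train.loc[longest_idx].set_index('SentenceId')['Sentiment']

# Contingency table: phrase label (rows) vs. label of its full sentence (columns).
freq_tbl = pd.crosstab(train['Sentiment'], train['SentenceId'].map(sentence_label))

chi2, p_value, dof, _ = chi2_contingency(freq_tbl)
print(chi2, dof, p_value)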

The figure above is optimistic in a naive way: the individual phrase label distributions are pooled within each sentence sentiment class, so it only shows the overall tendency given the sentiment label of the full sentence. However, each sentence has its own sentiment distribution over its constituent phrases. If we plotted the phrase sentiment distribution for every sentence separately, we would see many widely spread curves centering around the location of each bar. I tried to nail down the relation between the sentiment label of a full sentence and the labels of its member phrases in the following figure.

You can treat the figure below as a matrix: the columns are the sentiment labels of the full sentences and the rows are the sentiment labels of the phrases. Each entry (i, j) shows the histogram of the proportion of phrases with sentiment label i within a sentence, given that the full sentence has sentiment label j. The proportion, ranging from 0 to 1, is on the x-axis of each entry, and the density is on the y-axis. If you sum along the rows (axis 0 in numpy lingo), you get the figure above. Indeed, except for the neutral sentiment class (numbered 2), the sentiment labels show trends coinciding with the figure above: looking along the main diagonal from upper left to lower right, you can see a small peak at the high-proportion end. This corroborates the earlier finding.
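A rough sketch of how such a matrix of histograms could be produced (my own reconstruction, reusing train and sentence_label from the snippets above, not the author's plotting code):

import matplotlib.pyplot as plt

# Per-sentence proportion of each phrase label, keeping zero proportions.
props = train.groupby(['SentenceId', 'Sentiment']).size().unstack(fill_value=0)
props = props.div(props.sum(axis=1), axis=0)
props['sentence_label'] = sentence_label

fig, axes = plt.subplots(5, 5, figsize=(12, 12), sharex=True)
for i in range(5):        # row i: phrase sentiment label
    for j in range(5):    # column j: sentence sentiment label
        data = props.loc[props['sentence_label'] == j, i]
        axes[i, j].hist(data, bins=20, range=(0, 1), density=True)
plt.show()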

Unfortunately, there is also a prominent peak at the low-proportion end. This peak likely explains why, in the pooled figure above, the counts for sentiment labels other than that of the full sentence are not far behind. One can also see transitions from one sentiment label to its close neighbor, such as negative to somewhat negative or somewhat positive to positive. This implies that predicting the sentiment label of a full sentence by a majority vote over its phrases might help in some cases, but not in most of them.

This figure offers an alternative view of figure 2 in Socher et al.'s study (Socher, 2013). In their figure, they showed the distribution of sentiment labels conditioned on N-gram length instead of on the sentiment label of the full sentence. You could draw one line (the upper yellow dashed one) across the high-percentage region and another across the low-percentage region, corresponding to the two peaks in the figure above, and realize that the prominent low-end peak might mainly consist of shorter phrases, i.e. smaller N, where the neutral sentiment label dominates (indeed, the prominent peak for the neutral label is not located at a low proportion but somewhere in the middle, close to 0.5). This observation runs against the intuition in Pang & Lee's paper (Pang & Lee, 2005), which proposed using regression to capture the continuity of sentiment ratings. However, roughly counting the positive-sentence percentage as they proposed does not account for the effect of transitional phrases, such as high-level negation and contrastive conjunctions, which are captured by Socher et al.'s parse-tree method. Some non-deterministic factor related to the somewhat random assignment of neutral sentiment should be accounted for as well.

Another minor point is that imbalanced data will no longer be a problem for training at larger N, since the labels there are distributed almost uniformly. In the spirit of big data, and for the sake of keeping every potential training example, a more sophisticated training scheme might be needed. Would the popular deep learning lend a helpful hand?

More Later!
