# On Second Thought ...

As mentioned in the previous post, the commonly used baseline classifier, multinomial Naive Bayes, which rests on the bag-of-words assumption, has a corresponding graphical-model interpretation: a single root connected to children nodes. These children, the leaves of the graph, represent word features that are conditionally independent given the root, the sentiment label (Zhang H, 2004). The naivety of this assumption falls short of real-world applications. To step forward from this base model, edges between different features need to be added to the Naive Bayes graph. Besides relaxing the conditional-independence assumption, shifting from treating a word as a discrete count to a continuous vector is another way to get closer to a realistic model. However, given the monstrous dimensionality of the sparse vocabulary space in natural language, building such connections is intimidating work because of the computational overhead. Using a binary parse tree, as in Socher et al.'s work, is a clever way to limit the search space with a syntactic-order constraint (Socher et al., 2013). Using latent semantic analysis, such as truncated SVD, to remap word counts into a continuous vector space will be explored in following posts.

Here, I would like to know the information gain when the word features are grouped in a hierarchical structure based on a parse tree. Collective predictions of phrases are arranged as terminal or non-terminal nodes of a parse tree. Given a credible parse tree, children nodes are either terminal (leaf nodes, as in a unigram-based model) or non-terminal (internal nodes corresponding to selected n-gram phrases, with n greater than one). The sentiment label of the root node is the target prediction: the sentiment label of the full sentence. Every node of a parse tree is thus associated not only with an n-gram phrase but also with a sentiment label. To study the structural information, I first incorporate the distance from the root to each node as an additional feature (denoted as *level* throughout this post). Some descriptive statistics of the levels are plotted in the following figure:

To see how the recursive power relates to the levels, I will do the following:

- Use the collection of trees provided by the Stanford Sentiment Treebank and compute the level of each node.
- Group the sub-trees by the sentiment label of the root (the sentiment label of the full sentence).
- Calculate the corresponding information-theoretic quantities (conditional entropy, mutual information, KL divergence, etc.) based on Bayesian inference.
- Study how the levels correlate with the statistics calculated in step 3.
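The level computation in step 1 can be sketched in plain Python. This is a minimal, hypothetical parser for the SST-style bracketed format (e.g. `(3 (2 very) (2 good))`), not the treebank's official reader, and the example tree below is a toy:

```python
def parse_sst(s):
    """Parse one SST-style bracketed tree, e.g. "(3 (2 not) (4 good))",
    into a nested (label, children) structure; leaves hold raw tokens."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = int(tokens[pos]); pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(node())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()

def node_levels(tree, level=0, out=None):
    """Collect (level, sentiment label) for every labeled node; root is level 0."""
    if out is None:
        out = []
    label, children = tree
    out.append((level, label))
    for c in children:
        if isinstance(c, tuple):          # recurse only into labeled sub-phrases
            node_levels(c, level + 1, out)
    return out

tree = parse_sst("(3 (2 (2 very) (2 good)) (2 movie))")
levels = node_levels(tree)
# e.g. the root (label 3) sits at level 0; "very" and "good" phrases at level 2
```

Grouping these `(level, label)` pairs by the root's label then gives the per-root statistics used in the rest of the post.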

Different information-theoretic measures are used to see how the discriminative power varies as the level increases. To test different hypotheses, I employed several resampling strategies as null models, each encoding an assumption that may hold between root and children nodes. First, I used a parametric bootstrap (drawing labels with replacement) to generate random children labels in order to break the root-specific children distribution. If the conditional entropy of the random resample is no different from the observed one, I would conclude that the observed distribution agrees with the random resample.

For this resample, I assume the children labels are drawn from a multinomial distribution^{1}. To see how much a different children distribution could shift the information content when the resample fails to capture it, I experimented with a uniform distribution as an uninformative flat prior. I then computed the conditional entropy of the children labels given the root label and the level. The result is shown in the following graph:
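A minimal sketch of this null model, assuming the `(root, level, child)` label triples have already been extracted from the treebank. The label set 0-4 and the toy data are illustrative, not the real SST counts:

```python
import math
import random
from collections import Counter

def cond_entropy(triples):
    """Plug-in estimate of H(child | root, level) in bits,
    from a list of (root, level, child) label triples."""
    joint = Counter(triples)
    margin = Counter((r, l) for r, l, _ in triples)
    n = len(triples)
    h = 0.0
    for (r, l, c), cnt in joint.items():
        p_joint = cnt / n                 # p(root, level, child)
        p_cond = cnt / margin[(r, l)]     # p(child | root, level)
        h -= p_joint * math.log2(p_cond)
    return h

def uniform_resample(triples, labels=(0, 1, 2, 3, 4)):
    """Keep each record's (root, level) pair, redraw the child label
    uniformly -- the 'uninformative flat prior' null model."""
    return [(r, l, random.choice(labels)) for r, l, _ in triples]

# toy observed sample: children deterministically track the root label
obs = [(4, 1, 4), (4, 1, 4), (4, 2, 3),
       (0, 1, 0), (0, 1, 0), (0, 2, 1)] * 50
random.seed(0)
null = uniform_resample(obs)
# uniform redrawing should push H(child | root, level) toward log2(5)
```

Comparing `cond_entropy(obs)` with `cond_entropy(null)` is the per-level contrast the figure plots, here collapsed over levels for brevity.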

The topmost panel in the figure above shows the divergence comparing the observed sample to the uniform resamples. The curve is consistent with the histogram of levels. The divergence computed by comparing the joint distributions of root and children given the level is shown as a red dashed line. By the chain rule, this joint divergence splits into two terms: the divergence of the **conditional probability of root given level (light green, labeled as root)** and that of the **conditional probability of children given root and level (light blue, labeled as children|root)**. In this partition, the divergence comes mainly from the distribution of children given root and level, which can be further verified in the second panel, the plot of average conditional entropy per level. The contribution from root given level is nominal, since the root labels are not perturbed. The large divergence can be explained by the uniform redistribution of the children labels: the root preference present in the observed sample is eliminated in the uniform children distribution, which introduces higher uncertainty, and higher entropy, in distinguishing the root labels.
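For reference, with $p$ the observed distribution, $q$ the resample, $r$ and $c$ the root and children labels, and everything conditioned on a level $\ell$, the chain-rule partition used above reads:

```latex
D_{\mathrm{KL}}\big(p(r, c \mid \ell) \,\|\, q(r, c \mid \ell)\big)
  = D_{\mathrm{KL}}\big(p(r \mid \ell) \,\|\, q(r \mid \ell)\big)
  + \mathbb{E}_{r \sim p(r \mid \ell)}\Big[
      D_{\mathrm{KL}}\big(p(c \mid r, \ell) \,\|\, q(c \mid r, \ell)\big)
    \Big]
```

The symmetric partition, conditioning on the children first, simply swaps the roles of $r$ and $c$.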

More interestingly, the divergence of the joint distribution can also be partitioned from the children's side. This quantity then divides into two terms: the divergence of the **conditional probability of children given level (dark turquoise, labeled as children)** and that of the **conditional probability of root given children and level (dark green, labeled as root|children)**. The divergence contrasting the conditional distribution of root given children per level starts from a very small quantity and eventually diminishes around levels 9-11. This can be further verified in the plot of the average conditional entropy of root given children and level, shown in the bottom-left panel. Meanwhile, the sub-term contrasting the distributions of children per level contributes most of the overall joint divergence, because the uniform assumption does not hold for the children label distribution per level, as you can see by comparing the % sentiment of children per level in the bottom-right panel with the observed sample in the first figure. Unfortunately, the conditional probability of root given children, which is the more relevant quantity for prediction, becomes indistinguishable from random as the level grows. A plausible explanation is the prevalence of neutral labels in the children distribution. At smaller levels, neutral labels are restricted to neutral roots; at greater levels, neutral labels are equally present in all sentiment classes, since nodes at greater levels are mostly leaves and shorter n-grams.

The next thing I did was to permute the root labels across levels while holding the children distribution fixed. This way, I can retain the tree structures without reconstructing the parse trees (in the sense that there is no need to recompute the tree levels)^{2}. If there is a root-label-specific effect on the levels, then the random curve generated by the permutation should behave quite differently from the observed one. The result is shown below:
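The permutation null model is cheap precisely because no tree is rebuilt. A sketch, again assuming pre-extracted `(root, level, child)` triples (the toy data is illustrative):

```python
import random

def permute_roots(triples, seed=None):
    """Shuffle the root labels across all (root, level, child) records,
    leaving each record's (level, child) pair untouched -- the tree
    structure and the children distribution stay exactly as observed."""
    rng = random.Random(seed)
    roots = [r for r, _, _ in triples]
    rng.shuffle(roots)
    return [(r, l, c) for r, (_, l, c) in zip(roots, triples)]

obs = [(4, 1, 4), (4, 1, 4), (0, 2, 0), (2, 3, 1)]
perm = permute_roots(obs, seed=1)
# the multiset of root labels and the (level, child) sequence are preserved;
# only the root-to-children pairing is broken
```

Only the pairing between roots and their children is randomized, which is exactly the "root-label-specific effect" the permutation is meant to destroy.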

As shown in Figure 2, the topmost panel shows the divergence comparing the joint probability of the observed sample with the permutation resample. Nevertheless, if you plot the information-theoretic terms for the observed sample and the resample generated by permuting within levels individually, the trends are almost identical. Though you might expect that both resampling the children nodes and permuting the root labels would break the correlation between root and children labels, only resampling the children shows a trend across levels that differs from the observed samples. Surprisingly, redistributing the children labels has the greater effect on root-label inference. The relation is not symmetric, even though the root labels are uniformly redistributed within each level (bottom-right panel). At this point, you might suspect that the sentiment labels of the children are more tied to the tree structure. But why? Examining the correlation assumption alone is insufficient to tell.

The root preference existing in the observed sample still persists in the permutation resample; it just no longer favors the same label, but some other root label. In the observed sample, the root preference shows up in the root-children matrix as the largest values along the main diagonal, reflecting the preference of roots for children of the same label. Shuffling the root labels breaks this trend: the root-children matrix no longer concentrates its best matches on the main diagonal but scatters them among off-diagonal entries (as shown in Figure 4, the diagonal entries of the observed sample shift off the diagonal in the permutation resample, while the uniform children resample barely shows any preference). As a consequence, the correlation becomes random rather than consistently positive, and on average it contributes no additional entropy term. Even though permuting the roots does not break the correlation the way resampling the children labels does, we can still see a deviation between the observed sample and the permutation resample, one that decreases as the level increases, when comparing them via the divergence.
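The root-children matrix discussed above is just a label co-occurrence table; a sketch with toy `(root, child)` pairs (the counts are illustrative, not SST's):

```python
def root_child_matrix(pairs, k=5):
    """k x k count matrix M[root][child] of (root, child) label co-occurrences."""
    m = [[0] * k for _ in range(k)]
    for r, c in pairs:
        m[r][c] += 1
    return m

# toy observed sample: children mostly share their root's label
obs = [(4, 4), (4, 4), (4, 3), (0, 0), (0, 0), (0, 1), (2, 2)]
m = root_child_matrix(obs)
diag = sum(m[i][i] for i in range(5))   # same-label mass
total = sum(map(sum, m))
# in the observed sample most mass sits on the diagonal; permuting the
# root labels scatters that mass into off-diagonal entries
```

Running `root_child_matrix` on the permuted triples instead of the observed ones reproduces the off-diagonal scattering described above.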

To examine the independence between the children and root label distributions, I generated another resample, based not on uniform priors but on empirical priors estimated from the observed children labels. The resample was conditioned only on the root label, not on the levels, in order to break the level-dependency assumption (the percentage of children labels per level is shown in the lower-right panel). In the figure below, the mutual information across levels is first plotted in the top panel. The expected trend would be a curve resembling a Poisson distribution with a large mean: a peak centered at the smaller levels, yet large enough to provide discerning power against random neutral phrases, with a long tail for the diminishing probability as the levels grow. The resample based on the level-independence assumption shows this expected trend (blue dashed line, top panel), while the observed sample does not (red dashed line, top panel). In fact, the observed sample shows high dependency (high mutual information) at the smaller levels, where the neutral sentiment class behaves quite differently from the rest of the classes, as you can see by comparing the mutual information per sentiment class for the observed sample (middle left, top) and the resample (middle left, bottom). One thing worth noticing: even though the mutual information looks quite different in the observed sample and the resample, the divergence between their joint distributions is quite small (bottom), though its overall trend is consistent with the mutual information plot (top). This might imply that comparing mutual information has more discerning power than comparing the joint probabilities of the observed data and the null model.
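The per-level mutual information in the top panel can be sketched with a plug-in estimator over discrete labels. The toy triples below are illustrative: level 1 has perfectly correlated root and child labels, level 2 independent ones:

```python
import math
from collections import Counter

def mutual_info(pairs):
    """Plug-in estimate of I(root; child) in bits from (root, child) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    pr = Counter(r for r, _ in pairs)
    pc = Counter(c for _, c in pairs)
    # I = sum_{r,c} p(r,c) * log2( p(r,c) / (p(r) p(c)) )
    return sum((cnt / n) * math.log2(cnt * n / (pr[r] * pc[c]))
               for (r, c), cnt in joint.items())

def mi_per_level(triples):
    """I(root; child) computed separately within each level."""
    by_level = {}
    for r, l, c in triples:
        by_level.setdefault(l, []).append((r, c))
    return {l: mutual_info(p) for l, p in sorted(by_level.items())}

triples = ([(0, 1, 0), (1, 1, 1)] * 10                      # level 1: correlated
           + [(0, 2, 0), (0, 2, 1), (1, 2, 0), (1, 2, 1)] * 5)  # level 2: independent
mi = mi_per_level(triples)
# mi[1] is 1 bit (labels fully determined), mi[2] is 0 (independent)
```

Plotting `mi` against the level is the curve the top panel contrasts between the observed sample and the level-independent resample.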

As you can see from the analysis in this post, how the information content co-varies with the levels should be considered as well. The level information adds discriminative power that cannot be achieved by counting the sentiment labels of phrases alone. This power decreases when the level is greater than 5 and diminishes around 17. In other words, we need a stronger factor than level to correlate the children sentiment-label distribution with the parse tree. Moreover, as the conditional entropy suggests, tree level plus children sentiment labels are not good features for predicting the root label. In fact, this study is incomplete in many ways, including the lack of an exhaustive investigation of the correlation between sentence size and level (obviously they are correlated, but how?). Also, how should we model the distribution of sentiment values: real-valued or discrete, dependent or independent? How do the labels of individual phrases constitute the overall sentiment value as they build up the whole sentence? When does adding more phrases stop helping to predict the overall sentiment? What cutoff should we choose for different applications when conducting sentiment analysis? Could throwing in more data and employing more sophisticated machine learning algorithms untangle these questions? Again, we might enlist the popular deep learning for help.

*More later!*

^{1} A multinomial distribution might not be suitable for modeling the distribution of children labels, since children can contain other children nodes in the recursive sense. However, as Naive Bayes (bag of words) has demonstrated its simplicity and pragmatic applicability, it is a feasible, though not ideal, model.

^{2} This is not quite the same as the bootstrap techniques employed in validating phylogenetic tree construction: there, the interest is in obtaining a statistically significant tree by shuffling sequences (usually DNA sequences) and reconstructing, whereas reconstructing parse trees is expensive and might yield a bunch of syntactically infeasible trees, thereby inflating the power. A simpler workaround is to just permute the root labels. FYI, you could try to permute both across and within the levels, as I secretly did, but you would only get one message from those analyses: **"Please don't mess around with the tree structure."**
