As previously noted, the information content available for classifying the root sentiment label from the composition of its children's labels decreases with increasing level in the parsing tree (the depth from the root to the child node). However, in order to qualify the importance of the tree features (not just quantify it), a decision tree will be used to measure how important the level and label of children nodes are, as joint features, for classifying the root sentiment label. First, I assume that the syntactic structure of each sentence is well captured by the parsing tree of highest statistical significance, so it suffices to use the most significant one for this study. Under this assumption, the uncertainty of children labels due to parsing-tree construction can be largely excluded, and the true labels assigned through Amazon Mechanical Turk crowd work can be used. This yields an empirical upper bound on the root sentiment classification error based on the joint level-and-label features. In subsequent work, I will replace the children labels with random predictions to obtain a lower bound, showing how uncertainty in the children label predictions enters the classification framework and decreases accuracy.
Why a decision tree? Decision trees are known for good interpretability and quick construction with minimal preprocessing effort. For the current task, they have the additional advantage of employing entropy as the criterion for the node-splitting rule. To examine the relation between sentiment labels and their associated levels, I use the co-occurrence count of each (level, label) pair over all children nodes to form the joint features for each training example. The distribution of sentiment labels and levels can be seen in previous posts.
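The joint-feature construction can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the label names, the maximal level, and the toy example are assumptions; each training example is taken to be a list of (level, label) pairs for the non-root nodes of one parsing tree.

```python
import numpy as np

# Assumed label set and level cap for illustration only.
LABELS = ["neg-", "neg", "neutral", "pos", "pos+"]
MAX_LEVEL = 10

def joint_features(nodes):
    """Co-occurrence counts of (level, label) pairs, flattened to one row."""
    counts = np.zeros((MAX_LEVEL, len(LABELS)))
    for level, label in nodes:
        # Levels start at 1 for children of the root; cap at MAX_LEVEL.
        counts[min(level, MAX_LEVEL) - 1, LABELS.index(label)] += 1
    return counts.ravel()

# A toy sentence with three non-root nodes at levels 1-2.
x = joint_features([(1, "pos"), (2, "pos+"), (2, "neutral")])
```

Each row of the resulting design matrix is then one sentence, with one column per (level, label) combination.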
The first image is the partial decision tree constructed for the root sentiment classification task. From the truncated tree image below, one can roughly see that the positive sentiment classes (pos+ and pos) are on the right with a nested structure at higher levels, while the negative sentiment classes (neg- and neg) are on the left, also with a nested structure. The topmost node is neutral-dominated, which is reasonable since neutral sentiment may carry some fuzziness in human perception, lacking a precise instrumental measurement. This tree is subject to change due to the greedy nature of decision tree construction. Therefore, ensemble-based methods will be used to build a more stable model and to study feature importance.
Feature importances are taken from two ensemble strategies: resampling and boosting. The resampling (random-forest-style) ensemble strategy provides an alternative way to tackle overfitting: it avoids overfitting by randomly leaving out some features or training examples. The resampling-based classifier is akin to what we did in a previous post, but with a more rigorous algorithm implementation and a standard performance measurement (zero-one loss for classification). Here, I use the ExtraTreesClassifier implemented in scikit-learn. ExtraTreesClassifier is an extremely randomized tree ensemble method: not only are the features chosen randomly, but so are the thresholds used to split each tree. As a result, training is not tightly constrained to the training examples at hand, and the result generalizes more easily. You can find more details about ExtraTreesClassifier on the scikit-learn website.
While the resampling-based strategy provides a population view through bootstrap estimation and takes stochasticity into consideration, it is also worth studying the boosting-based strategy, which, unlike resampling, is a deterministic approach that fits the training examples in multiple stages with increasing weights. GradientBoostingClassifier is a good representative of this kind of classifier. It uses gradient-based optimization without requiring further fiddling with feature scales. Again, the scikit-learn implementation is used.
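Fitting both ensembles and reading out their feature importances can be sketched like this. The data here are synthetic stand-ins for the (level, label) count features (the shapes, random labels, and estimator counts are assumptions, not the actual treebank experiment):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier

# Synthetic stand-in: 300 sentences, 10 levels x 5 labels = 50 count features.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(300, 50))
y = rng.randint(0, 5, size=300)   # five root sentiment classes

extra = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
gbc = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Both expose per-feature importances normalized to sum to 1.
imp_extra = extra.feature_importances_
imp_gbc = gbc.feature_importances_
```

Reshaping either importance vector back to the (level, label) grid gives exactly the circle plot discussed below.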
The feature importance is shown below along with the marginal distributions of both levels and labels at the children nodes. Both are grouped and colored by their root sentiment label. The value of feature importance is proportional to the area of the circle. One can see from the feature importance plot that the more important features are close to the root, and most of them are at levels smaller than 5. This is especially prominent for the feature importance computed by the boosting-based classifier. The feature importance derived from the gradient-boosted ensemble learner is akin to what a single decision tree would produce. This is expected, because the boosting process is constrained to the initial classification result and makes progress by learning the remaining difficult cases at each subsequent stage (fitting the residuals). The boosting-based learner spends most of its effort on the lower levels for each class. This might indicate a mixture distribution formed by varying tree sizes, which creates additional noise for the boosting-based learner to fit all at once. In contrast, the resampling-based learner distributes importance across a wider range of levels. The reason might be that random resampling encourages exploring other parts of the solution space, instead of fixing the optimizer within the neighborhood of a local optimum.
Another possible insight from the importance plot is that merging greater levels into a smaller range might not be helpful. The cutoff for the maximal level to retain should be chosen to strike a balance: discriminative power should not degrade from pooling greater levels, while model complexity is reduced enough to avoid overfitting.
Furthermore, we compare the resampling-based ExtraTreesClassifier and the boosting-based GradientBoostingClassifier in a learning-curve setting. The result is shown in the figure below. In addition to the common learning-curve practice of cross-validating the data (the first row of the figure), we also want to know how a classifier's performance changes when the data are presented in a specific order (the second row of the figure).
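The cross-validated learning curve (first row) follows the standard scikit-learn recipe. The sketch below uses synthetic stand-in data; the training-size grid and fold count are assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the joint (level, label) count features.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(400, 50))
y = rng.randint(0, 5, size=400)

sizes, train_scores, valid_scores = learning_curve(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.2, 1.0, 4), cv=3)

train_mean = train_scores.mean(axis=1)   # averaged over the 3 folds
valid_mean = valid_scores.mean(axis=1)
```

Plotting `train_mean` and `valid_mean` against `sizes` gives one pair of curves in the first row of the figure.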
First, we can see that the resampling-based classifier (green) outperforms the single decision tree (blue) without being crippled by the high-dimensional, sparse features. It also exhibits the expected robustness to overfitting at larger training sizes. The resampling-based classifier shows smaller fluctuations than the greedy tree classifier, because the stochasticity resulting from locally optimal splits is smoothed out by the large number of estimators used (50). One thing worth noticing is that the learning curves plotted against training size here are not well-behaved, i.e., not non-decreasing, as one would expect from a good learning algorithm. Many factors can produce an ill-behaved learning curve, especially for algorithms that rely on greedy solutions or random resampling. Even so, one can still see the overall trend of accuracy increasing with larger training size, except for the boosting-based classifier. The boosting-based classifier overfits across all training sizes, although the overfitting is gradually alleviated as the number of training examples increases. Both the boosting- and resampling-based classifiers have similar performance in the cross-validation experiment, with boosting slightly better.
Second, we examine how the classifiers perform when the training examples are presented in a specific order. The order is created based on tree size: trees of greater size are added into the training set as the number of training examples increments. No additional random data partition is conducted, and classifiers built on different training sizes are tested against the same validation set. By doing so, the randomness of data partitioning is eliminated from the curve construction, and any deviation from a well-behaved learning curve is mainly attributable to the algorithm itself. To further smooth out the randomness in the algorithms, ten similar runs are performed with different seeds to obtain the average and standard deviation. Note that the scikit-learn implementations of GradientBoostingClassifier and ExtraTreesClassifier have no incremental fitting (no partial_fit method), and previous training results are not carried over into training at greater sizes (warm_start=False). Further, no subsampling is specified in GradientBoostingClassifier, so full-batch gradient descent is used.
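The presorted-order experiment amounts to a manual learning curve: sort once, grow the training prefix, refit from scratch each time, and always score against the same held-out set. A minimal sketch, with a stand-in proxy for tree size and assumed prefix sizes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(300, 50))
y = rng.randint(0, 5, size=300)
tree_size = X.sum(axis=1)                 # stand-in proxy for parsing-tree size

order = np.argsort(tree_size[:200])       # smaller trees enter first
X_train, y_train = X[:200][order], y[:200][order]
X_valid, y_valid = X[200:], y[200:]       # one fixed validation set

scores = []
for n in (50, 100, 150, 200):             # growing training prefixes
    clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
    clf.fit(X_train[:n], y_train[:n])     # warm_start=False: refit from scratch
    scores.append(clf.score(X_valid, y_valid))
```

Repeating this loop over several seeds and averaging `scores` produces the mean and standard-deviation bands in the second row of the figure.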
The overall trend of the three classifiers in the second experiment is similar to the random-partition experiment, but with greater standard deviation. Though not well-behaved, the training and validation curves of the decision tree climb upward with a few ups and downs. Comparing the random-partition setting with the presorted setting, one finds that the boosting-based classifier is affected more than the resampling-based one in both plots. First, overfitting is slightly improved when training in presorted order. In that order, the boosting-based classifier achieves better validation accuracy with a smaller training size (a roughly estimated ~11% average gap between training and validation accuracy at 2200+ training examples in the presorted setting, versus 6190 without presorting). This is almost as good as the resampling-based classifier, and it might further imply the possibility of finding a good descent direction early in training if the classifier is trained on a certain curriculum. Another interesting observation is that the boosting-based classifier shows a valley in both the training curve (shallow and flat) and the validation curve (deep and sharp) around 1000-2000 training examples, while the resampling-based one does not. This might indicate a discontinuity in the loss surface, likely due to specific features missing from this batch of training examples.
Finally, we return to the original problem to see whether the information presented here can be used in an actual prediction task. In the following experiment, a stacking-like two-tier classifier is constructed. The top tier still uses the ensemble-based classifiers (ExtraTreesClassifier and GradientBoostingClassifier) as in the previous experimental settings. The bottom tier consists of multiple dummy classifiers that do not "learn" any rule from word features but make random guesses based on a given prior. The dummy classifier is adapted from scikit-learn's DummyClassifier with the "stratified" strategy. Contrary to the original scikit-learn implementation, the non-root label priors are not estimated from the training examples but depend solely on prespecified priors. By introducing dummy classifiers at the bottom tier, we can examine how the two top-tier classifiers handle imperfect predictions, compared with the perfect labels used in the experiments above. A dummy classifier is constructed for each level and used to predict the sentiment labels of non-root nodes at that level. We also set a maximal level for constructing the non-root label predictors, since a finer-grained division might not have sufficient training examples. However, only the number of predictors is constrained by the maximal level; the full range of levels is still used as joint features with the predicted non-root sentiment labels, as it might give additional power to predict the root sentiment. Here we use 8 as the maximal level for the label predictors (an arbitrary choice).
To create some worst-case scenarios, six prespecified priors are given to the dummy classifiers: "uniform", "neg- dominating", "neg dominating", "neutral dominating", "pos dominating", and "pos+ dominating". The dominating class has a belief as high as 0.8 in the prespecified prior distribution. These prior distributions are all arbitrary choices, included to demonstrate how the boosting- and resampling-based classifiers adjust their feature importances to optimize the root sentiment classification.
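Since scikit-learn's "stratified" DummyClassifier estimates the class prior from the training labels, the variant used here, which samples from a prespecified prior instead, is easiest to show as a small re-implementation. The 0.8 dominating belief follows the text; the remaining mass split and the label names are assumptions:

```python
import numpy as np

LABELS = ["neg-", "neg", "neutral", "pos", "pos+"]

def make_prior(dominating=None):
    """Prespecified prior: uniform, or 0.8 on one class and the rest split evenly."""
    if dominating is None:                        # the "uniform" setting
        return np.full(len(LABELS), 1 / len(LABELS))
    p = np.full(len(LABELS), 0.2 / (len(LABELS) - 1))
    p[LABELS.index(dominating)] = 0.8
    return p

def dummy_predict(prior, n, seed=0):
    """Random label guesses drawn from the given prior, ignoring all features."""
    rng = np.random.RandomState(seed)
    return rng.choice(LABELS, size=n, p=prior)

preds = dummy_predict(make_prior("pos+"), n=1000)
```

One such predictor is built per level (up to the maximal level of 8), each sampling independently from the chosen prior.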
As you can see in the following plot, the accuracy of non-root sentiment label prediction is plotted against level for each prior setting. In each setting, the boosting-based result is plotted the normal way while the resampling-based one is plotted upside-down. The target accuracies (root sentiment, or accuracy at level 0) are all close: around 20%, near the 1/5 random-guess chance per class, and 28% for boosting. Such accuracies show how robust both top-tier classifiers are to the performance of the bottom-tier label predictors, especially since most predictors are intentionally biased toward incorrect labels (by being given a wrong belief). To further examine how the two classifiers handle noisy predictions, the feature importance across all levels is plotted as a stacked bar chart for each setting. The contribution of predicted sentiment at each non-root level appears within each bar in a different color. For levels greater than the maximal number of predictors, the feature importance is summed with that of the maximal level allowed for constructing predictors. As a result, the high importance of the last level is actually a collective quantity from all levels beyond the maximal predictor level.
The two classifiers show similar trends for all six prior settings, except that the boosting-based classifier has higher accuracy than the resampling-based one. This is an innate ability of boosting, which is designed to build strong performance from a few weak learners, and it does so here, reaching ~28% accuracy that the resampling-based classifier pales in comparison to. The performance across the individual settings can be divided into two categories. In one, the label predictors consistently outperform the root sentiment predictor, as in the neutral-dominating setting. In the other, all the label predictors perform worse than or no better than the target root sentiment predictor; the remaining five prior settings fall into this category. Since the trends in these two categories are inverted, predicting neutral sentiment correctly is not as useful as giving overall low prediction accuracy across all levels, as in the neg- dominating setting (though not for pos+).
For the feature importance, both classifiers show importance shifting from the smaller levels to the middle levels when predicted labels are used. This is especially true for the importance reported by the boosting-based classifier, where the first two levels no longer dominate and a resampling-like strategy distributes importance more evenly. In addition to this shift, one can see the feature importance reflecting the biased beliefs provided by the prespecified priors. For example, in the pos+ dominating setting, every level tends to give pos+ more importance; the same holds for the other four dominated settings, but not for the uniform one. The uniform prior setting distributes importance evenly across sentiment labels but has different preferences at some specific levels. To get a closer view of how the biased priors affect the non-root sentiment predictions, below is a plot of AUC (area under the receiver operating characteristic curve) against level. Each column is the AUC for the titled non-root sentiment in the GradientBoostingClassifier predictions. AUC is used because it is not heavily biased by false positives and would not give misleading conclusions. As you can see in the figure, only the neg- sentiment label benefits heavily from its own dominating prior, which explains the better performance of the boosting-based classifier. This might indicate that the strategy both classifiers employ is simply to look for useful patterns in the biased predictions, and perhaps consistently bad predictions still form a useful pattern for root sentiment prediction.
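The per-class AUC values behind that plot can be computed one-vs-rest with scikit-learn. This sketch uses synthetic stand-in scores and labels rather than the actual predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: true class ids and pseudo class-probability scores.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 5, size=500)
y_score = rng.rand(500, 5)
y_score /= y_score.sum(axis=1, keepdims=True)

# One-vs-rest AUC for each of the five sentiment classes.
aucs = [roc_auc_score(y_true == k, y_score[:, k]) for k in range(5)]
```

Computing these per level, for each prior setting, yields the columns of the AUC-versus-level figure.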
As another view of how the boosting- and resampling-based classifiers handle imperfect labels, we also present the confusion matrices for root sentiment classification with perfect and with predicted non-root sentiments.
Observe the confusion matrices for the two classifiers with perfect labels (first row) and with the best-performing of the six prespecified prior settings (second row: neg dominating for ExtraTreesClassifier and neg- dominating for GradientBoostingClassifier). Both classifiers lose true-positive counts for each root sentiment class and redistribute them in two different ways. The resampling-based classifier tends to spread its predictions across all classes, at the price of low sensitivity (high false negatives) but high precision. The boosting-based classifier, on the other hand, is heavily biased by the wrong beliefs and fooled by the prespecified priors, ending up with low precision (high false positives) but high sensitivity.
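The matrices themselves come straight from scikit-learn; row-normalizing makes the sensitivity/precision trade-off visible at a glance. Again the labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for true and predicted root sentiment classes.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 5, size=400)
y_pred = rng.randint(0, 5, size=400)

cm = confusion_matrix(y_true, y_pred, labels=list(range(5)))
cm_norm = cm / cm.sum(axis=1, keepdims=True)   # each row sums to 1
```

The diagonal of `cm_norm` gives per-class sensitivity; column-normalizing instead would give per-class precision.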
The study presented here is partial and incomplete, aimed at better understanding how to handle the data. Within the general machine learning context, the questions are how a higher-level mapping (tree node to sentiment label) can be used as features, and how much such labels degrade the final prediction goal when they are corrupted. As more applications incorporate label embedding into their learning schemes, such as object recognition in computer vision, this study suggests that additional measures may need to be taken, given the known poor generalization of deep neural networks. Within the specific domain of sentiment prediction, the question is how much incorrect and noisy sentiment labeling should be expected from crowdsourced work when there is no cheap way to measure human sentiment precisely. Handling an imperfect "ground truth" is part of a statistician's daily life; to make machine learning actually work in real life, such expectations should be considered when designing the training framework.
In this study, we obtained a rough upper bound on root sentiment prediction accuracy given joint features formed from a highly confident parsing tree and perfect non-root sentiment labels. We also obtained a lower bound when arbitrary non-root sentiments are assigned. But how much improvement would be made if word embeddings were incorporated into root sentiment prediction?