As noted previously, the information available for classifying the root sentiment label from the composition of its children's labels decreases with increasing level in the parse tree (the depth from the root to the child node). However, to qualify the importance of tree features rather than merely quantify it, a decision tree will be used to measure how important the level and the label of a child node are, as joint features, for classifying the root sentiment label. First, I assume that the syntactic structure of each sentence is well captured by the parse tree of highest statistical significance, so it suffices to study the most significant one. Under this assumption, the uncertainty in the children labels due to parse tree construction can be largely excluded, and the true labels assigned through Amazon Mechanical Turk crowd work will be used. This gives an empirical upper bound on the root sentiment classification accuracy achievable with level and label as joint features (that is, the best case for the classification error). In subsequent work, I will replace the children labels with random predictions to obtain the corresponding lower bound, which shows how uncertainty in the children label predictions enters the classification framework and decreases accuracy.
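To make the idea of "joint features" concrete, here is a minimal sketch of the information gain criterion that a decision tree can use to choose a split. It is not the full decision tree described above. The toy rows, the field names (`level`, `label`, `root`), and the helper names are all hypothetical and chosen only for illustration; they are not taken from the actual dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target="root"):
    """Entropy reduction in `target` after grouping `rows` by `feature(row)`."""
    groups = {}
    for row in rows:
        groups.setdefault(feature(row), []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical toy data: one row per (child node, root) pair.
# Shallow children (level 1) agree with the root; deep ones (level 3) do not.
ROWS = [
    {"level": 1, "label": "pos", "root": "pos"},
    {"level": 1, "label": "pos", "root": "pos"},
    {"level": 1, "label": "neg", "root": "neg"},
    {"level": 1, "label": "neg", "root": "neg"},
    {"level": 3, "label": "pos", "root": "pos"},
    {"level": 3, "label": "pos", "root": "neg"},
    {"level": 3, "label": "neg", "root": "pos"},
    {"level": 3, "label": "neg", "root": "neg"},
]

gain_level = information_gain(ROWS, lambda r: r["level"])
gain_label = information_gain(ROWS, lambda r: r["label"])
gain_joint = information_gain(ROWS, lambda r: (r["level"], r["label"]))

print(f"level only : {gain_level:.3f} bits")
print(f"label only : {gain_label:.3f} bits")
print(f"level+label: {gain_joint:.3f} bits")
```

In this made-up example the level alone says nothing about the root label, the child label alone says a little, and the two together say the most. That is the kind of interaction a decision tree over both features is meant to expose.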
On the road to more advanced Python programming, it is inevitable to learn how the CPython interpreter works and how to extend its functionality through the Python C API. However, the biggest difficulty is knowing where and how to start, especially when the interpreter crashes. The online course CPython Internals, taught by Philip Guo, gives a brief background on the computational theory behind the interpreter as a first step toward connecting it with what is done in practice.
However, the course is taught with Python 2.7, and some pieces are missing or left unclarified in this mini lecture series. First, Python 3.x diverges greatly from Python 2.7, not only in the Python syntax itself but also in the corresponding C API, so some of the course material is slightly inapplicable to the latest Python version. Second, the series only touches the surface of the Python C API and does not have enough time to dig deeper. I would like to fill this gap, and I hope to get some feedback from real CPython experts.
As mentioned in the previous post, the commonly used baseline classifier, Multinomial Naive Bayes, which is based on the bag-of-words assumption, has a corresponding graphical model interpretation. This graphical model has one root, the sentiment label, connected to children nodes. These children nodes, which are also the leaves of the graphical model, represent word features that are conditionally independent given the root (Zhang H, 2004). The naivety of this assumption does not hold up in real-world applications. To step beyond this base model, connections (edges) between different features need to be added to the Naive Bayes graphical model. In addition to relaxing the conditional independence assumption, shifting from treating a word as a discrete count to treating it as a continuous vector is another step toward a more realistic model. However, given the monstrous dimensionality of the sparse vocabulary space in natural language, building these connections is intimidating because of the computational overhead. Using a binary parse tree, as in the work of Socher et al., is a clever way to limit the search space through the syntactic order constraint (Socher et al, 2013). Using latent semantic analysis, such as truncated SVD, to remap word counts into a continuous vector space will be explored in the following posts.
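As a reference point for the baseline described above, here is a minimal sketch of a Multinomial Naive Bayes classifier over bag-of-words counts. The function names (`train`, `predict`), the four toy reviews, and the use of add-one (Laplace) smoothing are my own assumptions for illustration; nothing here comes from the actual experiments. Each word contributes an independent term to the class score, which is exactly the "one root, independent leaves" structure of the graphical model.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """Fit log-priors and smoothed per-class word log-likelihoods.

    docs: list of (tokens, label) pairs.
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}

    log_prior = {c: log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like


def predict(tokens, log_prior, log_like):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c in log_prior:
        # Conditional independence: per-word log-likelihoods simply add up.
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in tokens if w in log_like[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy reviews.
DOCS = [
    ("great fun film".split(), "pos"),
    ("great acting".split(), "pos"),
    ("dull boring film".split(), "neg"),
    ("boring plot".split(), "neg"),
]

log_prior, log_like = train(DOCS)
print(predict("great film".split(), log_prior, log_like))
print(predict("boring film".split(), log_prior, log_like))
```

Note that word order plays no role in the score: "great film" and "film great" are scored identically. This is the limitation that tree-structured models try to address.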
Having spent a great deal of time struggling with movie review data and exploring the field of Natural Language Processing (NLP), I encountered some typical machine learning problems: imbalanced training data and high-dimensional, sparse features. I will first use the classifier most widely applied in text classification circles to address the imbalanced training problem, and cover feature selection, reduction, and transformation in the following posts.
For text classification, the most frequently used classifier is Naive Bayes, which uses Bayesian inference to maximize the posterior probability. However, since the priors are often treated as uniform and can be discarded as a constant during estimation, maximum a posteriori (MAP) estimation simplifies to a maximum likelihood problem: how likely is the observed data under the presumed model? (The inverse is the log-likelihood ratio test, which tests how strongly the model is supported by the acquired samples.)
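The simplification from MAP to maximum likelihood can be written out in one line. The notation is my own assumption for this sketch: $c$ is a class label, $x$ is the observed document, and $K$ is the number of classes.

```latex
\begin{aligned}
\hat{c}_{\mathrm{MAP}}
  &= \arg\max_{c} P(c \mid x)
   = \arg\max_{c} \frac{P(x \mid c)\, P(c)}{P(x)}
   = \arg\max_{c} P(x \mid c)\, P(c) \\
  &= \arg\max_{c} P(x \mid c) \cdot \frac{1}{K}
   = \arg\max_{c} P(x \mid c)
   = \hat{c}_{\mathrm{ML}}
\end{aligned}
```

The evidence $P(x)$ drops out because it does not depend on $c$, and the uniform prior $P(c) = 1/K$ drops out for the same reason. When the classes are imbalanced and the prior is estimated from class frequencies, the second step no longer holds and the two estimates can differ.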