My paper on rule-based sentiment analysis was accepted to Dialog'2012, in the special section on ROMIP'2011. Last year ROMIP had a track on 2-way and 3-way sentiment classification of texts in Russian. Our team, together with @vporoshin, had three major systems:
1. A rule-based system, described in the paper.
2. A modified multinomial Naive Bayes (MNB) classifier trained on unigrams and bigrams.
3. A classifier ensemble of the two above.
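To make the second system in the list more concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over unigram and bigram features. It is a plain textbook version with add-one smoothing, not the modified variant we actually used, and the class name, the `features` helper and the English toy training texts are all invented for illustration (the real data was in Russian).

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Return unigrams plus adjacent-word bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Textbook multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(features(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in features(text):
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set, only to show the mechanics.
train_texts = ["great movie loved it", "not good at all",
               "really good film", "boring and bad"]
train_labels = ["pos", "neg", "pos", "neg"]
model = MultinomialNB().fit(train_texts, train_labels)
```

With this toy data, `model.predict("not good")` comes out as `"neg"` because the bigram `not good` was seen only in a negative text, even though the unigram `good` occurs in both classes. That is the kind of sequence a bigram-aware classifier can pick up from the training set without any dictionary.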
The rule-based approach relies largely on a pre-crafted polarity dictionary, which means it knows only the polarity word sequences that the dictionary contains. The MNB classifier, in contrast, learns such sequences from the training set. There are other differences as well. MNB is essentially a bag-of-words approach, yet it can work surprisingly well: in 2-way classification it reached an accuracy of 90+% for one of the domains.
The rule-based algorithm has some interesting linguistic features, such as object-oriented sentiment detection. Although ROMIP's sentiment tracks, run for the first time, did not require object-oriented detection, each text in the test data came with an object name (e.g. a movie title or product name). Object-oriented and general sentiment detection performed equally well, both above 50% (i.e. above the accuracy of coin tossing).
The overall accuracy of the general rule-based classification is 63%, with 92% precision for the positive class. This generally means that more polarity words should be mined for the negative class and that the existing negative polarity dictionary should be revised (some of its words could be of positive or ambiguous polarity).
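As a rough illustration of the dictionary-driven idea and of the two metrics quoted above, here is a minimal sketch. It is not the algorithm from the paper: the `POLARITY` lexicon, the longest-match lookup, the fallback to `"neg"` when nothing matches, and the English toy texts are all assumptions made for the example.

```python
# Hypothetical toy lexicon: word sequences mapped to a polarity of +1 or -1.
POLARITY = {
    ("excellent",): 1,
    ("good",): 1,
    ("not", "good"): -1,
    ("waste", "of", "time"): -1,
    ("boring",): -1,
}
MAX_LEN = max(len(key) for key in POLARITY)


def classify(text):
    """Sum the polarities of dictionary entries found in the text (2-way)."""
    tokens = text.lower().split()
    score, i = 0, 0
    while i < len(tokens):
        # Prefer the longest dictionary entry starting at position i.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in POLARITY:
                score += POLARITY[key]
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    # Arbitrary choice for this sketch: a zero score falls back to "neg".
    return "pos" if score > 0 else "neg"


def accuracy(gold, pred):
    """Share of texts whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision(gold, pred, cls):
    """Share of texts predicted as `cls` that really belong to `cls`."""
    predicted = [g for g, p in zip(gold, pred) if p == cls]
    return sum(g == cls for g in predicted) / len(predicted) if predicted else 0.0


# Hypothetical mini test set.
texts = ["excellent and good", "not good at all", "a waste of time", "lovely film"]
gold = ["pos", "neg", "neg", "pos"]
pred = [classify(t) for t in texts]
```

In this toy run `lovely film` is misclassified simply because `lovely` is not in the lexicon, which is the limitation described above: the classifier only knows the sequences its dictionary contains, so coverage of the dictionary directly shapes accuracy and per-class precision.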
Some more numbers in the paper: