Sunday, April 27, 2014

Weka template project for sentiment classification of an English text

The Internet is buzzing about machine learning. Many folks use it for fun and profit.

credits: Customers Who Bought This Item Also Bought PaulsHealthBlog.com, 11.04.2014

But when a non-expert tries to get started with these topics in practice, it quickly becomes difficult to just get going. There are of course quick solutions, like the scikit-learn library for Python. If you are a Java developer, there are a few options as well: Weka and Apache Mahout. Both of these are generic enough to be applied to different machine learning problems, including text classification. More tailored libraries and packages for text-oriented machine learning in Java are Mallet and Stanford's set of libraries. There are also some lesser-known machine learning toolkits, like Java-ML.

This post will focus on Weka and will give you a very simple, working template project for classifying sentiment in English text. Specifically, we will create a three-way sentiment classifier using the Multinomial Naive Bayes algorithm.

In my view, there are three main practical problems that a programmer seeks to solve with a machine learning library: setting up a classifier algorithm, adding training instances (effectively, training the classifier) and I/O (storing and retrieving a model). Beyond this, and of high importance, is measuring the quality of the trained model, which we will take a look at as well.

Setting up a classifier

As mentioned above, we will use the Multinomial Naive Bayes algorithm. To get going, let's set it up for three-way sentiment classification:


import java.util.ArrayList;

import weka.classifiers.bayes.NaiveBayesMultinomialText;
import weka.core.Attribute;
import weka.core.Instances;

public class ThreeWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public ThreeWayMNBTrainer(String outputModel) {
        // create the classifier
        classifier = new NaiveBayesMultinomialText();
        // filename for outputting the trained model
        modelFile = outputModel;

        // listing class labels (NEUTRAL is derived at classification time)
        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.ThreeWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.ThreeWayClazz.POSITIVE.name());
        // string attribute holding the raw text
        atts.add(new Attribute("content",(ArrayList<String>)null));
        // nominal attribute holding the class label
        atts.add(new Attribute("@@class@@",classVal));
        // create the instances data structure
        dataRaw = new Instances("TrainingInstances",atts,10);
    }

}

What happens in the above code:
  • Create the classifier
  • List the target labels: NEGATIVE and POSITIVE
  • Create the instances data structure
In a similar fashion, but with more classes (target labels), we'd set up a five-way classifier, using the same algorithm under the hood:

public class FiveWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public FiveWayMNBTrainer(String outputModel) {
        classifier = new NaiveBayesMultinomialText();
        classifier.setLowercaseTokens(true);
        classifier.setUseWordFrequencies(true);

        modelFile = outputModel;

        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.FiveWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.NEUTRAL.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_POSITIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.POSITIVE.name());
        atts.add(new Attribute("content",(ArrayList<String>)null));
        atts.add(new Attribute("@@class@@",classVal));

        dataRaw = new Instances("TrainingInstances",atts,10);
    }
}
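
Both trainers spell out the label list by hand. As a minimal sketch, the same list could be derived from an enum, which also keeps each label's position in the list equal to its ordinal. The class LabelLists and its nested FiveWay enum are hypothetical stand-ins (the real project uses SentimentClass.FiveWayClazz), and nothing here depends on Weka:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the post's project or of Weka.
class LabelLists {
    // Stand-in for the project's SentimentClass.FiveWayClazz enum (assumed order).
    enum FiveWay { NEGATIVE, SOMEWHAT_NEGATIVE, NEUTRAL, SOMEWHAT_POSITIVE, POSITIVE }

    // Collect the constant names in declaration order, so that the position
    // of each name in the list equals the ordinal of the enum constant.
    static <E extends Enum<E>> List<String> namesOf(Class<E> enumType) {
        List<String> names = new ArrayList<>();
        for (E constant : enumType.getEnumConstants()) {
            names.add(constant.name());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesOf(FiveWay.class));
    }
}
```

This fits the five-way case, where every enum constant is a target label. For the three-way trainer it would not reproduce the post's setup, because there only NEGATIVE and POSITIVE are registered as class values.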

Adding training instances (training a classifier)

Training the classifier is the process of showing examples to the algorithm. The training set usually consists of pairs (x, y), where x is a feature vector and y is the label for this vector. In the context of sentiment analysis specifically, a feature vector can be the words (unigrams) in a sentence and the label is the sentiment: NEGATIVE, NEUTRAL or POSITIVE in the case of three-way sentiment classification. The algorithm is expected to learn from the example set and generalize, so that it can predict labels y' for previously unseen vectors x'.
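
To make the (x, y) pair concrete, here is a minimal sketch that turns a sentence into unigram counts. The class UnigramFeatures is hypothetical and only illustrates the idea. In the project itself Weka does the tokenizing, so no such helper is needed there.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the post's project or of Weka.
class UnigramFeatures {
    // Lowercase the sentence, split it on whitespace and count each word.
    static Map<String, Integer> extract(String sentence) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // x is the feature vector and y the label of one training pair (x, y)
        Map<String, Integer> x = extract("I like this weather I like");
        String y = "POSITIVE";
        System.out.println(x + " -> " + y);
    }
}
```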

Engineering the features is a mix of art and mechanical work, as I've once mentioned. Finding good classifier options can also be a task for statistical analysis with visualization.

Let's implement the method for adding the training instances for three-way classification:

public void addTrainingInstance(SentimentClass.ThreeWayClazz threeWayClazz, String[] words) {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    // attribute 0: the text, stored as a string value (String.join needs Java 8+)
    instanceValue[0] = dataRaw.attribute(0).addStringValue(String.join(" ", words));
    // attribute 1: the label, encoded as the ordinal of the enum constant
    instanceValue[1] = threeWayClazz.ordinal();
    dataRaw.add(new DenseInstance(1.0, instanceValue));
    dataRaw.setClassIndex(1);
}

So basically we put the input unigrams (words) in as a String x value and the integer index of the label as the y value, thus forming a training instance for the algorithm. The algorithm will then internally tokenize the input string and update the necessary probabilities.

For five-way classification the above method looks almost the same, except the first parameter is of type SentimentClass.FiveWayClazz.
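
To get a feeling for what "tokenize and update the necessary probabilities" means, here is a toy multinomial Naive Bayes with add-one smoothing. It is a simplified sketch under my own assumptions (whitespace tokenizing, lowercase, string labels) and not Weka's actual implementation. The class name TinyMultinomialNB is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy multinomial Naive Bayes; an illustration, not Weka's implementation.
class TinyMultinomialNB {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> number of tokens
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> number of texts
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Tokenize on whitespace and update the per-class counts.
    void add(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Return the label with the highest log posterior (add-one smoothing).
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior of the class
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
            for (String w : text.toLowerCase().split("\\s+")) {
                if (w.isEmpty()) continue;
                // smoothed log likelihood of the word given the class
                score += Math.log((counts.getOrDefault(w, 0) + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Adding an instance only increments counters, which is why adding training data is cheap and the probabilities can be computed from the counts when they are needed.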

Training the model after we have finished adding the training examples is quite simple:

public void trainModel() throws Exception {
    classifier.buildClassifier(dataRaw);
}

That's it!

I/O (storing and retrieving the trained model)

It is OK to train a model and classify right away. But that does not work if you want to develop your model and ship it to production. In production mode your trained classifier will do its main work: classifying new instances. So your model must be pre-trained and exist on disk. Storing and loading a trained model with Weka is extremely easy, thanks to the fact that the classifiers extend the abstract class AbstractClassifier, which in turn implements the Serializable interface, among others.

Storing the trained model is as easy as:

public void saveModel() throws Exception {
    weka.core.SerializationHelper.write(modelFile, classifier);
}

And loading the model is easy too:
public void loadModel(String _modelFile) throws Exception {
    NaiveBayesMultinomialText classifier = (NaiveBayesMultinomialText) weka.core.SerializationHelper.read(_modelFile);
    this.classifier = classifier;
}
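
Since the classifier is Serializable, the same save and load round trip can be sketched with plain java.io, without Weka on the classpath. ToyModel and ModelIo are hypothetical names used for illustration. ToyModel merely stands in for a trained classifier.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a trained classifier.
class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final double[] weights;

    ToyModel(String name, double[] weights) {
        this.name = name;
        this.weights = weights;
    }
}

// Hypothetical helper showing a serialization round trip with standard Java I/O.
class ModelIo {
    // Write the object graph to the given file.
    static void save(Path file, Serializable model) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Read the object back; the caller casts it to the expected type.
    static Object load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }
}
```

As with any Java serialization, the class that reads the model has to be compatible with the one that wrote it, so it is safest to load a model with the same library version that produced it.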


After we have loaded the model, let's classify some texts. The method for three-way classification is:

public SentimentClass.ThreeWayClazz classify(String sentence) throws Exception {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    instanceValue[0] = dataRaw.attribute(0).addStringValue(sentence);

    Instance toClassify = new DenseInstance(1.0, instanceValue);
    dataRaw.setClassIndex(1);
    toClassify.setDataset(dataRaw);

    double prediction = this.classifier.classifyInstance(toClassify);

    double[] distribution = this.classifier.distributionForInstance(toClassify);
    if (distribution[0] != distribution[1])
        return SentimentClass.ThreeWayClazz.values()[(int)prediction];
    else
        return SentimentClass.ThreeWayClazz.NEUTRAL;
}

Notice the line that compares distribution[0] with distribution[1]. Remember that we have defined the target classes for the three-way classifier as {NEGATIVE, POSITIVE}. So in principle our classifier is only capable of binary classification. But in the event that the probability distribution between the two classes is exactly equal, we can safely assume it is the NEUTRAL class. That is how we get the three-way classifier. The following test case should ideally pass:

@org.junit.Test
public void testArbitraryTextPositive() throws Exception {
    threeWayMnbTrainer.loadModel(modelFile);
    Assert.assertEquals(SentimentClass.ThreeWayClazz.POSITIVE, threeWayMnbTrainer.classify("I like this weather"));
}

Neat!
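
The NEUTRAL fallback in classify can be looked at in isolation. The sketch below is a hypothetical variant, not the code from the project: it adds an assumed epsilon parameter, because two floating-point probabilities are rarely exactly equal. With epsilon set to 0 it behaves like the exact comparison used in the post. The label order in the nested enum is also an assumption.

```java
// Hypothetical helper; epsilon is an assumed parameter that the post does not use.
class NeutralRule {
    enum Label { NEGATIVE, POSITIVE, NEUTRAL }

    // distribution[0] = P(NEGATIVE), distribution[1] = P(POSITIVE)
    static Label decide(double[] distribution, double epsilon) {
        double diff = distribution[1] - distribution[0];
        // the two classes are (nearly) equally likely: call it NEUTRAL
        if (Math.abs(diff) <= epsilon) {
            return Label.NEUTRAL;
        }
        return diff > 0 ? Label.POSITIVE : Label.NEGATIVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(new double[] {0.5, 0.5}, 0.0));
        System.out.println(decide(new double[] {0.2, 0.8}, 0.0));
    }
}
```

Whether a non-zero margin helps, and how large it should be, is something to check against labeled data rather than to take for granted.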

To wrap things up, here is the "runner" class that builds the three-way classifier, saves the model and tests it for quality over the training data:

public class ThreeWayMNBTrainerRunner {
    public static void main(String[] args) throws Exception {
        KaggleCSVReader kaggleCSVReader = new KaggleCSVReader();
        kaggleCSVReader.readKaggleCSV("kaggle/train.tsv");
        KaggleCSVReader.CSVInstanceThreeWay csvInstanceThreeWay;

        String outputModel = "models/three-way-sentiment-mnb.model";

        ThreeWayMNBTrainer threeWayMNBTrainer = new ThreeWayMNBTrainer(outputModel);

        System.out.println("Adding training instances");
        int addedNum = 0;
        while ((csvInstanceThreeWay = kaggleCSVReader.next()) != null) {
            if (csvInstanceThreeWay.isValidInstance) {
                threeWayMNBTrainer.addTrainingInstance(csvInstanceThreeWay.sentiment, csvInstanceThreeWay.phrase.split("\\s+"));
                addedNum++;
            }
        }

        kaggleCSVReader.close();

        System.out.println("Added " + addedNum + " instances");

        System.out.println("Training and saving Model");
        threeWayMNBTrainer.trainModel();
        threeWayMNBTrainer.saveModel();

        System.out.println("Testing model");
        threeWayMNBTrainer.testModel();
    }
}



The quality of the model

Testing the trained model is fairly easy with Weka as well. Knowing the quality of your model is important because you want to make sure that there is no under- or overfitting happening. Underfitting means you haven't found an optimal set of features describing your phenomenon to fully utilize your training data, thus the model is long-sighted or too simple. Overfitting means you are over-learning your training data and over-adjusting to it, i.e. the model does not generalize to real-world instances and becomes too short-sighted or too complex.

There are different ways to test the model. One is to hold out part of your training data as test data (for example one third). Another is to perform N-fold cross-validation, i.e. on each iteration take a new piece of the training data as test data, train on the rest, and compute sensible metrics, like precision, recall and accuracy. At the end of the cross-validation, take the average over the computed values. This will be your "amortized" quality.

We can also take a quick look at the quality by just counting the number of correctly classified instances from the training data. Keep in mind that this number is optimistic, since the model has already seen these instances:
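
The fold bookkeeping behind N-fold cross-validation can be sketched in plain Java. KFold is a hypothetical helper written for illustration. Weka's Evaluation class has its own crossValidateModel method for this, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper illustrating N-fold splitting; not Weka's implementation.
class KFold {
    // Shuffle the indices 0..n-1 and deal them into k folds of near-equal size.
    static List<List<Integer>> split(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    // Average the per-fold scores to get the cross-validated estimate.
    static double average(double[] foldScores) {
        double sum = 0;
        for (double s : foldScores) {
            sum += s;
        }
        return sum / foldScores.length;
    }
}
```

On each iteration one fold serves as test data and the remaining folds as training data, so every instance is used for testing exactly once.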

    public void testModel() throws Exception {
        Evaluation eTest = new Evaluation(dataRaw);
        eTest.evaluateModel(classifier, dataRaw);
        String strSummary = eTest.toSummaryString();
        System.out.println(strSummary);
    }

The method outputs the following statistics:

Correctly Classified Instances       28625               83.3455 %
Incorrectly Classified Instances      5720               16.6545 %
Kappa statistic                          0.4643
Mean absolute error                      0.2354
Root mean squared error                  0.3555
Relative absolute error                 71.991  %
Root relative squared error             87.9228 %
Coverage of cases (0.95 level)          97.7697 %
Mean rel. region size (0.95 level)      83.3426 %
Total Number of Instances            34345     
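
Two of these numbers are easy to recompute by hand. The sketch below shows the accuracy calculation and the usual definition of Cohen's kappa from a confusion matrix. SummaryMath is a hypothetical helper, and since the post does not list the confusion matrix, the kappa of 0.4643 above cannot be reproduced from the summary alone.

```java
// Hypothetical helper for reading the summary; not part of Weka or the post's project.
class SummaryMath {
    // Fraction of correctly classified instances.
    static double accuracy(long correct, long total) {
        return (double) correct / total;
    }

    // Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted).
    static double kappa(long[][] confusion) {
        int n = confusion.length;
        double total = 0;
        double diagonal = 0;
        double[] rowSums = new double[n];
        double[] colSums = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += confusion[i][j];
                rowSums[i] += confusion[i][j];
                colSums[j] += confusion[i][j];
                if (i == j) {
                    diagonal += confusion[i][j];
                }
            }
        }
        // observed agreement versus the agreement expected by chance
        double observed = diagonal / total;
        double expected = 0;
        for (int i = 0; i < n; i++) {
            expected += (rowSums[i] / total) * (colSums[i] / total);
        }
        return (observed - expected) / (1 - expected);
    }
}
```

Calling SummaryMath.accuracy(28625, 34345) gives roughly 0.8335, which matches the 83.3455 % reported in the first row.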

The code can be found on my GitHub. It utilizes the data posted on Kaggle. So if you want to use the code as is (and perhaps even make a submission), you need to accept the terms of the Kaggle competition and download the training set. I leave the exercise of implementing the full code for five-way classification and the code for classifying Kaggle's test set to the reader.