Sentiment Analysis Part 1: The Sentiment Analyst

Machine learning is reshaping technology at a pace that's hard to keep up with. Face detection, voice recognition, and text classification are everywhere, and each is a variation on the same underlying idea: teaching software to extract meaning from data.

Sentiment analysis sits comfortably in that landscape. Wikipedia defines it well: sentiment analysis, also known as opinion mining or emotion AI, refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. In practice, it's widely applied to customer reviews, survey responses, social media, and healthcare data, with use cases spanning marketing, customer service, and clinical medicine.

As developers, keeping pace with this means actively bringing these capabilities into the products we build. In this article, we're going to work through one of the most practical and widely used applications of machine learning: sentiment analysis. We'll train a model using the IMDB Movie Reviews dataset, and then build a working example application that uses the trained model to predict whether a user's comment is positive or negative.

This is what the finished application will look like:

The full source code is available here.

The Example Project

The project is split into three parts:

Sentiment Analyst handles all machine learning processes
Trainer is responsible for training the model
Interface handles interaction with the users

What Do We Need?

Important Note: Our example targets x64. ML.NET's x86 support is limited to Windows and excludes some components, so if you try to run the example on x86 you may run into problems.

Project Overview

The goal of this project is to build a practical, working example of machine learning in a real application. We'll train a model on IMDB movie reviews and use it to predict whether a user's comment about a movie is positive or negative.

One important point: we're training on movie reviews, so the model is tuned specifically to that domain. If you point it at Twitter messages or product feedback, accuracy will likely drop. To apply sentiment analysis to a different domain, train your model on a dataset that matches that domain.

Part I: Sentiment Analyst

ML.NET supports a range of machine learning algorithms, but for sentiment analysis we only need to focus on the relevant subset of what the framework offers. Rather than scattering ML.NET calls throughout the codebase, we'll build a dedicated helper class that wraps the machine learning functionality in a clean, reusable way. The Trainer we'll build later will also need access to parts of this class, so keeping it organized from the start pays off quickly.

Let's start by creating an empty solution and naming it "Movie Reviews".

After clicking Create, Visual Studio scaffolds an empty solution. Now right-click the solution in Solution Explorer and select Add > New Project.

Select Class Library and click Next.

Name the project "Sentiment Analysis" and click Create.

Now we have a solution with a Class Library project in it. Before writing any code, we need to install ML.NET via NuGet. Open the NuGet Package Manager by navigating to Tools > NuGet Package Manager > Manage NuGet Packages for Solution. Make sure you're on the Browse tab, search for Microsoft.ML (the package that contains ML.NET), and install it into the project we just created.

Once installation completes, you should see the ML.NET references listed under the References node in Solution Explorer.

One last step before we write any code: set the Platform Target to x64. Go to the project Properties, navigate to the Build tab, and switch the Platform Target dropdown to x64.
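
If you would rather have a platform mismatch fail loudly at startup than surface later as an obscure load error, a small guard can help. The sketch below is only an illustration: PlatformGuard and its members are hypothetical names that are not part of ML.NET or the example project, while Environment.Is64BitProcess is the standard .NET property.

```csharp
using System;

// Hypothetical helper (not part of the example project): fails fast when
// the host process is not 64-bit.
public static class PlatformGuard
{
    // Pure function, so the message logic can be checked on any machine.
    public static string Describe(bool is64BitProcess) =>
        is64BitProcess
            ? "64-bit process: OK"
            : "32-bit process: set Platform Target to x64 and rebuild";

    // Throws if the current process is 32-bit.
    public static void EnsureX64()
    {
        if (!Environment.Is64BitProcess)
            throw new PlatformNotSupportedException(Describe(false));
    }
}
```

Calling PlatformGuard.EnsureX64() as the first statement of the host application's entry point would be one place to use it.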

To summarize the setup steps:

Create the solution and name it "Movie Reviews"
Add a Class Library project and name it "Sentiment Analysis"
Install the ML.NET NuGet package
Set Platform Target to x64

With all of that done, here's how the project structure will look once we've created all the necessary files:

Root folder: SentimentAnalyst.cs

Models folder: CrossValidationResult.cs, Data.cs, Definitions.cs, LearningMethodResult.cs, Prediction.cs

Models

CrossValidationResult

A simple container class for returning cross-validation results from the training phase. It has four fields.

public class CrossValidationResult
{
    public string Trainer;
    public double AccuracyAverage;
    public double AccuraciesStdDeviation;
    public double AccuraciesConfidenceInterval95;
}

Data

The class we use to pass training data to ML.NET. Our training data has two columns, Review and Sentiment. On each property, ColumnName sets the column's name in the ML.NET schema and LoadColumn gives its zero-based position in the data file.

using Microsoft.ML.Data;

public class Data
{
    [ColumnName("review")]
    [LoadColumn(0)]
    public string Review { get; set; }

    [ColumnName("sentiment")]
    [LoadColumn(1)]
    public bool Sentiment { get; set; }
}
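
To see what the two attributes do, it can help to load a couple of in-memory rows and inspect the resulting schema. This is a minimal sketch that assumes the Microsoft.ML package is referenced. SchemaDemo is a hypothetical name, the sample rows are made up, and the Data class is repeated only so the snippet is self-contained.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Same shape as the article's Data class, repeated for self-containment.
public class Data
{
    [ColumnName("review")] [LoadColumn(0)] public string Review { get; set; }
    [ColumnName("sentiment")] [LoadColumn(1)] public bool Sentiment { get; set; }
}

// Hypothetical helper used only for this illustration.
public static class SchemaDemo
{
    // Returns the column names ML.NET derives from the Data class.
    public static List<string> ColumnNames()
    {
        var mlContext = new MLContext();
        var rows = new[]
        {
            new Data { Review = "Loved it", Sentiment = true },
            new Data { Review = "Dull and far too long", Sentiment = false }
        };

        // LoadColumn only matters when reading files; ColumnName applies here too.
        IDataView view = mlContext.Data.LoadFromEnumerable(rows);
        return view.Schema.Select(column => column.Name).ToList();
    }
}
```

The returned list should contain the lowercase names review and sentiment rather than the C# property names, which is why the pipeline later refers to the columns by those strings.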

Definitions

Defines the available learning models, allowing us to experiment with different algorithms against the same training data.

public enum Trainers
{
    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm
}
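
Because the algorithms are plain enum values, switching between them from configuration or a command-line argument is straightforward. The following sketch shows one way to do that. TrainerParser is a hypothetical helper that does not appear in the example project, and the enum is repeated so the snippet compiles on its own.

```csharp
using System;

public enum Trainers
{
    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm
}

// Hypothetical helper: maps user input (for example a command-line argument)
// to a Trainers value, falling back to a default for unknown names.
public static class TrainerParser
{
    public static Trainers Parse(string name, Trainers fallback = Trainers.SdcaLogisticRegression)
    {
        // Enum.TryParse also accepts numeric strings, so IsDefined rejects
        // out-of-range values such as "42".
        return Enum.TryParse(name, ignoreCase: true, out Trainers parsed)
               && Enum.IsDefined(typeof(Trainers), parsed)
            ? parsed
            : fallback;
    }
}
```

With this in place, TrainerParser.Parse("linearsvm") yields Trainers.LinearSvm, while an unrecognized name falls back to SdcaLogisticRegression, the same default the Train function uses.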

LearningMethodResult

The class returned after a training run completes, carrying the evaluation metrics.

public class LearningMethodResult
{
    public string Trainer;
    public double Accuracy;
    public double AreaUnderRocCurve;
    public double F1Score;
}
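
When several algorithms are trained against the same data, a list of these results can be ranked to pick a winner. The sketch below is one possible approach, not something the example project contains: ResultRanking is a hypothetical helper, and ordering by F1 score with accuracy as a tie-breaker is an assumption about what "best" means.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the article's LearningMethodResult, repeated for self-containment.
public class LearningMethodResult
{
    public string Trainer;
    public double Accuracy;
    public double AreaUnderRocCurve;
    public double F1Score;
}

// Hypothetical helper used only for this illustration.
public static class ResultRanking
{
    // Orders results best-first by F1 score, breaking ties by accuracy.
    public static List<LearningMethodResult> RankByF1(IEnumerable<LearningMethodResult> results) =>
        results
            .OrderByDescending(r => r.F1Score)
            .ThenByDescending(r => r.Accuracy)
            .ToList();
}
```

Passing the list returned by TrainMultiple to RankByF1 and taking the first element would give the trainer to use for the final model.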

Prediction

The class returned after asking ML.NET to classify input data. Note that it inherits from the Data class, so it carries the input fields alongside the prediction output.

public class Prediction : Data
{
    [ColumnName("PredictedLabel")]
    public bool PredictionValue { get; set; }
    public float Score { get; set; }
}

SentimentAnalyst Class

This is where all the work happens. The class has three core responsibilities: loading data, training the model, and making predictions.

We start with two private fields for storing file path information.

private readonly string _dataPath;
private readonly string _modelPath;

Every ML.NET operation flows through a single shared context object.

private readonly MLContext _mlContext;

We also need fields for the transformer and data view.

private ITransformer _model;
private IDataView _dataViewPrimary;

And two pipeline fields, only one of which is set at a time, that let us support all five learning algorithms. The first three trainers produce Platt-calibrated models, so their pipeline has a different generic type: _trainingPipelinePlat is used for LbfgsLogisticRegression, SgdCalibrated, and SdcaLogisticRegression. _trainingPipeline is used for AveragedPerceptron and LinearSvm.

private EstimatorChain<BinaryPredictionTransformer<
    CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;

private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;

The constructor initializes everything.

public SentimentAnalyst(string dataPath = null, string modelPath = null)
{
    _mlContext = new MLContext();
    _dataPath = dataPath;
    _modelPath = modelPath;
}

LoadData

Machine learning is only as good as the data behind it. The LoadData function uses ML.NET's LoadFromTextFile method to read the training file and map it to our Data class. It then splits the dataset 80/20 into training and test sets.

private DataOperationsCatalog.TrainTestData LoadData()
{
    IDataView dataView = null;
    if (_dataPath == null)
        throw new Exception("Data Path is undefined");

    _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
        _dataPath,
        hasHeader: true,
        separatorChar: ',',
        allowQuoting: true
    );

    dataView = _dataViewPrimary;
    var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
    return splitDataView;
}
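
The loader options are easier to follow against a concrete file. The sketch below writes a tiny CSV with made-up rows and reads it back using the same options as LoadData. It assumes the Microsoft.ML package is referenced and that the label column holds true/false values. LoadDemo is a hypothetical name, and the Data class is repeated so the snippet stands alone.

```csharp
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Same shape as the article's Data class, repeated for self-containment.
public class Data
{
    [ColumnName("review")] [LoadColumn(0)] public string Review { get; set; }
    [ColumnName("sentiment")] [LoadColumn(1)] public bool Sentiment { get; set; }
}

// Hypothetical helper used only for this illustration.
public static class LoadDemo
{
    // Writes a two-row CSV to the temp directory and loads it back as Data objects.
    public static Data[] LoadSample()
    {
        var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".csv");
        File.WriteAllLines(path, new[]
        {
            "review,sentiment",                       // skipped because hasHeader is true
            "\"Great cast, great script\",true",      // quotes protect the embedded comma
            "\"Slow, predictable and dull\",false"
        });

        var mlContext = new MLContext();
        IDataView view = mlContext.Data.LoadFromTextFile<Data>(
            path,
            hasHeader: true,
            separatorChar: ',',
            allowQuoting: true);

        // IDataView is lazy, so enumerate the rows to actually read the file.
        return mlContext.Data.CreateEnumerable<Data>(view, reuseRowObject: false).ToArray();
    }
}
```

Without allowQuoting, the commas inside the review text would be treated as column separators, which matters for free-form text such as movie reviews.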

Train

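The Train function loads the data, records the chosen algorithm in the _targetTrainer field (declared in the full class listing), builds and trains the model against the training set, evaluates it against the test set, saves the trained model to disk, and returns the evaluation metrics.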

public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
{
    var splitDataView = LoadData();

    _targetTrainer = targetTrainer;
    _model = BuildAndTrainModel(splitDataView.TrainSet);

    var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

    var directoryInfo = new FileInfo(_modelPath).Directory;
    if (directoryInfo != null)
    {
        var path = directoryInfo.FullName;
        if (!Directory.Exists(path))
            Directory.CreateDirectory(path);
    }

    _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);
    return learningMethodResult;
}
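
The directory handling in Train could also be pulled out into a small helper. The following is a sketch of that idea under a few assumptions: ModelPaths and EnsureParentDirectory are hypothetical names, and the explicit check for an undefined path is an addition that the Train function above does not perform.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the example project).
public static class ModelPaths
{
    // Creates the parent directory of filePath if needed and returns its full path.
    // Returns null when filePath is a root with no parent directory.
    public static string EnsureParentDirectory(string filePath)
    {
        if (string.IsNullOrWhiteSpace(filePath))
            throw new ArgumentException("Model path is undefined", nameof(filePath));

        var directory = Path.GetDirectoryName(Path.GetFullPath(filePath));
        if (directory != null)
            Directory.CreateDirectory(directory); // does nothing if it already exists

        return directory;
    }
}
```

Directory.CreateDirectory creates any missing intermediate folders and does not fail when the directory already exists, so a separate Directory.Exists check is not strictly required.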

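The full SentimentAnalyst class also includes TrainMultiple and CrossValidate functions. TrainMultiple runs all five learning algorithms against the same train/test split in one pass so you can compare results. CrossValidate retrains the most recently used pipeline on several folds of the data and reports how much accuracy varies between them, which helps reveal overfitting before you ship the model to production. Because it reuses the data view and pipeline set up during training, call Train first. The complete class code is below.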

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Calibrators;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

public class SentimentAnalyst
{
    private readonly string _dataPath;
    private readonly string _modelPath;

    private readonly MLContext _mlContext;
    private ITransformer _model;
    private IDataView _dataViewPrimary;

    private Trainers _targetTrainer;

    // Pipeline for trainers that produce Platt-calibrated models.
    private EstimatorChain<BinaryPredictionTransformer<
        CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;

    // Pipeline for trainers that produce uncalibrated linear models.
    private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;

    public SentimentAnalyst(string dataPath = null, string modelPath = null)
    {
        _mlContext = new MLContext();
        _dataPath = dataPath;
        _modelPath = modelPath;
    }

    // Loads a previously saved model from _modelPath, if the file exists.
    public void LoadTrainedModel()
    {
        if (_modelPath == null)
            throw new Exception("Model Path is undefined");

        if (File.Exists(_modelPath))
            _model = _mlContext.Model.Load(_modelPath, out _);
    }

    // Trains with the given algorithm, evaluates on the test set, and saves the model.
    public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
    {
        var splitDataView = LoadData();

        _targetTrainer = targetTrainer;
        _model = BuildAndTrainModel(splitDataView.TrainSet);

        var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

        var directoryInfo = new FileInfo(_modelPath).Directory;
        if (directoryInfo != null)
        {
            var path = directoryInfo.FullName;
            if (!Directory.Exists(path))
                Directory.CreateDirectory(path);
        }

        _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);
        return learningMethodResult;
    }

    // Trains every algorithm in the Trainers enum on the same split and collects the metrics.
    public List<LearningMethodResult> TrainMultiple()
    {
        var learningMethodResults = new List<LearningMethodResult>();
        var splitDataView = LoadData();

        foreach (var trainer in (Trainers[]) Enum.GetValues(typeof(Trainers)))
        {
            _targetTrainer = trainer;
            Console.WriteLine("Trainer:{0}", _targetTrainer);
            _model = BuildAndTrainModel(splitDataView.TrainSet);
            learningMethodResults.Add(Evaluate(_model, splitDataView.TestSet));
        }

        return learningMethodResults;
    }

    // Cross-validates the most recently built pipeline; requires a prior training run.
    public CrossValidationResult CrossValidate(int folds = 5)
    {
        var crossValidationResult = new CrossValidationResult();
        IReadOnlyList<TrainCatalogBase.CrossValidationResult<BinaryClassificationMetrics>> crossValidationResults = null;

        if (_trainingPipelinePlat != null)
            crossValidationResults =
                _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipelinePlat,
                    folds, "sentiment");
        else if (_trainingPipeline != null)
            crossValidationResults =
                _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipeline,
                    folds, "sentiment");

        // Neither pipeline is set until a model has been trained.
        var metricsInMultipleFolds =
            (crossValidationResults ?? throw new InvalidOperationException("Train a model before cross-validating."))
            .Select(r => r.Metrics);
        var accuracyValues = metricsInMultipleFolds.Select(m => m.Accuracy);
        var accuracyAverage = accuracyValues.Average();
        var accuraciesStdDeviation = CalculateStandardDeviation(accuracyValues);
        var accuraciesConfidenceInterval95 = CalculateConfidenceInterval95(accuracyValues);

        crossValidationResult.AccuracyAverage = accuracyAverage;
        crossValidationResult.AccuraciesStdDeviation = accuraciesStdDeviation;
        crossValidationResult.AccuraciesConfidenceInterval95 = accuraciesConfidenceInterval95;
        crossValidationResult.Trainer = _targetTrainer.ToString();

        return crossValidationResult;
    }

    // Classifies a single comment.
    public Prediction Predicate(Data sentiment)
    {
        var predictionFunction = _mlContext.Model.CreatePredictionEngine<Data, Prediction>(_model);
        return predictionFunction.Predict(sentiment);
    }

    // Classifies a batch of comments; only the prediction fields are copied to the results.
    public IEnumerable<Prediction> MultiPredicate(IEnumerable<Data> sentiments)
    {
        var sentimentPredictionResultList = new List<Prediction>();
        var batchComments = _mlContext.Data.LoadFromEnumerable(sentiments);
        var predictions = _model.Transform(batchComments);
        var predictedResults = _mlContext.Data.CreateEnumerable<Prediction>(predictions, false);

        foreach (var prediction in predictedResults)
        {
            var sentimentPrediction = new Prediction
            {
                PredictionValue = prediction.PredictionValue,
                Score = prediction.Score
            };
            sentimentPredictionResultList.Add(sentimentPrediction);
        }
        return sentimentPredictionResultList;
    }

    // Reads the CSV at _dataPath and splits it 80/20 into training and test sets.
    private DataOperationsCatalog.TrainTestData LoadData()
    {
        IDataView dataView = null;

        if (_dataPath == null)
            throw new Exception("Data Path is undefined");

        _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
            _dataPath,
            hasHeader: true,
            separatorChar: ',',
            allowQuoting: true
        );

        dataView = _dataViewPrimary;
        var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
        return splitDataView;
    }

    // Builds the text featurization pipeline, appends the selected trainer, and fits it.
    private ITransformer BuildAndTrainModel(IDataView splitTrainSet)
    {
        var dataProcessPipeline = _mlContext.Transforms.Text.FeaturizeText("review_tf", "review")
            .Append(_mlContext.Transforms.CopyColumns("Features", "review_tf"))
            .Append(_mlContext.Transforms.NormalizeMinMax("Features", "Features")
                .AppendCacheCheckpoint(_mlContext));

        switch (_targetTrainer)
        {
            case Trainers.LbfgsLogisticRegression:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.LbfgsLogisticRegression("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.SgdCalibrated:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.SgdCalibrated("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.SdcaLogisticRegression:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.SdcaLogisticRegression("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.AveragedPerceptron:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.AveragedPerceptron("sentiment");
                _trainingPipeline = dataProcessPipeline.Append(trainer);
                _trainingPipelinePlat = null;
                return _trainingPipeline.Fit(splitTrainSet);
            }
            case Trainers.LinearSvm:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.LinearSvm("sentiment");
                _trainingPipeline = dataProcessPipeline.Append(trainer);
                _trainingPipelinePlat = null;
                return _trainingPipeline.Fit(splitTrainSet);
            }
            default:
                throw new ArgumentOutOfRangeException(nameof(_targetTrainer), _targetTrainer, null);
        }
    }

    // Scores the test set and packages the metrics for the caller.
    private LearningMethodResult Evaluate(ITransformer model, IDataView splitTestSet)
    {
        var learningMethodResult = new LearningMethodResult();
        var predictions = model.Transform(splitTestSet);
        var metrics = _mlContext.BinaryClassification.EvaluateNonCalibrated(predictions, "sentiment");

        learningMethodResult.Accuracy = metrics.Accuracy;
        learningMethodResult.AreaUnderRocCurve = metrics.AreaUnderRocCurve;
        learningMethodResult.F1Score = metrics.F1Score;
        learningMethodResult.Trainer = _targetTrainer.ToString();

        return learningMethodResult;
    }

    // Sample standard deviation (divides by n - 1).
    private static double CalculateStandardDeviation(IEnumerable<double> values)
    {
        var average = values.Average();
        var sumOfSquaresOfDifferences = values.Select(val => (val - average) * (val - average)).Sum();
        var standardDeviation = Math.Sqrt(sumOfSquaresOfDifferences / (values.Count() - 1));
        return standardDeviation;
    }

    // Half-width of an approximate 95% confidence interval.
    // Note: this divides by sqrt(n - 1); the textbook standard error divides by sqrt(n).
    private static double CalculateConfidenceInterval95(IEnumerable<double> values)
    {
        var confidenceInterval95 = 1.96 * CalculateStandardDeviation(values) / Math.Sqrt(values.Count() - 1);
        return confidenceInterval95;
    }
}
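
To make the cross-validation statistics concrete, the sketch below computes them for a handful of fold accuracies without involving ML.NET at all. FoldStats is a hypothetical name. The confidence interval here uses the textbook form 1.96 * s / sqrt(n), which is an assumption on my part and differs slightly from the class above, where the divisor is sqrt(n - 1).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper used only for this illustration.
public static class FoldStats
{
    // Sample standard deviation (divides by n - 1), as in the class above.
    public static double StandardDeviation(IReadOnlyCollection<double> values)
    {
        var average = values.Average();
        var sumOfSquares = values.Sum(v => (v - average) * (v - average));
        return Math.Sqrt(sumOfSquares / (values.Count - 1));
    }

    // Normal-approximation half-width of a 95% interval: 1.96 * s / sqrt(n).
    // Assumption: textbook form, not the sqrt(n - 1) divisor used in the class above.
    public static double ConfidenceInterval95(IReadOnlyCollection<double> values) =>
        1.96 * StandardDeviation(values) / Math.Sqrt(values.Count);
}
```

For five made-up fold accuracies of 0.86, 0.88, 0.90, 0.88, and 0.88, the mean is 0.88, the standard deviation is about 0.0141, and the interval half-width is about 0.0124. A small spread like this suggests the model behaves consistently across folds, while a large one is a hint that it may be overfitting to particular subsets of the data.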

And that wraps up Part 1. We now have a fully functional SentimentAnalyst class built on ML.NET, capable of loading data, training a model against it, evaluating its performance, and saving it to disk ready for use. In Part 2, we'll build the Trainer project that puts this class to work and produces our trained model.