Ali Gulum

Sentiment Analysis Part 1: The Sentiment Analyst

Machine learning is reshaping technology at a pace that's hard to keep up with. Face detection, voice recognition, and text classification are everywhere, and each is a different flavor of the same underlying idea: teaching software to extract meaning from data.

Sentiment analysis sits comfortably in that landscape. Wikipedia defines it well: sentiment analysis, also known as opinion mining or emotion AI, refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. In practice, it's widely applied to customer reviews, survey responses, social media, and healthcare data, with use cases spanning marketing, customer service, and clinical medicine.

As developers, keeping pace with this means actively bringing these capabilities into the products we build. In this article, we're going to work through one of the most practical and widely used applications of machine learning: sentiment analysis. We'll train a model using the IMDB Movie Reviews dataset, and then build a working example application that uses the trained model to predict whether a user's comment is positive or negative.

This is what the finished application will look like:

The full source code is available here.

The Example Project

The project is split into three parts:

  1. Sentiment Analyst handles all machine learning processes
  2. Trainer is responsible for training the model
  3. Interface handles interaction with the users

What Do We Need?

  1. Microsoft ML.NET
  2. Metro Set UI
  3. Html Agility Pack

Important Note: Our example targets x64. ML.NET supports 64-bit on every platform, while 32-bit support is limited to Windows and excludes some components, so if you try to run the example as x86 you may run into problems.

Project Overview

The goal of this project is to build a practical, working example of machine learning in a real application. We'll train a model on IMDB movie reviews and use it to predict whether a user's comment about a movie is positive or negative.

One important point: we're training on movie reviews, so the model is tuned specifically to that domain. If you point it at Twitter messages or product feedback, the accuracy will drop. If you want to apply sentiment analysis to a different domain, you need to train your model on a dataset that matches that domain.

Part I: Sentiment Analyst

ML.NET supports a range of machine learning algorithms, but for sentiment analysis we only need to focus on the relevant subset of what the framework offers. Rather than scattering ML.NET calls throughout the codebase, we'll build a dedicated helper class that wraps the machine learning functionality in a clean, reusable way. The Trainer we'll build later will also need access to parts of this class, so keeping it organized from the start pays off quickly.

Let's start by creating an empty solution and naming it "Movie Reviews".

After clicking Create, Visual Studio scaffolds an empty solution. Now right-click the solution in Solution Explorer and select Add New Project.

Select Class Library and click Next.

Name the project "Sentiment Analysis" and click Create.

Now we have a solution with a Class Library project in it. Before writing any code, we need to install ML.NET via NuGet. Open the NuGet Package Manager by navigating to Tools > NuGet Package Manager > Manage NuGet Packages for Solution. Make sure you're on the Browse tab, search for Microsoft.ML (the package ID for ML.NET), and install it into the project we just created.

Once installation completes, you should see the ML.NET references listed under the References node in Solution Explorer.

One last step before we write any code: set the Platform Target to x64. Go to the project Properties, navigate to the Build tab, and switch the Platform Target dropdown to x64.

To summarize the setup steps:

  1. Create the solution and name it "Movie Reviews"
  2. Add a Class Library project and name it "Sentiment Analysis"
  3. Install the ML.NET NuGet package
  4. Set Platform Target to x64

With all of that done, here's how the project structure will look once we've created all the necessary files:

Root folder: SentimentAnalyst.cs

Models folder: CrossValidationResult.cs, Data.cs, Definitions.cs, LearningMethodResult.cs, Prediction.cs

Models

CrossValidationResult

A simple container class for returning cross-validation results from the training phase. It has four fields.

public class CrossValidationResult
{
    public string Trainer;
    public double AccuracyAverage;
    public double AccuraciesStdDeviation;
    public double AccuraciesConfidenceInterval95;
}

Data

The class we use to pass training data to ML.NET. Each property is mapped to a column of the source file by its index (LoadColumn) and given an explicit column name (ColumnName). Our training data has two columns: Review and Sentiment.

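using Microsoft.ML.Data;

public class Data
{
    // First column of the file: the review text.
    [ColumnName("review")]
    [LoadColumn(0)]
    public string Review { get; set; }

    // Second column of the file: the label.
    [ColumnName("sentiment")]
    [LoadColumn(1)]
    public bool Sentiment { get; set; }
}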
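One way to check what the attributes do is to load a single in-memory row and list the column names ML.NET sees. This is a minimal sketch, assuming the Microsoft.ML package is installed; the SchemaDemo class, its ColumnNames helper, and the sample review are made up for illustration, and the Data class is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class Data
{
    [ColumnName("review")] [LoadColumn(0)] public string Review { get; set; }
    [ColumnName("sentiment")] [LoadColumn(1)] public bool Sentiment { get; set; }
}

public static class SchemaDemo
{
    // Hypothetical helper: returns the column names of a data view built from Data.
    public static List<string> ColumnNames()
    {
        var mlContext = new MLContext();
        var sample = new[] { new Data { Review = "A made-up review.", Sentiment = true } };

        // LoadFromEnumerable builds the schema from the class, honoring ColumnName.
        IDataView dataView = mlContext.Data.LoadFromEnumerable(sample);
        return dataView.Schema.Select(column => column.Name).ToList();
    }

    public static void Main()
    {
        foreach (var name in ColumnNames())
            Console.WriteLine(name);
    }
}
```

The names reported are the ones given in ColumnName ("review" and "sentiment"), not the C# property names, which is why the rest of the code refers to the columns by those lowercase strings.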

Definitions

Defines the available learning models, allowing us to experiment with different algorithms against the same training data.

public enum Trainers
{
    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm
}
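Because the algorithms are modeled as an enum, code can walk every option with Enum.GetValues, which is the pattern the TrainMultiple method later in this article relies on. The following is a small standalone sketch; TrainerListing and TrainerNames are hypothetical names used only here, and the enum is repeated so the snippet compiles by itself.

```csharp
using System;
using System.Collections.Generic;

public enum Trainers
{
    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm
}

public static class TrainerListing
{
    // Hypothetical helper: returns the enum member names in declaration order.
    public static List<string> TrainerNames()
    {
        var names = new List<string>();
        foreach (var trainer in (Trainers[]) Enum.GetValues(typeof(Trainers)))
            names.Add(trainer.ToString());
        return names;
    }

    public static void Main()
    {
        foreach (var name in TrainerNames())
            Console.WriteLine("Trainer:{0}", name);
    }
}
```

Adding a sixth algorithm then means adding one enum member plus one case wherever the enum is switched on, and any loop written this way picks it up automatically.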

LearningMethodResult

The class returned after a training run completes, carrying the evaluation metrics.

public class LearningMethodResult
{
    public string Trainer;
    public double Accuracy;
    public double AreaUnderRocCurve;
    public double F1Score;
}

Prediction

The class returned after asking ML.NET to classify input data. Note that it inherits from the Data class, so it carries the input fields alongside the prediction output.

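using Microsoft.ML.Data;

public class Prediction : Data
{
    // The predicted class: true for positive, false for negative.
    [ColumnName("PredictedLabel")]
    public bool PredictionValue { get; set; }

    // The raw score produced by the model.
    public float Score { get; set; }
}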

SentimentAnalyst Class

This is where all the work happens. The class has three core responsibilities: loading data, training the model, and making predictions.

We start with two private fields for storing file path information.

private readonly string _dataPath;
private readonly string _modelPath;

Every ML.NET operation flows through a single shared context object.

private readonly MLContext _mlContext;

We also need fields for the transformer, the data view, and the currently selected trainer.

private ITransformer _model;
private IDataView _dataViewPrimary;
private Trainers _targetTrainer;

Finally, two pipeline fields give us the flexibility to support all five learning algorithms. Only one of them is set at a time. _trainingPipelinePlat is used for LbfgsLogisticRegression, SgdCalibrated, and SdcaLogisticRegression, whose models are wrapped in a Platt calibrator. _trainingPipeline is used for AveragedPerceptron and LinearSvm, which produce plain linear models.

private EstimatorChain<BinaryPredictionTransformer<
    CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;

private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;

The constructor initializes everything.

public SentimentAnalyst(string dataPath = null, string modelPath = null)
{
    _mlContext = new MLContext();
    _dataPath = dataPath;
    _modelPath = modelPath;
}

LoadData

Machine learning is only as good as the data behind it. The LoadData function uses ML.NET's LoadFromTextFile method to read the training file and map it to our Data class. It then splits the dataset 80/20 into training and test sets.

private DataOperationsCatalog.TrainTestData LoadData()
{
    IDataView dataView = null;
    if (_dataPath == null)
        throw new Exception("Data Path is undefined");

    // Read the comma-separated file and map each row to the Data class.
    _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
        _dataPath,
        hasHeader: true,
        separatorChar: ',',
        allowQuoting: true
    );

    // Hold back 20% of the rows for testing.
    dataView = _dataViewPrimary;
    var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
    return splitDataView;
}
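To get a feel for what the split produces without needing the IMDB file, the same TrainTestSplit call can be tried on a small in-memory collection. This is a sketch under assumptions: the Row class, the SplitDemo helper, and the synthetic rows are invented for illustration, and the Microsoft.ML package must be installed. Rows are assigned at random, so the test set is only approximately 20% of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical stand-in for the article's Data class.
public class Row
{
    public string Review { get; set; }
    public bool Sentiment { get; set; }
}

public static class SplitDemo
{
    // Splits `total` synthetic rows and returns the size of each part.
    public static (int Train, int Test) SplitCounts(int total)
    {
        var mlContext = new MLContext(seed: 1);
        var rows = Enumerable.Range(0, total)
            .Select(i => new Row { Review = "review " + i, Sentiment = i % 2 == 0 })
            .ToList();

        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);
        var split = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);

        // Materialize each part only to count its rows.
        int train = mlContext.Data
            .CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false).Count();
        int test = mlContext.Data
            .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false).Count();
        return (train, test);
    }

    public static void Main()
    {
        var (train, test) = SplitCounts(1000);
        Console.WriteLine("train={0}, test={1}", train, test);
    }
}
```

Every input row ends up in exactly one of the two sets, so the two counts always add up to the total even though the exact proportions vary.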

Train

Loads the data, builds and trains the model against the training set, evaluates it against the test set, saves the trained model to disk, and returns the evaluation metrics.

public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
{
    var splitDataView = LoadData();

    _targetTrainer = targetTrainer;
    _model = BuildAndTrainModel(splitDataView.TrainSet);

    var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

    // Make sure the target folder exists before saving the model.
    var directoryInfo = new FileInfo(_modelPath).Directory;
    if (directoryInfo != null)
    {
        var path = directoryInfo.FullName;
        if (!Directory.Exists(path))
            Directory.CreateDirectory(path);
    }

    _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);
    return learningMethodResult;
}
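The directory handling in the middle of Train can be tried in isolation. The sketch below uses only System.IO; ModelPathDemo, EnsureParentDirectory, and the temporary path are hypothetical names chosen for illustration. Directory.CreateDirectory does nothing when the directory already exists, so the explicit Directory.Exists check in Train is a readability choice rather than a requirement.

```csharp
using System;
using System.IO;

public static class ModelPathDemo
{
    // Hypothetical helper: makes sure the folder that will hold `filePath`
    // exists and returns that folder's full path.
    public static string EnsureParentDirectory(string filePath)
    {
        if (filePath == null)
            throw new ArgumentNullException(nameof(filePath));

        var directoryInfo = new FileInfo(filePath).Directory;
        if (directoryInfo == null)
            throw new ArgumentException("Path has no parent directory", nameof(filePath));

        // Creates any missing folders; a no-op if the folder is already there.
        Directory.CreateDirectory(directoryInfo.FullName);
        return directoryInfo.FullName;
    }

    public static void Main()
    {
        var modelPath = Path.Combine(Path.GetTempPath(), "movie-reviews-demo", "model.zip");
        Console.WriteLine(EnsureParentDirectory(modelPath));
    }
}
```

Unlike Train, this sketch rejects a null path up front. In the article's class, calling Train without a model path fails at the FileInfo constructor instead.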

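The full SentimentAnalyst class also includes TrainMultiple and CrossValidate, along with LoadTrainedModel for reading a saved model from disk and Predicate and MultiPredicate for classifying one or many comments. TrainMultiple runs all five learning algorithms against the same dataset in one pass so you can compare results. CrossValidate trains and evaluates the current pipeline across several folds of the data, which helps reveal whether the model has overfit to the training data before you ship it to production. It reuses the pipeline and data view from the most recent training run, so call Train first. The complete class code is below.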

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Calibrators;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

public class SentimentAnalyst
{
    private readonly string _dataPath;
    private readonly string _modelPath;

    private readonly MLContext _mlContext;
    private ITransformer _model;
    private IDataView _dataViewPrimary;

    private Trainers _targetTrainer;

    // Pipeline for the trainers that produce Platt-calibrated models.
    private EstimatorChain<BinaryPredictionTransformer<
        CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;

    // Pipeline for the trainers that produce plain linear models.
    private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;

    public SentimentAnalyst(string dataPath = null, string modelPath = null)
    {
        _mlContext = new MLContext();
        _dataPath = dataPath;
        _modelPath = modelPath;
    }

    // Loads a previously saved model from _modelPath, if the file exists.
    public void LoadTrainedModel()
    {
        if (_modelPath == null)
            throw new Exception("Model Path is undefined");

        if (File.Exists(_modelPath))
            _model = _mlContext.Model.Load(_modelPath, out _);
    }

    public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
    {
        var splitDataView = LoadData();

        _targetTrainer = targetTrainer;
        _model = BuildAndTrainModel(splitDataView.TrainSet);

        var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

        // Make sure the target folder exists before saving the model.
        var directoryInfo = new FileInfo(_modelPath).Directory;
        if (directoryInfo != null)
        {
            var path = directoryInfo.FullName;
            if (!Directory.Exists(path))
                Directory.CreateDirectory(path);
        }

        _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);
        return learningMethodResult;
    }

    // Trains and evaluates every algorithm in the Trainers enum on the same split.
    // The models are not saved; _model ends up holding the last one trained.
    public List<LearningMethodResult> TrainMultiple()
    {
        var learningMethodResults = new List<LearningMethodResult>();
        var splitDataView = LoadData();

        foreach (var trainer in (Trainers[]) Enum.GetValues(typeof(Trainers)))
        {
            _targetTrainer = trainer;
            Console.WriteLine("Trainer:{0}", _targetTrainer);
            _model = BuildAndTrainModel(splitDataView.TrainSet);
            learningMethodResults.Add(Evaluate(_model, splitDataView.TestSet));
        }

        return learningMethodResults;
    }

    // Cross-validates the pipeline built by the most recent training run.
    // Throws InvalidOperationException if no pipeline has been built yet.
    public CrossValidationResult CrossValidate(int folds = 5)
    {
        var crossValidationResult = new CrossValidationResult();
        IReadOnlyList<TrainCatalogBase.CrossValidationResult<BinaryClassificationMetrics>> crossValidationResults = null;

        if (_trainingPipelinePlat != null)
            crossValidationResults =
                _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipelinePlat,
                    folds, "sentiment");
        else if (_trainingPipeline != null)
            crossValidationResults =
                _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipeline,
                    folds, "sentiment");

        var metricsInMultipleFolds =
            (crossValidationResults ?? throw new InvalidOperationException()).Select(r => r.Metrics);
        var accuracyValues = metricsInMultipleFolds.Select(m => m.Accuracy);
        var accuracyAverage = accuracyValues.Average();
        var accuraciesStdDeviation = CalculateStandardDeviation(accuracyValues);
        var accuraciesConfidenceInterval95 = CalculateConfidenceInterval95(accuracyValues);

        crossValidationResult.AccuracyAverage = accuracyAverage;
        crossValidationResult.AccuraciesStdDeviation = accuraciesStdDeviation;
        crossValidationResult.AccuraciesConfidenceInterval95 = accuraciesConfidenceInterval95;
        crossValidationResult.Trainer = _targetTrainer.ToString();

        return crossValidationResult;
    }

    // Classifies a single comment.
    public Prediction Predicate(Data sentiment)
    {
        var predictionFunction = _mlContext.Model.CreatePredictionEngine<Data, Prediction>(_model);
        return predictionFunction.Predict(sentiment);
    }

    // Classifies a batch of comments in one pass.
    public IEnumerable<Prediction> MultiPredicate(IEnumerable<Data> sentiments)
    {
        var sentimentPredictionResultList = new List<Prediction>();
        var batchComments = _mlContext.Data.LoadFromEnumerable(sentiments);
        var predictions = _model.Transform(batchComments);
        var predictedResults = _mlContext.Data.CreateEnumerable<Prediction>(predictions, false);

        // Only the prediction fields are copied; the input fields are left unset.
        foreach (var prediction in predictedResults)
        {
            var sentimentPrediction = new Prediction
            {
                PredictionValue = prediction.PredictionValue,
                Score = prediction.Score
            };
            sentimentPredictionResultList.Add(sentimentPrediction);
        }
        return sentimentPredictionResultList;
    }

    private DataOperationsCatalog.TrainTestData LoadData()
    {
        IDataView dataView = null;

        if (_dataPath == null)
            throw new Exception("Data Path is undefined");

        // Read the comma-separated file and map each row to the Data class.
        _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
            _dataPath,
            hasHeader: true,
            separatorChar: ',',
            allowQuoting: true
        );

        // Hold back 20% of the rows for testing.
        dataView = _dataViewPrimary;
        var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
        return splitDataView;
    }

    private ITransformer BuildAndTrainModel(IDataView splitTrainSet)
    {
        // Turn the review text into a numeric feature vector, normalize it,
        // and cache the result so the trainer does not recompute it on each pass.
        var dataProcessPipeline = _mlContext.Transforms.Text.FeaturizeText("review_tf", "review")
            .Append(_mlContext.Transforms.CopyColumns("Features", "review_tf"))
            .Append(_mlContext.Transforms.NormalizeMinMax("Features", "Features")
                .AppendCacheCheckpoint(_mlContext));

        // Each case sets one pipeline field and clears the other.
        switch (_targetTrainer)
        {
            case Trainers.LbfgsLogisticRegression:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.LbfgsLogisticRegression("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.SgdCalibrated:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.SgdCalibrated("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.SdcaLogisticRegression:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.SdcaLogisticRegression("sentiment");
                _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                _trainingPipeline = null;
                return _trainingPipelinePlat.Fit(splitTrainSet);
            }
            case Trainers.AveragedPerceptron:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.AveragedPerceptron("sentiment");
                _trainingPipeline = dataProcessPipeline.Append(trainer);
                _trainingPipelinePlat = null;
                return _trainingPipeline.Fit(splitTrainSet);
            }
            case Trainers.LinearSvm:
            {
                var trainer = _mlContext.BinaryClassification.Trainers.LinearSvm("sentiment");
                _trainingPipeline = dataProcessPipeline.Append(trainer);
                _trainingPipelinePlat = null;
                return _trainingPipeline.Fit(splitTrainSet);
            }
            default:
                throw new ArgumentOutOfRangeException(nameof(_targetTrainer), _targetTrainer, null);
        }
    }

    // Scores the test set with the trained model and collects the metrics.
    private LearningMethodResult Evaluate(ITransformer model, IDataView splitTestSet)
    {
        var learningMethodResult = new LearningMethodResult();
        var predictions = model.Transform(splitTestSet);
        var metrics = _mlContext.BinaryClassification.EvaluateNonCalibrated(predictions, "sentiment");

        learningMethodResult.Accuracy = metrics.Accuracy;
        learningMethodResult.AreaUnderRocCurve = metrics.AreaUnderRocCurve;
        learningMethodResult.F1Score = metrics.F1Score;
        learningMethodResult.Trainer = _targetTrainer.ToString();

        return learningMethodResult;
    }

    // Sample standard deviation (divides by n - 1).
    private static double CalculateStandardDeviation(IEnumerable<double> values)
    {
        var average = values.Average();
        var sumOfSquaresOfDifferences = values.Select(val => (val - average) * (val - average)).Sum();
        var standardDeviation = Math.Sqrt(sumOfSquaresOfDifferences / (values.Count() - 1));
        return standardDeviation;
    }

    // Half-width of a 95% interval; 1.96 is the normal z-value.
    // Note the sqrt(n - 1) divisor; the usual standard error divides by sqrt(n).
    private static double CalculateConfidenceInterval95(IEnumerable<double> values)
    {
        var confidenceInterval95 = 1.96 * CalculateStandardDeviation(values) / Math.Sqrt(values.Count() - 1);
        return confidenceInterval95;
    }
}
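To make the two statistics helpers at the end of the class concrete, here is a standalone sketch that applies the same formulas to five made-up fold accuracies. The numbers are illustrative only, not results from the IMDB dataset, and FoldStatistics is a hypothetical name. It uses the sample standard deviation (dividing by n - 1) and the same 1.96 * s / sqrt(n - 1) expression as the class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldStatistics
{
    // Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1)).
    public static double StandardDeviation(IReadOnlyCollection<double> values)
    {
        var average = values.Average();
        var sumOfSquares = values.Sum(v => (v - average) * (v - average));
        return Math.Sqrt(sumOfSquares / (values.Count - 1));
    }

    // Same expression as the article's helper: 1.96 * s / sqrt(n - 1).
    public static double ConfidenceInterval95(IReadOnlyCollection<double> values)
    {
        return 1.96 * StandardDeviation(values) / Math.Sqrt(values.Count - 1);
    }

    public static void Main()
    {
        // Made-up accuracies for five folds.
        var accuracies = new[] { 0.86, 0.88, 0.87, 0.89, 0.85 };

        Console.WriteLine("Average: {0:F4}", accuracies.Average());
        Console.WriteLine("Std dev: {0:F4}", StandardDeviation(accuracies));
        Console.WriteLine("CI 95:   {0:F4}", ConfidenceInterval95(accuracies));
    }
}
```

For these inputs the average is 0.87, the standard deviation is about 0.0158, and the interval half-width is about 0.0155. A small spread across folds like this suggests the model behaves consistently on different slices of the data.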

And that wraps up Part 1. We now have a fully functional SentimentAnalyst class built on ML.NET, capable of loading data, training a model against it, evaluating its performance, and saving it to disk ready for use. In Part 2, we'll build the Trainer project that puts this class to work and produces our trained model.