Sentiment Analysis Part 2: The Trainer

In the previous article, we finished the first part of our example project. We now have a SentimentAnalyst class that handles both training and prediction: all we need to do is pass the right data to it.

Before jumping into the Trainer project, let's talk briefly about the dataset we'll be using.

The Dataset

The dataset comes from Kaggle and is in .csv format, which ML.NET can work with directly. You can download the dataset from this link, and the full source code is available here.

There's one preprocessing step worth noting: the raw dataset uses the strings "positive" and "negative" as sentiment labels, but ML.NET expects numeric class labels. So before using it, we replace all "negative" values with 0 and all "positive" values with 1. The updated dataset is what we'll be working with throughout this article.
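The relabeling step can be scripted rather than done by hand. The sketch below is one possible approach, not necessarily how the dataset for this article was prepared. It assumes the sentiment label is the last column of each record and that each record sits on a single line; `LabelPreprocessor`, `RelabelLine`, and `RelabelFile` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LabelPreprocessor
{
    // Rewrites the trailing sentiment label of one CSV line:
    // "positive" becomes 1 and "negative" becomes 0.
    // Lines ending with neither label (such as a header row) are returned unchanged.
    public static string RelabelLine(string line)
    {
        if (line.EndsWith(",positive", StringComparison.Ordinal))
            return line.Substring(0, line.Length - "positive".Length) + "1";
        if (line.EndsWith(",negative", StringComparison.Ordinal))
            return line.Substring(0, line.Length - "negative".Length) + "0";
        return line;
    }

    // Streams the raw file line by line and writes a relabeled copy.
    // Assumption: inputPath and outputPath are different files.
    public static void RelabelFile(string inputPath, string outputPath)
    {
        File.WriteAllLines(outputPath, File.ReadLines(inputPath).Select(RelabelLine));
    }
}
```

Only the final field is touched, so commas and the words "positive" or "negative" inside the review text are left alone.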

The Trainer Project

The Trainer is the second and most critical part of the project. Every prediction the final application makes depends entirely on the quality of what happens here. A well-trained model produces reliable results; a poorly trained one won't, regardless of how well everything else is built.

Let's add the Trainer to the solution. Right-click the solution in Solution Explorer, select Add New Project, choose Console App, and click Next.

Name the project "Trainer" and click Create.

Once the project is created and added to the solution, create a Data folder inside the Trainer project and copy the preprocessed dataset into it. Solution Explorer should now show IMDBDataset.csv inside the Data folder of the Trainer project.

There's one important configuration step here. Click on IMDBDataset.csv in Solution Explorer, then find the "Copy to Output Directory" setting in the Properties window and change it to "Copy if newer". Without this, the Trainer won't be able to find the dataset at runtime.

Writing the Code

The Trainer project is intentionally thin. The heavy lifting, all the actual machine learning work, is handled by the SentimentAnalyst class we built in Part 1. The Trainer's job is simply to pass the right parameters to it, kick off the training process, and display the results.

We start with a single field:

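// true: save the model for the Debug build of the Interface project; false: Release.
private static readonly bool _debugMode = true;

This flag controls where the trained model is saved: the Debug or the Release output folder of the Interface project. That way the Interface project can find the model automatically, without any manual file copying.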

GetParentDirectory

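This helper function resolves the solution's root path at runtime by walking four directory levels up from the current working directory, so the Trainer knows exactly where to save the trained model after training completes. If the working directory is too shallow to walk up that far, it returns an empty string.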

private static string GetParentDirectory()
{
    // Walk four levels up from the working directory. Null-conditional access
    // avoids a NullReferenceException when the path is too shallow.
    var root = Directory.GetParent(Directory.GetCurrentDirectory())?.Parent?.Parent?.Parent;
    return root?.FullName ?? string.Empty;
}
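If your folder layout differs, the number of levels to walk up will differ too. As a rough sketch, the same idea can be written with the depth as a parameter; `PathHelper` and `Ancestor` are hypothetical names and are not part of the project.

```csharp
using System.IO;

public static class PathHelper
{
    // Returns the directory that sits `levels` above `path`,
    // or an empty string if the path is not deep enough.
    public static string Ancestor(string path, int levels)
    {
        var dir = new DirectoryInfo(path);
        for (var i = 0; i < levels && dir != null; i++)
            dir = dir.Parent; // Parent is null once the filesystem root is reached
        return dir?.FullName ?? string.Empty;
    }
}
```

Calling `PathHelper.Ancestor(Directory.GetCurrentDirectory(), 4)` mirrors what GetParentDirectory does, and changing the second argument adapts it to a deeper or shallower output folder.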

Train

This is the core function. It passes the dataset to SentimentAnalyst, triggers the training process, and prints the evaluation metrics to the console so we can immediately see how the model performed.

private static void Train(SentimentAnalyst sentimentAnalyst)
{
    var trainingResult = sentimentAnalyst.Train();
    Console.WriteLine("===============================================");
    Console.WriteLine("Accuracy:{0}", trainingResult.Accuracy);
    Console.WriteLine("AreaUnderRocCurve:{0}", trainingResult.AreaUnderRocCurve);
    Console.WriteLine("F1Score:{0}", trainingResult.F1Score);
    Console.WriteLine("===============================================");
}
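To make two of these numbers less abstract, here is a small sketch of how accuracy and F1 score follow from the counts of true and false positives and negatives. It uses the standard textbook definitions and is not taken from SentimentAnalyst or ML.NET; `BinaryMetrics` is a hypothetical helper. The area under the ROC curve is left out because it needs the per-example scores, not just the counts.

```csharp
public static class BinaryMetrics
{
    // Share of all predictions that were correct.
    public static double Accuracy(int tp, int fp, int tn, int fn)
        => (double)(tp + tn) / (tp + fp + tn + fn);

    // Harmonic mean of precision and recall, written in its count form:
    // F1 = 2*TP / (2*TP + FP + FN). Returns 0 when there are no positives at all.
    public static double F1Score(int tp, int fp, int fn)
    {
        var denominator = 2.0 * tp + fp + fn;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }
}
```

For example, with 40 true positives, 10 false positives, 45 true negatives, and 5 false negatives (made-up counts), accuracy is 0.85 and F1 is about 0.842.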

TrainMultiple (optional)

Rather than committing to a single learning algorithm upfront, TrainMultiple runs all five supported trainers (LbfgsLogisticRegression, SgdCalibrated, SdcaLogisticRegression, AveragedPerceptron, and LinearSvm) against the same dataset in one pass and prints their metrics together, sorted by accuracy. This is useful for identifying which algorithm fits your data best before settling on one for production.

private static void TrainMultiple(SentimentAnalyst sentimentAnalyst)
{
    Console.WriteLine("Multiple Training");
    var trainingResults = sentimentAnalyst.TrainMultiple();
    Console.WriteLine("*".PadRight(105, '*'));
    Console.WriteLine("*       Training Results");
    Console.WriteLine("*".PadRight(105, '-'));
    // OrderBy (requires System.Linq) sorts ascending, so the most accurate trainer is printed last.
    foreach (var trainingResult in trainingResults.OrderBy(x => x.Accuracy))
    {
        Console.WriteLine($"* Trainer: {trainingResult.Trainer}");
        // "0.###" keeps the leading zero; "#.###" would print 0.5 as ".5".
        Console.WriteLine($"* Accuracy: {trainingResult.Accuracy:0.###}  - Area Under Roc Curve: ({trainingResult.AreaUnderRocCurve:0.###})  - F1 Score: ({trainingResult.F1Score:0.###})");
    }
    Console.WriteLine("*".PadRight(105, '*'));
}
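If you would rather pick the winner in code than read it off the console, a sketch along these lines could work. The `TrainerScore` type below is a hypothetical stand-in for whatever result type TrainMultiple returns in Part 1, reduced to the two properties the listing above uses.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the per-trainer result type.
public sealed class TrainerScore
{
    public TrainerScore(string trainer, double accuracy)
    {
        Trainer = trainer;
        Accuracy = accuracy;
    }

    public string Trainer { get; }
    public double Accuracy { get; }
}

public static class TrainerComparison
{
    // Returns the entry with the highest accuracy.
    // Throws InvalidOperationException if the sequence is empty.
    public static TrainerScore Best(IEnumerable<TrainerScore> scores)
        => scores.OrderByDescending(s => s.Accuracy).First();
}
```

Accuracy alone can mislead on imbalanced data, so in practice you may want to rank by F1 score or area under the ROC curve instead.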

CrossValidation (optional)

Cross-validation is a reliable way to check whether a model has overfit to the training data before shipping it to production. It splits the dataset into several folds, trains on all but one of them, evaluates on the held-out fold, and repeats until every fold has been held out once. If the cross-validation accuracy is significantly lower than the training accuracy, that's a signal the model has memorized the training set rather than learned generalizable patterns.

private static void CrossValidation(SentimentAnalyst sentimentAnalyst)
{
    Console.WriteLine("Cross Validating");
    var validationResult = sentimentAnalyst.CrossValidate();
    Console.WriteLine("*".PadRight(105, '*'));
    Console.WriteLine("*       Metrics for Cross Validation");
    Console.WriteLine("*".PadRight(105, '-'));
    Console.WriteLine($"Trainer: {validationResult.Trainer}");
    // "0.###" keeps the leading zero; "#.###" would print 0.05 as ".05".
    Console.WriteLine($"*       Average Accuracy: {validationResult.AccuracyAverage:0.###}  - Standard Deviation: ({validationResult.AccuraciesStdDeviation:0.###})  - Confidence Interval 95%: ({validationResult.AccuraciesConfidenceInterval95:0.###})");
    Console.WriteLine("*".PadRight(105, '*'));
}
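The three numbers printed above summarize the per-fold accuracies. How SentimentAnalyst computes them is defined in Part 1; the sketch below shows one common way to derive such values, assuming a sample standard deviation and a normal-approximation 95% interval. `FoldStatistics` is a hypothetical helper, and the exact formulas in the project may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldStatistics
{
    // Average accuracy across folds. Requires at least one value.
    public static double Mean(IReadOnlyList<double> values) => values.Average();

    // Sample standard deviation (divides by n - 1). Returns 0 for fewer than two folds.
    public static double SampleStdDev(IReadOnlyList<double> values)
    {
        if (values.Count < 2) return 0.0;
        var mean = values.Average();
        var sumOfSquares = values.Sum(v => (v - mean) * (v - mean));
        return Math.Sqrt(sumOfSquares / (values.Count - 1));
    }

    // Half-width of a 95% confidence interval under a normal approximation:
    // 1.96 * s / sqrt(n).
    public static double ConfidenceInterval95(IReadOnlyList<double> values)
        => 1.96 * SampleStdDev(values) / Math.Sqrt(values.Count);
}
```

With made-up fold accuracies of 0.8, 0.9, and 1.0, the mean is 0.9, the standard deviation is 0.1, and the interval half-width is roughly 0.113. A small standard deviation means the model performs consistently across folds.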

Main

The entry point ties everything together. It builds the file paths for the dataset and the output model, initializes the SentimentAnalyst, and kicks off training. TrainMultiple and CrossValidation are left in as commented-out options: uncomment either one if you want to use them.

private static void Main(string[] args)
{
    var trainingDataFile = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Data", "IMDBDataset.csv");

    // Save the model into the Debug or Release output folder of the Interface project.
    var configuration = _debugMode ? "Debug" : "Release";
    var modelDataFile = Path.Combine(GetParentDirectory(),
        $@"Movie Reviews\Movie Reviews\bin\{configuration}\Data",
        "model.zip");

    var sentimentAnalyst = new SentimentAnalyst(trainingDataFile, modelDataFile);
    Console.WriteLine("Training");
    Train(sentimentAnalyst);

    // If you want to see how other models perform:
    // TrainMultiple(sentimentAnalyst);

    // If you want to run cross-validation:
    // CrossValidation(sentimentAnalyst);

    Console.WriteLine("Completed");
    Console.ReadLine();
}

And that's the Trainer. Run it, wait for training to complete, and you'll see the accuracy metrics printed to the console. The trained model gets saved automatically to the output folder where the Interface project will look for it.

In Part 3, we'll build the Interface: the user-facing layer that loads the trained model and puts it to work.
