.NET Tools

Getting started with ML.NET

Machine learning continues to be a hot topic among developers and non-developers alike. Regardless of your opinion, our AI-powered companions are here to stay, with applied machine learning models such as ChatGPT, GitHub Copilot, and Midjourney captivating millions of users worldwide. While machine learning may seem like a mysterious black box of magic, many of these models operate on a combination of basic tenets of machine learning: data mining, algorithms, and optimization.

While I’m no machine learning expert, I’ve explored the space and have a working understanding of its application in .NET. In this post, I’ll introduce you to ML.NET, a library developed by Microsoft to train, optimize, and deploy machine learning models based on your datasets.

Types of Machine Learning

Three recognized categories of machine learning models are supervised, unsupervised, and semi-supervised. Understanding your ultimate goal can help you pick the appropriate approach for your needs.

With supervised machine learning, you typically spend the majority of your effort curating a dataset. This learning method involves labeling and cleaning datasets to ensure the information you train your model on is as accurate as possible. The adage of “garbage in, garbage out” very much holds here.

In a supervised training session, you use a portion of your data to train the model and another portion to validate the prediction results. Finding the best fit is crucial, and cleaner, more accurate data typically produces better models. It’s common to see text-based datasets on the order of gigabytes.

Once your data is labeled, you can build neural networks, linear regression models, logistic regression models, random forests, or other approaches. These models power the recommendation engines you see on your favorite streaming services and online shopping outlets.

You can use unsupervised learning to determine patterns in unlabeled datasets. Using these techniques, you can uncover information you weren’t aware was there. These models are common in pattern and image recognition. If you’ve ever had to prove you’re not a robot, you’ve encountered (and trained) these models.

Finally, semi-supervised learning mixes the previously mentioned approaches to provide an unsupervised learning environment with guard rails. The IBM documentation states:

“During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data for a supervised learning algorithm. It also helps if it’s too costly to label enough data.”

IBM documentation

Today’s typical machine learning applications include speech recognition, image detection, chatbots, recommendation engines, and fraud detection. As a result, you almost inevitably interact with a model somewhere in your daily activities.

Regardless of which approach you ultimately decide on, you’ll still have to do a lot of work with data and think critically about your model’s output. Just because a machine does it doesn’t absolve you of the ethical implications of your model.

What is ML.NET?

ML.NET is an open-source machine learning framework for .NET applications. It is a set of APIs that can help you train, build, and ship custom machine learning models. In addition to building custom models, you can use ML.NET to import models from other ecosystems, such as those built with TensorFlow or Infer.NET, or saved in the Open Neural Network Exchange (ONNX) format.

These ecosystems have rich pre-trained models for image classification, object detection, and speech recognition. Starting with existing models and optimizing them is common in the machine learning space, and ML.NET makes that straightforward. Most teams lack the resources to train current-generation models in these areas from scratch, so fine-tuning existing models lets them benefit from the knowledge those models have captured while adapting them to their own problem space.
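As a taste of what importing looks like, here’s a minimal sketch of pointing an ML.NET pipeline at an ONNX file. It assumes you’ve added the Microsoft.ML.OnnxTransformer NuGet package, and model.onnx is a placeholder path, not a file from this tutorial:

// minimal sketch: pulling a pre-trained ONNX model into an ML.NET pipeline
// requires the Microsoft.ML.OnnxTransformer NuGet package
var onnxContext = new MLContext();
var onnxPipeline = onnxContext.Transforms
    .ApplyOnnxModel(modelFile: "model.onnx"); // placeholder path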

You can use ML.NET from C# and F# applications, in multiple host environments, including desktop, mobile, and web.

ML.NET also includes a utility named AutoML. With AutoML, you provide a dataset to a command-line interface, choose your intent, train a model, and verify its predictive outcomes. Once complete, AutoML can also generate boilerplate code for consuming your new model in an existing application.
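For instance, once you install the ML.NET CLI as a global tool, a classification experiment over the dataset used later in this post might look like the following. Treat the exact flags as a sketch and check mlnet --help for your installed version:

# install the ML.NET CLI as a global .NET tool
dotnet tool install -g mlnet

# sketch: let AutoML explore classification models for 30 seconds
mlnet classification --dataset "yelp_labelled.txt" --label-col 1 --has-header false --train-time 30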

Your First ML.NET Application

In this sample, you’ll use ML.NET to perform sentiment analysis. I adapted this from the original tutorial in the Microsoft documentation, clarifying the steps required to train, fit, and use a model. You’ll be using .NET 7 and Spectre.Console to build a neat predictive REPL.

A personal word of caution: ML.NET was written by data scientists for data scientists, so some code constructs may feel idiomatically strange to many C# developers.

Start by creating a new console application and adding the ML.NET and Spectre.Console dependencies. The packages are Microsoft.ML and Spectre.Console, respectively.
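From a terminal, that boils down to a few dotnet commands (the project name SentimentRepl is just a suggestion):

dotnet new console -n SentimentRepl
cd SentimentRepl
dotnet add package Microsoft.ML
dotnet add package Spectre.Console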

Every machine learning model starts with data. In this tutorial, you’ll use 1,000 lines of Yelp reviews to build a model that predicts whether new reviews skew positive or negative. Here’s a piece of the tab-delimited Yelp sentiment dataset, which contains restaurant reviews, each labeled 1 (positive) or 0 (negative).

Wow... Loved this place.	1
Crust is not good.	0
Not tasty and the texture was just nasty.	0

Download the Yelp review sentiment dataset and add it to your console application as a file to be copied to your output directory. The creators have already labeled the data with positive and negative labels, but feel free to look at it in your favorite spreadsheet editor if you’re curious.

Once complete, your .csproj file should look like the following, though your version numbers may differ.

<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net7.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>

    <ItemGroup>
      <PackageReference Include="Microsoft.ML" Version="2.0.0" />
      <PackageReference Include="Spectre.Console" Version="0.45.0" />
    </ItemGroup>

    <ItemGroup>
      <None Update="yelp_labelled.txt">
        <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      </None>
    </ItemGroup>

</Project>

Let’s start writing some code. Every ML.NET application begins with an MLContext. If you’re familiar with Entity Framework Core, you can think of this instance as your “Unit of Work”. It will contain all your data, the trained model, and its prediction statistics.

At the beginning of the file, add the following lines.

using Microsoft.ML;
using Microsoft.ML.Data;
using Spectre.Console;
using Console = Spectre.Console.AnsiConsole;

var ctx = new MLContext();

Our next step is to load our sentiment data from yelp_labelled.txt. Then, immediately below our context, add the following line.

// load data
var dataView = ctx.Data
    .LoadFromTextFile<SentimentData>("yelp_labelled.txt");
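LoadFromTextFile defaults to tab-separated values with no header row, which matches our dataset. If your data arrives in a different shape, the same method takes optional parameters. For example, a hypothetical comma-separated file with a header row could be loaded like this:

// hypothetical: loading a comma-separated file with a header row
var csvView = ctx.Data
    .LoadFromTextFile<SentimentData>(
        "reviews.csv",        // placeholder path
        separatorChar: ',',   // defaults to tab ('\t')
        hasHeader: true);     // defaults to false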

Next, you’ll need a type to map the data from the text file into your .NET application. Add the following types to the end of your Program.cs.

class SentimentData
{
    [LoadColumn(0)] public string? Text;
    [LoadColumn(1), ColumnName("Label")] public bool Sentiment;
}

class SentimentPrediction : SentimentData
{
    [ColumnName("PredictedLabel")] public bool Prediction { get; set; }
    public float Probability { get; set; }
    public float Score { get; set; }
}

You may have noticed these types include attributes. These attributes help ML.NET load the data into the correct members and, in the case of ColumnName, apply metadata that you’ll use to train the model. ML.NET relies on a few conventional column names, such as Label and Features, but in most cases you can explicitly pass arguments to change the training process.

Our next step is to split our data into two parts: training and testing. ML.NET offers a data structure known as TrainTestData, which exposes two IDataView properties, TrainSet and TestSet. You can decide how much data each set will contain; let’s reserve 20% for test data in this sample.

// split data into testing set
var splitDataView = ctx.Data
    .TrainTestSplit(dataView, testFraction: 0.2);

Jodie Burchell, our resident data scientist, recommends creating individual sets for training, validation, and testing as you begin to train models. Train sets are for training, validation sets for assessing the performance of multiple competing models, and test sets for checking how your model will perform in real-world settings. This keeps your performance estimates honest, because no model gets tuned against the data used to judge it.
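ML.NET has no dedicated three-way split, but you can get one by splitting twice. Here’s a minimal sketch, assuming a 60/20/20 split; the variable names are my own:

// sketch: carve 60% train / 20% validation / 20% test with two splits
var trainTest = ctx.Data.TrainTestSplit(dataView, testFraction: 0.2);
var trainValidation = ctx.Data.TrainTestSplit(trainTest.TrainSet, testFraction: 0.25);

var trainSet = trainValidation.TrainSet;      // 60% of the original data
var validationSet = trainValidation.TestSet;  // 20%
var testSet = trainTest.TestSet;              // 20%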

Now that you have the test data, you’ll need to build a training model. First, specify the training pipeline for your data. Since you’re doing sentiment analysis, binary classification makes the most sense here: reviews are either positive or negative. Add the following lines below the splitDataView variable.

// Build model
var estimator = ctx.Transforms.Text
    .FeaturizeText(
        outputColumnName: "Features",
        inputColumnName: nameof(SentimentData.Text)
    ).Append(ctx.BinaryClassification.Trainers.SdcaLogisticRegression(featureColumnName: "Features"));

Here you’re setting up a processing pipeline that takes the tab-delimited information and “featurizes” it. Featurizing text extracts numeric values meant to represent the data. The simplest form of featurization takes all the words in the collection and indicates whether the text contains specific words; this is known as count vectorization. Once featurized, you can pass the values over to be processed by a trainer.
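As a loose illustration (FeaturizeText’s real output uses normalized n-gram counts, so this is a simplification), count vectorization turns each review into word counts over a fixed vocabulary:

// simplified illustration of count vectorization
// vocabulary: [ "crust", "is", "not", "good", "tasty" ]
// "Crust is not good."       -> [ 1, 1, 1, 1, 0 ]
// "Not tasty and not good."  -> [ 0, 0, 2, 1, 1 ]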

You can choose from multiple trainers, and you should experiment to find the one with the best outcome. You’ll use the SdcaLogisticRegression trainer in this sample because it yielded the most accurate results for this dataset.
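Swapping trainers only changes the last call in the pipeline. For example, here are two alternatives that ship in the base Microsoft.ML package; these are illustrations, not recommendations (note that AveragedPerceptron doesn’t produce calibrated probabilities out of the box):

// alternative binary classification trainers to experiment with
var lbfgs = ctx.BinaryClassification.Trainers
    .LbfgsLogisticRegression(featureColumnName: "Features");
var perceptron = ctx.BinaryClassification.Trainers
    .AveragedPerceptron(featureColumnName: "Features");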

So you’ve set up your processing pipeline. Now it’s time to fit the model to your data and test its prediction accuracy.

// Train model
ITransformer model = default!;

var rule = new Rule("Create and Train Model");
Console
    .Live(rule)
    .Start(console =>
    {
        // training happens here
        model = estimator.Fit(splitDataView.TrainSet);
        var predictions = model.Transform(splitDataView.TestSet);

        rule.Title = "🏁 Training Complete, Evaluating Accuracy.";
        console.Refresh();

        // evaluate the accuracy of our model
        var metrics = ctx.BinaryClassification.Evaluate(predictions);

        var table = new Table()
            .MinimalBorder()
            .Title("💯 Model Accuracy");
        table.AddColumns("Accuracy", "Auc", "F1Score");
        table.AddRow($"{metrics.Accuracy:P2}", $"{metrics.AreaUnderRocCurve:P2}", $"{metrics.F1Score:P2}");

        console.UpdateTarget(table);
        console.Refresh();
    });

You’re using the TrainTestData instance to fit the model and test the prediction results. You can also ask the MLContext to evaluate the metrics of your classification. Finally, for visualization, you can write them to the output using Spectre.Console’s table. If you’ve followed along correctly, you can now run your application, which should show accuracy metrics similar to the following.

       💯 Model Accuracy

  Accuracy │ Auc    │ F1Score
 ──────────┼────────┼─────────
  83.96%   │ 90.06% │ 84.38%

An accuracy of 83.96% could be better, but our dataset is limited to 1,000 rows, with roughly 800 records to train our model. (Accuracy is the share of correct predictions, AUC measures how well the model ranks positive examples above negative ones, and F1 balances precision and recall.) Your numbers may differ based on the randomization of training and test data. That said, it’s enough for the use case of this demo.

Now that you have a trained model, let’s add the REPL experience. In the final step, add the following code after the previous code, outside the Start method.

// create the prediction engine once; it's reusable (though not thread-safe)
var engine = ctx.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);

while (true)
{
    var text = AnsiConsole.Ask<string>("What's your [green]review text[/]?");

    var input = new SentimentData { Text = text };
    var result = engine.Predict(input);
    var style = result.Prediction
        ? (color: "green", emoji: "👍")
        : (color: "red", emoji: "👎");

    Console.MarkupLine($"{style.emoji} [{style.color}]\"{text}\" ({result.Probability:P00})[/] ");
}

Rerunning the application, you can exercise the trained sentiment analysis model.

What's your review text? I love this Pizza
👍 "I love this Pizza" (100%) 
What's your review text? This lettuce is bad
👎 "This lettuce is bad" (15%) 

You’ll get some false positives, but that’s expected given the limited dataset.

While you could retrain your model every time you need it, datasets are usually large and training takes time. To make interacting with your trained model more efficient and speed up startup times, you can save the trained model to disk and load it later with a few lines of additional code:

// save to disk
ctx.Model.Save(model, dataView.Schema, "model.zip");

// load from disk; the returned ITransformer is the trained model
var loadedModel = ctx.Model.Load("model.zip", out var inputSchema);

You can also save models via streams to remote storage services, such as Azure Blob Storage or AWS S3 buckets.
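Save has an overload that writes to any Stream. Here’s a minimal sketch, with MemoryStream standing in for whatever stream your storage SDK provides:

// sketch: saving the model to a stream instead of a file path
using var stream = new MemoryStream(); // stand-in for a remote blob stream
ctx.Model.Save(model, dataView.Schema, stream);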

I’ve included the full application here on GitHub and the complete Program.cs below.

using Microsoft.ML;
using Microsoft.ML.Data;
using Spectre.Console;
using Console = Spectre.Console.AnsiConsole;

var ctx = new MLContext();

// load data
var dataView = ctx.Data
    .LoadFromTextFile<SentimentData>("yelp_labelled.txt");

// split data into testing set
var splitDataView = ctx.Data
    .TrainTestSplit(dataView, testFraction: 0.2);

// Build model
var estimator = ctx.Transforms.Text
    .FeaturizeText(
        outputColumnName: "Features",
        inputColumnName: nameof(SentimentData.Text)
    ).Append(ctx.BinaryClassification.Trainers.SdcaLogisticRegression(featureColumnName: "Features"));

// Train model
ITransformer model = default!;

var rule = new Rule("Create and Train Model");
Console
    .Live(rule)
    .Start(console =>
    {
        // training happens here
        model = estimator.Fit(splitDataView.TrainSet);
        var predictions = model.Transform(splitDataView.TestSet);

        rule.Title = "🏁 Training Complete, Evaluating Accuracy.";
        console.Refresh();

        // evaluate the accuracy of our model
        var metrics = ctx.BinaryClassification.Evaluate(predictions);

        var table = new Table()
            .MinimalBorder()
            .Title("💯 Model Accuracy");
        table.AddColumns("Accuracy", "Auc", "F1Score");
        table.AddRow($"{metrics.Accuracy:P2}", $"{metrics.AreaUnderRocCurve:P2}", $"{metrics.F1Score:P2}");

        console.UpdateTarget(table);
        console.Refresh();
    });

// create the prediction engine once; it's reusable (though not thread-safe)
var engine = ctx.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);

while (true)
{
    var text = AnsiConsole.Ask<string>("What's your [green]review text[/]?");

    var input = new SentimentData { Text = text };
    var result = engine.Predict(input);
    var style = result.Prediction
        ? (color: "green", emoji: "👍")
        : (color: "red", emoji: "👎");

    Console.MarkupLine($"{style.emoji} [{style.color}]\"{text}\" ({result.Probability:P00})[/] ");
}

class SentimentData
{
    [LoadColumn(0)] public string? Text;
    [LoadColumn(1), ColumnName("Label")] public bool Sentiment;
}

class SentimentPrediction : SentimentData
{
    [ColumnName("PredictedLabel")] public bool Prediction { get; set; }
    public float Probability { get; set; }
    public float Score { get; set; }
}

Conclusion

Congratulations, you just wrote your first ML.NET-powered application. If you’ve run it, you’ll have noticed the accuracy could be better. Accuracy depends on your dataset, labels, and algorithms, and that’s where the “science” plays an essential part in building these models. It’s important to continuously test and verify your results and tune your models as you get new information. ML.NET makes testing easy, and deploying and consuming models is just as straightforward.

I’d love to hear more about your ML.NET journey as you try to build your custom models and deploy them in real-world settings. As always, thanks for reading, and feel free to leave a comment below.
