A few months ago we released ML.NET 0.1 at //Build 2018. ML.NET is a cross-platform, open source machine learning framework for .NET developers. We’ve gotten great feedback so far and would like to thank the community for your engagement as we continue to develop ML.NET together in the open.
We are happy to announce the latest version: ML.NET 0.4. In this release we’ve improved support for natural language processing (NLP) scenarios by adding the Word Embedding Transform, improved the speed of linear learners by adding the SymSGD learner for binary classification, improved the F# API and samples for ML.NET, fixed bugs, and more.
Additionally, we want your feedback on making ML.NET easy to use. We are working on a new API that improves flexibility and ease of use. Once the new API is ready, we plan to deprecate the current “pipeline” API. Because this will be a significant change, we want to share our proposals for the API options, with comparisons between them, in a future blog post. That post will start an open discussion where you can provide feedback and help shape the long-term API for ML.NET.
The rest of this post provides more details about the additions in the 0.4 release:
- Word Embedding Transform for Text Scenarios
- SymSGD Learner for Binary Classification
- Improvements to F# API and samples for ML.NET
Word Embedding Transform for Text Scenarios
Word embedding is a technique for mapping words to numeric vectors that are intended to capture some of the meaning of the words, so that the vectors can be used for visualization or model training.
The word embedding transform added to ML.NET enables using pretrained word embedding models in pipelines. “Pretrained” means you can use existing embeddings instead of needing to create your own (which takes a lot of data and time). Several different pretrained models are available (GloVe, fastText, and SSWE).
Using this transform alongside the existing transforms for working with text (like the TextFeaturizer) can improve the model’s metrics.
For example, we can improve the accuracy of the sentiment analysis sample by 5% if we replace the line that adds the TextFeaturizer with:
// Change TextFeaturizer to output tokens (list of words in the text)
pipeline.Add(new TextFeaturizer("FeaturesA", "SentimentText") { OutputTokens = true });
// Add word embeddings (the tokens column is named after the featurizer's output column)
pipeline.Add(new WordEmbeddings(("FeaturesA_TransformedText", "FeaturesB")));
// Combine the features from word embeddings and text featurizer into one column
pipeline.Add(new ColumnConcatenator("Features", "FeaturesA", "FeaturesB"));
In the above example, we used the default word embeddings (SSWE: Sentiment-Specific Word Embeddings) which are helpful in sentiment tasks.
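To make the idea of a word embedding concrete, here is a minimal, self-contained C# sketch. It does not use the ML.NET API. The lookup table, its made-up 3-dimensional vectors, and the names EmbeddingSketch and MeanPool are all hypothetical, and averaging the word vectors is just one simple way to get a fixed-length feature vector; we are not claiming it is what the WordEmbeddings transform does internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EmbeddingSketch
{
    const int Dim = 3;

    // Hypothetical "pretrained" table: made-up vectors for illustration only.
    static readonly Dictionary<string, float[]> Table = new Dictionary<string, float[]>
    {
        ["great"] = new[] { 0.9f, 0.1f, 0.0f },
        ["awful"] = new[] { -0.8f, 0.2f, 0.1f },
        ["movie"] = new[] { 0.0f, 0.5f, 0.5f },
    };

    // Average the vectors of the known tokens; unknown tokens are skipped.
    // If no token is known, the result is the zero vector.
    public static float[] MeanPool(IEnumerable<string> tokens)
    {
        var sum = new float[Dim];
        int known = 0;
        foreach (var token in tokens)
        {
            if (!Table.TryGetValue(token, out var vec)) continue;
            for (int i = 0; i < Dim; i++) sum[i] += vec[i];
            known++;
        }
        if (known > 0)
            for (int i = 0; i < Dim; i++) sum[i] /= known;
        return sum;
    }

    static void Main()
    {
        string[] tokens = "great movie".Split(' ');
        float[] embeddingFeatures = MeanPool(tokens);

        // Stand-in for the features produced by another text transform.
        float[] otherFeatures = { 1f, 0f };

        // Join both feature sets into one vector, similar in spirit to
        // concatenating two feature columns into a single "Features" column.
        float[] features = otherFeatures.Concat(embeddingFeatures).ToArray();
        Console.WriteLine(string.Join(", ", features));
    }
}
```

The point of the sketch is the shape of the data: a variable-length list of tokens becomes a fixed-length numeric vector that can sit next to other features and be fed to a learner.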
SymSGD Learner for Binary Classification
SymSGD is a parallel SGD algorithm that retains the sequential semantics of SGD but offers much better performance through multithreading. SymSGD is fast and scales well on multiple cores while achieving the same accuracy as sequential SGD. It is now available in ML.NET for binary classification.
Stochastic Gradient Descent (SGD), the algorithm it builds on, is a well-known and effective method for many machine learning problems such as regression and classification tasks. However, its scalability is severely limited by its inherently sequential computation.
The SymSGD approach is applicable to any linear learner whose update rule is linear, such as binary classification and linear regression.
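Why a linear update rule matters can be shown with a small, self-contained C# sketch. It is not the ML.NET implementation and does not use the ML.NET API; the class name SymSgdSketch, the helper names, the toy data, and the learning rate are all assumptions made for illustration. For linear regression with squared loss, one SGD step is w ← (I − α·x·xᵀ)·w + α·y·x, which is affine in w. A worker can therefore process its chunk from a stale starting point while also tracking the matrix M (the product of the (I − α·x·xᵀ) factors), and its result can later be corrected for the start it should have had.

```csharp
using System;

static class SymSgdSketch
{
    const double LearningRate = 0.1; // assumed value, chosen for the toy data

    static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // One SGD step for squared loss on a linear model: w <- w - lr * (w.x - y) * x.
    static void Step(double[] w, double[] x, double y)
    {
        double error = Dot(w, x) - y;
        for (int i = 0; i < w.Length; i++) w[i] -= LearningRate * error * x[i];
    }

    // Runs SGD in place over rows [start, end) and returns the combiner matrix M,
    // the product of (I - lr * x * x^T) over those rows.
    static double[,] RunChunk(double[] w, double[][] xs, double[] ys, int start, int end)
    {
        int d = w.Length;
        var m = new double[d, d];
        for (int i = 0; i < d; i++) m[i, i] = 1.0;

        for (int r = start; r < end; r++)
        {
            Step(w, xs[r], ys[r]);

            // M <- (I - lr * x * x^T) * M, computed as M -= lr * x * (x^T M).
            var xtM = new double[d];
            for (int j = 0; j < d; j++)
                for (int i = 0; i < d; i++) xtM[j] += xs[r][i] * m[i, j];
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++) m[i, j] -= LearningRate * xs[r][i] * xtM[j];
        }
        return m;
    }

    static void Main()
    {
        double[][] xs = { new[] { 1.0, 2.0 }, new[] { 0.5, -1.0 }, new[] { -1.0, 0.5 }, new[] { 2.0, 1.0 } };
        double[] ys = { 1.0, 0.0, -1.0, 2.0 };
        double[] w0 = { 0.0, 0.0 };

        // Reference: plain sequential SGD over all four rows.
        var sequential = (double[])w0.Clone();
        RunChunk(sequential, xs, ys, 0, 4);

        // Two workers both start from w0, so they could run in parallel.
        var w1 = (double[])w0.Clone();
        RunChunk(w1, xs, ys, 0, 2);
        var w2 = (double[])w0.Clone();
        double[,] m2 = RunChunk(w2, xs, ys, 2, 4);

        // Combine: correct worker 2 for the start it should have had (w1, not w0).
        var combined = new double[w0.Length];
        for (int i = 0; i < combined.Length; i++)
        {
            combined[i] = w2[i];
            for (int j = 0; j < combined.Length; j++) combined[i] += m2[i, j] * (w1[j] - w0[j]);
        }

        Console.WriteLine($"sequential: {string.Join(", ", sequential)}");
        Console.WriteLine($"combined:   {string.Join(", ", combined)}");
    }
}
```

The two printed weight vectors should agree up to floating-point rounding, which is the sense in which the sequential semantics are retained. Keeping a full d×d matrix is only practical for tiny feature counts; this sketch ignores that cost entirely.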
Here’s how you add a SymSGD Binary Classifier learner to the pipeline:
pipeline.Add(new SymSgdBinaryClassifier() { NumberOfThreads = 1 });
For additional sample code using SymSGD, check here.
The current implementation in ML.NET does not have multi-threading enabled yet (the issue is tracked by #655). However, SymSGD can still be helpful in scenarios where you want to try many different learners and limit each of them to a single thread.
Improvements to F# API and samples for ML.NET
Don Syme has been pioneering the work on improving the overall F# story for ML.NET. As Isaac’s issue pointed out, ML.NET did not support F# records. Work here is still ongoing, but with the 0.4 release ML.NET allows the use of property-based row classes in F#. You can learn more about Don’s work as a part of this PR.
As a part of this change we have also updated the dot.net machine learning samples repo to add a language pivot for ‘fsharp’, porting over the existing samples to work for F# as well. We would love for you to try them out and contribute more!
Help shape ML.NET for your needs
If you haven’t already, try out ML.NET; you can get started here. We look forward to your feedback and welcome you to file issues with any suggestions or enhancements in the GitHub repo.
https://github.com/dotnet/machinelearning
This blog was authored by Cesar de la Torre, Gal Oshri and Ankit Asthana
Thanks,
ML.NET Team