Building machine learning models can be compared to building a home. Of course, a hammer is a great tool when you encounter a nail, but it is pointless to use it when digging a hole. The same goes for machine learning model development – there is no “one tool to rule them all” but a comprehensive set of tools to use to solve a particular problem.
Machine learning is a multidisciplinary field that crosses the boundaries of maths, engineering, and software development. But that’s not all – a data scientist needs to understand not only the problem but also the domain in order to deliver a usable solution. The same goes for a builder who wishes to build a house – knowing how to put bricks together is not enough without a vision of the house and a basic understanding of its purpose.
But what makes Bob the Builder and Bob the Data Scientist similar? Both guys take advantage of their tools and machines to deliver results. The list below shows all the hammers, screwdrivers, and shovels used in – yep! – building any machine learning model.
1. TensorFlow
TensorFlow is a powerful library for numerical computation, particularly for large-scale machine learning and deep learning projects.
TensorFlow builds dataflow graphs that describe how data moves through a series of operations.
What does TensorFlow provide?
- Mathematical computations with GPU support.
- A JIT (just-in-time) compiler that optimizes computations for speed and memory usage by extracting the computation graph, optimizing it, and then running the operations.
- Automatic differentiation (autodiff), that is, automatic computation of gradients.
- Support for distributed computing.
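To make the dataflow-graph and autodiff ideas concrete, here is a minimal pure-Python sketch of reverse-mode automatic differentiation. It does not use TensorFlow and makes no claim about how TensorFlow is implemented; the `Node` class and the `add`, `mul`, and `backward` helpers are hypothetical names invented for this illustration.

```python
class Node:
    """One value in a tiny dataflow graph (hypothetical helper, not a TensorFlow class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_gradient)
        self.grad = 0.0


def add(a, b):
    # d(a + b)/da = 1 and d(a + b)/db = 1
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    # d(a * b)/da = b and d(a * b)/db = a
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(output):
    """Propagate gradients from the output back to every input."""
    # Depth-first post-order; reversed, it visits each node before its inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            # Chain rule: accumulate the contribution of every path to the parent.
            parent.grad += node.grad * local_grad


# f(x, y) = x * y + x, so df/dx = y + 1 and df/dy = x
x, y = Node(3.0), Node(4.0)
f = add(mul(x, y), x)
backward(f)
print(f.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

The graph is recorded while the expression is evaluated, and gradients are then pushed backwards through it, which is the general idea behind autodiff in deep learning libraries.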
2. PyTorch
PyTorch defines a class called Tensor that stores n-dimensional arrays and performs tensor computations with GPU support. Caffe2 has been merged into PyTorch and serves as part of its backend.
What does PyTorch provide?
- Well suited for deep learning research with flexibility and speed.
- It provides accelerated computation using GPUs.
- Simple interface
- Computational Graphs
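As a rough illustration of what "a class for storing an n-dimensional array" involves, the sketch below keeps all elements in one flat list and uses row-major strides to locate an element. `TinyTensor` is a hypothetical class written for this article; it is not `torch.Tensor` and has no GPU support.

```python
class TinyTensor:
    """Hypothetical n-dimensional array backed by one flat list (not torch.Tensor)."""

    def __init__(self, data, shape):
        size = 1
        for dim in shape:
            size *= dim
        if len(data) != size:
            raise ValueError("data length does not match shape")
        self.data = list(data)
        self.shape = tuple(shape)
        # Row-major strides: how far to jump in the flat list per index step.
        strides, step = [], 1
        for dim in reversed(self.shape):
            strides.append(step)
            step *= dim
        self.strides = tuple(reversed(strides))

    def __getitem__(self, index):
        # Map an n-dimensional index to a position in the flat storage.
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.data[offset]


t = TinyTensor(range(24), shape=(2, 3, 4))
print(t.strides)   # (12, 4, 1)
print(t[1, 2, 3])  # 23
```

Keeping shape and strides separate from the storage is what lets operations such as reshaping reuse the same underlying data.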
3. H2O
H2O provides an open-source, distributed, fast, and scalable machine learning platform that includes a wide variety of statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more.
H2O is fast because it distributes the data across clusters and stores it in a compressed columnar format.
What does H2O provide?
- High speed, thanks to distributing data in a compressed columnar format.
- A simple process for deploying machine learning models into production.
- A streamlined development process.
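The benefit of a compressed columnar layout can be sketched in a few lines. The example below pivots rows into columns and applies run-length encoding; this is a stand-in chosen for illustration, since the article does not say which compression schemes H2O actually uses, and the function names are hypothetical.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


def run_length_decode(encoded):
    """Expand [value, count] pairs back into the original column."""
    return [value for value, count in encoded for _ in range(count)]


rows = [
    {"country": "PL", "year": 2020},
    {"country": "PL", "year": 2020},
    {"country": "PL", "year": 2021},
    {"country": "US", "year": 2021},
]
columns = to_columns(rows)
compressed = {name: run_length_encode(col) for name, col in columns.items()}
print(compressed["country"])  # [['PL', 3], ['US', 1]]
```

Values within a single column tend to repeat, so storing data column by column usually compresses better than storing it row by row.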
4. Accord.NET
Accord.NET can be used for building computer vision, signal processing, and statistical applications for commercial use.
What does Accord.NET provide?
- More than 35 hypothesis tests, including one-way and two-way ANOVA tests and non-parametric tests.
- Interest and feature point detectors.
- Kernel methods for Support Vector Machines, Multi-class and multi-label machines, Least-Squares Learning, etc.
- Parametric and non-parametric estimation of more than 40 distributions.
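For readers unfamiliar with the hypothesis tests mentioned above, here is a minimal sketch of the F statistic behind a one-way ANOVA. It is written in plain Python rather than against the Accord.NET API, `one_way_anova_f` is a hypothetical name, and the sketch stops at the statistic (turning it into a p-value needs the F distribution).

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of samples."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)

    # Variation explained by differences between the group means.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    # Variation left over inside each group.
    ss_within = sum(
        (v - sum(group) / len(group)) ** 2 for group in groups for v in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(samples)
print(round(f_stat, 2))  # 13.0
```

A large F value means the group means differ by more than the spread inside each group would explain.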
5. Shogun
Shogun is an open-source machine learning library written in C++ that supports various languages such as Python, R, Scala, C#, and Ruby. It was developed by Gunnar Raetsch and Soeren Sonnenburg in 1999.
What does Shogun provide?
- A primary focus on kernel machines such as support vector machines.
- Good support for large-scale learning.
- Interfaces for Lua, Python, Java, C#, Octave, Ruby, Matlab, and R.
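The "kernel" in kernel machines is a similarity function between two data points. The sketch below computes a Gaussian (RBF) kernel and the pairwise matrix a kernel machine would work with. It is plain Python, not Shogun's API, and the function names and the `gamma` default are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, kernel):
    """Pairwise kernel values between all points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, rbf_kernel)
for row in K:
    print([round(v, 3) for v in row])
```

Nearby points get values close to 1 and distant points get values close to 0, so the matrix is symmetric with ones on the diagonal.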
6. Apache Mahout
Apache Mahout includes implementations for classification, clustering, collaborative filtering, and evolutionary programming. Most of the implementations are built on top of Apache Hadoop for scalability.
What does Apache Mahout provide?
- Support for multiple Distributed Backends (including Apache Spark)
- CPU/GPU/CUDA Acceleration
- Several distributed clustering algorithms such as K-Means, Fuzzy K-Means, Mean-Shift.
- A distributed fitness function implementation for the Watchmaker framework.
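As a reference point for the clustering algorithms listed above, here is a single-machine sketch of plain K-Means. It is not Mahout code and is not distributed; the `k_means` function, its fixed starting centroids, and the sample points are assumptions made for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Plain (non-distributed) k-means on 2-D points with given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c  # keep the old centroid if its cluster is empty
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The assignment step treats every point independently, which is why this kind of algorithm lends itself to being split across many machines.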
7. Apache SINGA
Apache SINGA is an open-source machine learning library that provides a flexible architecture for scalable distributed training.
Apache SINGA focuses on distributed deep learning by partitioning the model and data onto nodes in a cluster and parallelizing the training.
What does Apache SINGA provide?
- Improved training scalability by parallelizing the training and optimized computational cost.
- Computational graphs for optimizing the training.
- Python interface to improve usability.
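The data-partitioning half of this idea can be sketched without any cluster. In the example below, each "worker" computes a gradient on its own shard of the data and the results are averaged before the weight update. This is a plain-Python illustration, not SINGA's API; the function names are hypothetical, model partitioning is not covered, and the sketch assumes the data splits evenly across workers.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, data, num_workers, lr=0.01):
    """One training step: shard the data, compute per-worker gradients, average."""
    # Assumes len(data) is divisible by num_workers so all shards are equal in size.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # In a real cluster each shard's gradient would be computed on a different node.
    worker_grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(worker_grads) / num_workers
    return w - lr * avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4)
print(round(w, 3))  # 3.0
```

With equal-sized shards, the averaged gradient matches the gradient computed on the full dataset, so splitting the work does not change the result of each step.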
8. MLlib
MLlib, Apache Spark’s machine learning library, uses the linear algebra packages Breeze and netlib-java for optimized numerical processing. It achieves high performance for both batch and streaming data using a query optimizer and a physical execution engine.
What does MLlib provide?
- Machine learning algorithms such as classification, regression, clustering, and collaborative filtering.
- Feature extraction, transformation, dimensionality reduction.
- Interfaces in Java, Scala, Python, R, and SQL.
- Support for running on Hadoop, Apache Mesos, and Kubernetes.
9. Oryx 2
Oryx 2 uses a lambda architecture built on Apache Spark and Apache Kafka for real-time, large-scale machine learning projects. It includes techniques such as collaborative filtering, classification, regression, and clustering.
Oryx 2 is written in Java, using Apache Spark, Hadoop, Tomcat, Kafka, Zookeeper, and more.
What does Oryx 2 provide?
- A generic lambda architecture tier.
- A specialization on top that provides ML abstractions for hyperparameter selection, etc.
- End-to-end implementation of the standard ML algorithms.
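A lambda architecture combines a slow but complete batch layer with a fast speed layer, and a serving layer merges the two when answering queries. The toy sketch below shows that split using in-memory event counting. It is not Oryx 2 code: `LambdaPipeline` and its methods are hypothetical, and plain Python collections stand in for Kafka and Spark.

```python
from collections import Counter


class LambdaPipeline:
    """Hypothetical three-layer lambda architecture for counting events."""

    def __init__(self):
        self.master_log = []          # full history consumed by the batch layer
        self.batch_view = Counter()   # accurate but only refreshed periodically
        self.speed_view = Counter()   # covers events since the last batch run

    def ingest(self, event):
        self.master_log.append(event)
        self.speed_view[event] += 1   # speed layer: update immediately

    def run_batch(self):
        # Batch layer: recompute the view from the full log, then reset the speed view.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, event):
        # Serving layer: merge the batch result with recent real-time updates.
        return self.batch_view[event] + self.speed_view[event]


pipeline = LambdaPipeline()
for e in ["click", "view", "click"]:
    pipeline.ingest(e)
pipeline.run_batch()
pipeline.ingest("click")
print(pipeline.query("click"))  # 3
```

Queries stay up to date between batch runs because the speed layer covers whatever the last batch run has not yet seen.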
10. RapidMiner
RapidMiner provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. Its free version is available under the AGPL license and is limited to 1 logical processor and 10,000 data rows.
RapidMiner uses a client/server model with the server offered either on-premises or in public or private cloud infrastructure.
It has GUI-based “drag-and-drop” features that let the user build data processing workflows.
What does RapidMiner provide?
- Well suited for predictive models.
- Excellent for cleaning and preparing data for a better modeling process.
- Most of the common machine learning algorithms can be integrated easily.
- A great tool for exploring data science and machine learning with its intuitive GUI drag-and-drop features.
What are RapidMiner’s limitations?
- Data visualization can be improved.
- Fewer statistical methods.
- No support for building custom models.
The list above shows ten different tools. And that’s the key – different. There are solutions you can launch on a laptop to process a Kaggle dataset, and enterprise-class solutions that are of little use without the gargantuan datasets their scalability is designed for.
So in fact, it is not about picking a “favorite” tool but rather “the best” tool for a particular task. The list above is only a beginning – pick the first one and build your toolbox as Bob the Data Scientist would!