Machine learning is increasingly popular, and a growing number of companies are harnessing the power of this new technology. Yet knowledge about the teams themselves is still limited: what do they use? What do they like? Who are they?
Neptune has been built by data scientists for data scientists, no mumbo-jumbo in between. So when facing the challenge of missing knowledge, we do the best we can: gather the data. Drawing on our network of contacts and readers, many of whom are machine learning specialists themselves, we launched a poll to get answers about the most popular technologies, the nature of teams, and their daily work.
Results? Read below!
Which area of Machine Learning are you working on?
First question: what are teams working on? Is there a hidden king behind the wall of hype around other technologies? Some peaceful leader of modern machine learning, steadily growing in the sterile light of server rooms while media vendors keep chasing over-pumped ghosts?
We have divided the categories into:
- NLP (Natural Language Processing)
- Computer Vision
- Forecasting
- Tabular
- Reinforcement Learning
- Other
We found that the most prominent area among teams was NLP (Natural Language Processing), followed by Computer Vision, Forecasting, Tabular, Other, and Reinforcement Learning.
According to a 2019 report, 80% of your data will be unstructured within the next five years, and we can use this data to build machine learning models for tasks like sentiment analysis, named entity recognition, topic segmentation, text summarization, relationship extraction, you name it.
Considering the advances in NLP technology, more use cases will emerge soon.
When calling a company’s customer service, you may have heard, “This call may be recorded for quality and training purposes,” or you may have been asked to fill in a survey after purchasing a product or service from an organization.
This information can be used to understand an audience’s sentiment polarity about an organization’s product or service.
In our survey, 26 of the 31 teams confirmed that they work with NLP to deal with textual data.
Almost all the teams have worked in nearly all areas except for Reinforcement Learning.
“Reinforcement learning is an agent-based learning that enables an agent to learn in an interactive environment by trial and error.”
But that’s not really surprising, considering that reinforcement learning is still more of an academic and research area than a business one. And for a serious reason: reinforcement learning may not be suitable for a startup because the data it requires is massive and it comes with numerous technical challenges.
The OpenAI Five system has an estimated batch size of over 1 million observations. This means there are over a million state-action-reward tuples to update on in each simulation. It also shows how sample-inefficient reinforcement learning is: a lot of training data is required to reach the desired performance.
For example, AlphaGo, the first AI agent to defeat world champion Lee Sedol at Go, was trained nonstop for a few days by playing millions of games, accumulating thousands of years of knowledge in simulations that are estimated to have cost nearly $3 million in computation power alone.
AlphaGo Zero showed everyone that it is possible to build systems that can defeat even a world champion, but developing such a system is still out of reach for most startups because of the cost.
Key takeaway: work where the money is. Either in business-oriented fields or for tech giants with wagons of cash for reinforcement learning.
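To make the "trial and error" idea more concrete, here is a minimal sketch of an epsilon-greedy agent on a three-armed bandit, which is about the simplest reinforcement learning setting there is. It is an illustration only, not how OpenAI Five or AlphaGo work: the function name `run_bandit`, the win probabilities, the step count, and the exploration rate are all assumptions made up for this example.

```python
import random


def run_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: learns each arm's payoff purely by trying it."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)       # how often each arm was pulled
    estimates = [0.0] * len(win_probs)  # running mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a random arm.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: pick the arm that currently looks best.
            arm = max(range(len(win_probs)), key=lambda a: estimates[a])

        # The environment answers with a reward of 1 or 0.
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0

        # Incremental update of the running mean for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


# Hypothetical environment: three arms with hidden win probabilities.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated payoffs:", [round(e, 2) for e in estimates])
print("pulls per arm:", counts)
print("best arm found:", best_arm)
```

Even this toy problem needs thousands of interactions before the estimates settle, which hints at why agents acting in rich environments need so much more data.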
What type of models do you work with?
We asked all of the teams about the methods they use to build a model and grouped the answers into four categories.
- Deep Learning
- Boosted Trees (LightGBM, XGBoost, CatBoost)
- Linear Regression
Deep Learning models turn out to be the winner here: 14 teams use them very often.
At least 29 out of 31 teams worked with Deep Learning models at some point.
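For readers who prefer shares to raw counts, the snippet below converts the team counts quoted so far into percentages of the 31 surveyed teams. The counts come from the text; the dictionary labels are our own wording, and the 29 is the "at least" figure mentioned above.

```python
TOTAL_TEAMS = 31

# Counts quoted in the article (labels are paraphrased for this sketch).
counts = {
    "work with NLP": 26,
    "used deep learning at some point (at least)": 29,
    "use deep learning very often": 14,
}

# Share of all surveyed teams, rounded to one decimal place.
shares = {label: round(100 * n / TOTAL_TEAMS, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]}/{TOTAL_TEAMS} = {pct}%")
```

This works out to roughly 84%, 94%, and 45% of the teams respectively.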
Deep learning keeps gaining popularity thanks to its predictive accuracy when trained on massive amounts of data.
Deep learning methods are applied to social media and unstructured data to better understand clients and segment them. The financial industry is rapidly adopting deep learning to build systems for fraud detection, trading, stock prediction, etc.
The healthcare industry is leveraging the power of deep learning for drug discovery, disease diagnosis, virtual healthcare, and the list goes on and on.
So again, the teams stick to the technology that delivers results while staying versatile. Deep learning is the leading technology, and nothing seems likely to change this course.
What do you currently train your ML models on?
Training a machine learning or deep learning model can be daunting sometimes due to limited computational power. On the other hand, a smart data scientist can tweak the model to be less data-hungry and use fewer resources. So we asked all 31 teams about the infrastructure they use for training or building their predictive models.
The answers fall into four categories.
- Local Machine (Laptop or PC)
- Local Cluster
- Big Cloud (AWS, GCP, Azure)
- Dedicated ML Cloud (SageMaker, FloydHub)
The least used infrastructure is the Dedicated ML Cloud (SageMaker, FloydHub), and the most used are the local PC and Big Cloud (AWS, GCP, Azure).
That may be because training your models on dedicated ML cloud services can sometimes burn a hole in your pocket. They are cool for teams that are heavy on cash, and such teams exist, but there are not that many of them.
Except for the dedicated ML cloud (SageMaker, FloydHub), every platform, whether a local machine, Big Cloud, or a local cluster, was used for model training at some point. That is not surprising: it is common to run a prototype locally at lower cost to save on the trial-and-error process in the cloud. In the end, every second costs real cash.
We can see that the most often used infrastructure for training a predictive model is Big Cloud (Azure, AWS, GCP).
Sometimes you have limited RAM and GPU capacity for training a predictive model, and you can’t just upgrade your system on a whim to meet your needs; that’s where cloud services come into play. Or you may wish to finish your training in an hour, not in a week. Money is not the only value, after all.
What are your favorite tools/frameworks/libraries for building and managing ML models?
Data science teams tend to have their own weapons in their framework arsenal to tackle any data science problem.
Most used libraries by Data Scientists
Scikit-learn is an open-source library for everyday machine learning tasks, used by data scientists in corporations and startups alike. It is built on top of several existing Python packages such as NumPy, SciPy, and Matplotlib.
It offers a wide variety of machine learning algorithms for regression, classification, clustering, etc., along with performance metrics like MSE (Mean Squared Error) and AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve.
TensorFlow is an open-source library developed by Google for building end-to-end machine learning projects.
TensorFlow uses data flow graphs, where data (tensors) can be processed by a series of operations. TensorFlow provides easy model building, ML tools like TensorBoard, and support for ML in production.
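As a rough illustration of that workflow, the sketch below trains one classifier and one regressor on synthetic data and scores them with AUC and MSE. The datasets, model choices, and parameters are assumptions picked for brevity; they are not taken from the survey.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split

# Classification: synthetic data, logistic regression, scored with ROC AUC.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Regression: synthetic data, linear regression, scored with MSE.
Xr, yr = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.25, random_state=0
)
reg = LinearRegression().fit(Xr_train, yr_train)
mse = mean_squared_error(yr_test, reg.predict(Xr_test))

print(f"classification AUC: {auc:.3f}")
print(f"regression MSE: {mse:.1f}")
```

The same fit, predict, and score pattern carries over to most other scikit-learn estimators, which is a large part of the library's appeal.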
Most used tools by Data Scientists
Apart from the hardware and tech sides, there is also the day-to-day work done with the common tools of the trade. Where a blacksmith has his hammer and a carpenter has his chisel, a data scientist has…
Yep, here comes everybody’s pal. Jupyter Notebook is a web-based interactive environment for building machine learning models and data visualization. You can perform several other tasks like data cleaning, data transformation, statistical modeling, etc.
Weka’s motto is “Machine Learning without Programming”. Weka can build machine learning pipelines and train classifiers without you having to write a single line of code.
Weka also provides a package for deep learning called WekaDeepLearning4j, which has a GUI for building neural networks, convolutional networks, and recurrent networks directly.
Apache Spark offers a machine learning API called MLlib for tasks like classification, clustering, frequent pattern mining, recommendation systems, regression, etc.
Running the survey gave us insights into the most prominent areas, the infrastructure used for ML workflows, the most used machine learning methods, why deep learning is overpowering traditional machine learning, and the problems with reinforcement learning.
But contrary to the popular approach, this comes from data scientists for data scientists, with no overhyped babble aimed at selling anything. The sincerity and accuracy of the answers gave us an optimistic view of the data science community: we are self-aware, full of common sense, and bullshit-resistant when it comes to the tech stack and tools we use.
And that’s cool in today’s bloated and pumped-up world.