It always takes at least a few steps to turn an idea into reality. Sometimes, specialists can rely on wireframes or models to show that an idea is relevant and potentially valuable. But depending only on wireframes is risky. That’s why we also build POCs.
As the name suggests, a Proof of Concept (POC) is a minimal working version, or at least a working part, of a digital product. A POC demonstrates the feasibility of your idea.
“It’s not intended to explore market demand for the idea, nor is it intended to determine the best production process. Rather, its focus is to test whether the idea is viable.” – TechTarget
In this article, we’ll be covering some of the best tools for getting started when you want to create a machine learning-powered tool, or just want to analyze data for any identified solution. We’ll cover what happens before your machine learning solution is deployed, and how to prove your idea.
Call for POC
A POC is a popular way to evaluate services and products, and to validate whether certain functionalities or requirements are viable. Building a POC is a way to validate things like scalability, technical potential, and more.
In a POC, you’ll be introducing the key functionality of your final product, but on a smaller scale. Creating POCs before the development life cycle is an opportunity to validate your product internally and get feedback early on. This way, you can mitigate risk during later development.
POCs should be built in alignment with long-term requirements as well, i.e. showcasing value across teams and platforms, now and in the future.
Data Science/Machine Learning POCs
What constitutes a POC in the case of data science and machine learning solutions?
Creating a POC for a data science solution is different from conventional software, as there’s more to investigate before building it. Unlike, say, a web app POC, an ML POC can’t just focus on one narrow aspect. The POC model should already be trained well enough to work with unseen data.
Machine learning is a broad field, and it provides solutions to many problems. But before pushing any machine learning solution to production, we should evaluate business value and scalability through a POC.
Take self-driving car projects: a driverless car can’t just be unleashed on the road without considering security and safety measures during the POC phase. It’s important to consider future disruptions and enhancements within the POC in order to deploy a successful solution.
Steps to build a Data Science/Machine Learning POC
A POC plays an important role before deploying any machine learning solution. While creating one, you have to think about the business value and the larger purpose of the POC, as these will shape the solution in different ways.
Considering the challenges, and that not every POC ends up in production, we need to lay out a plan for creating a successful, value-adding POC. Here are some steps you can follow for the assessment process:
- Assess the Business Value
When you start working on a POC, you need to define the business value. How will it increase profits? How will it make your process more effective? In some cases, we can also consider collecting customer feedback on the existing process, and from there start thinking about how your idea can improve the experience.
- Capture required data
Now that you know how your solution will increase business value, do you have the data you need to build it? Do you have enough of it, and does it need cleaning and processing?
- Feasibility of implementation
Bring in an expert to get your model complexity checked. A model created for one solution might not provide the same efficiency for another type of data. There can be infrastructural gaps that need to be addressed.
- Defining time frame
It’s very important to define a time frame after consulting the business, managers, and clients. A POC is not a complete ML project, so you can’t invest too much time covering every aspect, otherwise it will end up overcomplicated.
- Assembling team
By this step, you know what you want from your POC, and what you need to implement it. Now it’s time to get teammates with the right skills. For example, data scientists/analysts who have done some modeling in Python won’t necessarily build an infrastructure ready to handle the required POC. Get people with different skill sets, and try to follow an agile process for the project, so that your team will be more efficient.
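The assessment steps above can be sketched as an explicit checklist. This is a minimal illustration only: the step names and their status values below are assumptions for the example, not part of any standard framework.

```python
# A minimal sketch of the POC assessment steps as an explicit checklist.
# The step names and status values are illustrative assumptions.

def poc_readiness(checklist: dict) -> list:
    """Return the assessment steps that still need attention."""
    return [step for step, done in checklist.items() if not done]

checklist = {
    "business value defined": True,
    "required data captured": True,
    "implementation feasibility reviewed": False,
    "time frame agreed with stakeholders": True,
    "team with the right skills assembled": False,
}

open_items = poc_readiness(checklist)
print(open_items)  # the steps still open before the POC can start
```

Keeping the plan explicit like this makes it easy to review with stakeholders before any modeling work begins.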
Now that we have a plan for creating a POC, let’s find out which tools can help us build a successful POC within a defined time frame.
Tools for Building Machine Learning POCs
TL;DR – quick tool comparison
Before we dive into the details of the tools, you can check out this quick comparison to give you an overview of the tools.
| Tool | Supported frameworks and languages | Pricing |
| --- | --- | --- |
| AzureML | TensorFlow, scikit-learn, SparkML, Python, R | $9.99 per ML studio workspace per month; $1 per studio experimentation hour; free trial available |
| CloudML | TensorFlow, Keras, PyTorch, scikit-learn, XGBoost, Python, R | Per hour, depends on the tier: BASIC $0.19, STANDARD $1.98; free trial available |
| AWS SageMaker | Apache MXNet, Apache Spark, Chainer, Hugging Face, PyTorch, scikit-learn, SparkML Serving, TensorFlow, Python | Per hour, depends on the instance type: ml.m4.2xlarge $0.56, ml.m4.4xlarge $1.12 |
| Kaggle | TensorFlow, PyTorch, Keras, Microsoft Cognitive Toolkit, Python, R | Free |
| Jupyter | TensorFlow, Spark, scikit-learn, ggplot2, Python, R, Ruby, Scala, Go, Julia | Free |
| Google Colab | Keras, PyTorch, MXNet, OpenCV, XGBoost, Python | Free |
| Dataiku | Spark, Python, R, SQL | Free trial and free version available; more features from $0.01 per year |
For a given business problem, when you start with a POC, you will assess both technical and financial aspects: processors, programming languages, and machine learning models, but also cost efficiency, viability, and more.
Now we’ll go through every tool in detail, and see how you and your team can use them to create POCs and deploy machine learning models.
1. Dataiku

“Dataiku is the platform democratizing access to data and enabling enterprises to build their own path to AI in a human-centric way.”
Dataiku offers different capabilities – data preparation, visualization, machine learning, DataOps, MLOps, and more. The idea is to let users focus on the requirements while offering the latest technologies to deliver them. Dataiku integrates with many tools like Python, R, Scala, Hive, and more. You can use any of these to build the solution, and Dataiku will seamlessly integrate the results in further steps.
When it comes to machine learning, Dataiku AutoML has many pre-built machine learning models, statistical functionalities, large-dataset training capabilities with Spark, and more. This can help you create machine learning solutions easily. Dataiku is a good candidate for creating a POC because it supports a wide range of machine learning areas and tools. But it doesn’t limit you to pre-built tools: it also provides notebook support, where you can write your own code and experiment.
Dataiku was also listed as a leader in Data Science and Machine-Learning Platforms in the Gartner 2021 Magic Quadrant report.
- Collaborative tool for multiple users to work simultaneously.
- Integrated with many different programming languages, giving you flexibility to work on anything without worrying about learning a new language.
- Simple interface with good data mining tools. Any non-technical person can easily work with the tool.
- Users can code to perform data analysis using their own ML models when they have any specific requirements.
- The free version of this tool has some limitations, for example you can perform data analysis on only 30k rows.
- Dataiku doesn’t support any deep learning capabilities.
2. AWS – SageMaker
“Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.”
Amazon Web Services (AWS) has been on the market for a long time, and many enterprises have opted for AWS cloud services. Using AWS’s machine learning services brings the same benefits: CI/CD services for ML, monitoring support, and more.
SageMaker has pre-built tools for each step in ML development. Once the machine learning model or tool is deployed, it has tools (Kubernetes, Edge Manager, Model Monitor, etc.) to manage and monitor it. You can easily start your ML workflow by just clicking and selecting the specifics, and then you can deploy anything, from predictive analytics, through computer vision, to churn prediction.
It’s the first fully integrated development environment for machine learning, and users can deploy ML tools at scale. You can use any framework of your choice to experiment with and customize machine learning algorithms. With these available tools and frameworks, creating a POC for an ML solution will be much easier than doing it manually.
SageMaker has combined all four phases of creating ML powered tools, as shown below:
According to AWS, SageMaker improves data scientists’ productivity by up to 10 times, around 89% of deep learning projects in the cloud run on AWS, and it can reduce training costs by up to 90%.
- Extensive collection of built-in machine learning models, you can focus on creating a product rather than tuning and improving models manually.
- Integration with other AWS services, making it a one-stop-shop solution for your ML projects.
- Has a wide collection of tools, but also lets users create their own models.
- Supports TensorFlow, MXNet, PyTorch, and other machine learning and deep learning frameworks.
- Autopilot mode, where it interprets your data and chooses the best-performing model after assessing all the available built-in models. In this mode, you can create a tool without writing a single line of code.
- It’s expensive.
- It supports most of the frameworks and their libraries, but not always the updated versions.
- Data can only be read from and written to Amazon S3.
3. Azure Machine Learning
“Azure ML empowers data scientists and developers with a wide range of productive experiences to build, train and deploy machine learning models and foster team collaboration. “
Azure machine learning service includes tools from classical ML to deep learning. It provides best-in-class support for open-source frameworks and languages (MLflow, Kubeflow, ONNX, PyTorch, TensorFlow, Python, R, etc). Just like AWS, Azure also comes with advanced security, governance, and a hybrid infrastructure.
It boosts your productivity with pre-built tools, but also helps you create effective models by supporting the latest tools. It has machine learning assisted labeling, so you can prepare data quickly. When it comes to processing complicated machine learning solutions, Azure has an auto-scaling feature in place, so you can share CPU and GPU clusters.
When you click on the ‘Notebooks’ option, you’ll be able to launch Jupyter Notebooks from Azure:
Azure provides a drag-and-drop interface, or you can use Azure notebooks to create your own experiments. With drag-and-drop, you just need to create a step-by-step flow, and all the dirty work is done by Azure’s ML tools.
With the help of AutoML, you can identify algorithms and hyperparameters, and track experiments in the cloud.
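Conceptually, what AutoML automates is a search over candidate algorithms and hyperparameters, keeping whichever configuration scores best. Here is a minimal, framework-free sketch of that idea; the parameter names and the toy scoring function are illustrative assumptions, not Azure’s actual search strategy.

```python
# A minimal sketch of what AutoML automates: try every combination of
# candidate hyperparameters and keep the best-scoring one.
# The parameter grid and scoring function are illustrative assumptions.
from itertools import product

param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
}

def evaluate(params: dict) -> float:
    # Stand-in for training a model and measuring validation accuracy.
    return params["learning_rate"] * 10 - params["max_depth"] * 0.1

# Enumerate every combination in the grid and pick the best score.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
best = max(candidates, key=evaluate)
print(best)  # → {'learning_rate': 0.1, 'max_depth': 3}
```

Real AutoML services are far smarter than exhaustive grid search (they use techniques like Bayesian optimization), but the input and output are the same: a search space in, a best configuration out.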
When you’re ready to create a machine learning POC but have a minimal budget, Azure can be a good choice. You can get instant access and some credit by signing up for a free Azure account. Azure Machine Learning allocates compute resources with workspace- and resource-level quota limits, which keeps it cost-efficient. It also accelerates productivity with built-in integration with Microsoft Power BI and other Azure services.
- Improved and proven models from the Microsoft Research team, Bing, and more, which can help you utilize the best solutions.
- Unlike many other tools, it supports R, making it great for any data analysts who use R because of its simplicity and processing time.
- Cheaper compared to other tools. The free tier/free trials can be useful for building POCs
- Cortana Intelligence Gallery, where users can find and exchange knowledge.
- Built-in cognitive APIs help you improve the accuracy of your models.
- Saves you time and money, but sometimes has performance issues; for example, if a dataset is too complex or too small, model accuracy suffers.
- Signing up for Azure, users get free-tier offers, but these include only a limited number of training hours and limited functionality.
4. Google Colab
“With Colab you can import an image dataset, train an image classifier on it, and evaluate the model, all in just a few lines of code. Colab notebooks execute code on Google’s cloud servers, meaning you can leverage the power of Google hardware, including GPUs and TPUs, regardless of the power of your machine. All you need is a browser.”
Colaboratory supports Python, and code can be executed in the browser with zero configuration. You can easily share your code, because when you create your own Colab notebooks, they’re stored in your Google Drive. Colab notebooks are Jupyter notebooks hosted by Colab.
Just go to https://colab.research.google.com/ and you can get started – it’s that simple.
Most libraries are pre-installed, so you can just start working. This is a good option when you want to build an ML POC without configuring anything on your system, and still want access to powerful processors. The only issue you might face is that there are no pre-built tools – you’ll have to code on your own.
You can import data into Colab from your local machine (it will be saved to Google Drive), Google Drive, GitHub, and many other sources, and all your work is saved in the cloud, accessible from any device with an internet connection. It also lets multiple developers co-code – they can review, add comments, assign tasks, and more.
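Since Colab gives you a blank notebook rather than pre-built models, even a small POC starts from code like this: a least-squares line fit written here with the standard library only, so it runs anywhere (in practice you would use NumPy or scikit-learn, which come pre-installed in Colab).

```python
# Ordinary least squares for y = slope * x + intercept, standard library only.
# The toy data points are illustrative.

def fit_line(xs, ys):
    """Fit a straight line to (xs, ys) by minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, intercept)  # → 2.0 0.0
```

A baseline this simple is often enough for the first iteration of a POC: it gives you a number to beat before you invest in heavier models.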
- It’s free, and doesn’t charge you for using GPUs or TPUs.
- Includes a wide range of Python libraries, it’s rare that you have to install any additional Python library.
- All the code is saved automatically to Google Drive.
- Notebooks, dataset files, etc. get saved on Google Drive, so if you’re consuming beyond 15GB, you’ll have to pay for additional space.
- In every new session, you have to reinstall any libraries that aren’t included in the standard environment.
5. Kaggle Kernels
Kaggle is owned by Google, and Kaggle Kernels is a free platform for running Jupyter notebooks in the browser. Kaggle also has a broad supply of real-life data. While creating POCs, if you can’t get data from your client or organization, you can explore Kaggle datasets and find what you need. It’s quite like Google Colab: also free, accessible anywhere, and backed by powerful processors.
“Kaggle kernels contain code that helps make the entire model reproducible and enable you to invite collaborators when needed. It’s a one key solution for data science projects from code to comments, and from environment variables to required input files.”
A kernel is a notebook or a script; using containerization, contributors can set up their Kaggle projects. You don’t have to download data, as it’s already mounted in the container.
Kaggle Kernels is also a good option if you’re looking to create a POC for an ML-powered application, as code can be shared easily. You don’t have to set up anything on your system, and results appear inline with your code. You also control sharing: you can keep your code private if you don’t want to reveal it.
- Kaggle has a huge collection of datasets, you can use any dataset for learning or creating POCs without any hassle.
- You can easily upload your own datasets to Kaggle, and use them in your notebooks for analysis.
- Kaggle creates commit history, which can be useful while comparing or recovering changes.
- Kaggle has a huge community and active support.
- Kaggle Kernel sessions are limited to 6 hours of execution time.
- Many users experience lag while executing kernels/ML code.
6. Jupyter

“Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.”
Currently Jupyter has four different capabilities:
- Jupyter Notebook
“The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”
Most cloud notebooks are based on Jupyter Notebook, or at least on the idea of it. The difference is that Jupyter Notebook must be installed on your system. Jupyter is also accessible through the browser, and you can see the results of your code and computations inline. You can create a shareable link for your Jupyter notebook using nbviewer.
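One reason notebooks are so easy to share is that a .ipynb file is plain JSON. This sketch builds a minimal notebook with one code cell using only the standard library; it covers a small, assumed subset of the nbformat 4 schema (real tooling would use the nbformat package, which validates the full schema).

```python
# A Jupyter notebook is a JSON document. This builds a minimal one-cell
# notebook by hand; field names follow the nbformat 4 schema (simplified).
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello from a POC notebook')"],
        }
    ],
}

# This string is what you would write to disk as my_poc.ipynb.
serialized = json.dumps(notebook, indent=1)
print(serialized[:60])
```

Because the format is just JSON, notebooks diff, version, and render well across tools – which is exactly what makes them convenient POC artifacts.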
- JupyterLab

“JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data.”
Jupyter Lab has a modular structure where you can organize multiple notebooks. JupyterLab supports many file formats (images, CSV, JSON, Markdown, PDF, Vega, Vega-Lite, etc) and displays rich kernel output in these formats. It’s an upgrade of Jupyter notebook, and provides similar functionalities in a more organized manner.
- Jupyter Hub
“Jupyter Hub is a multi-user version of the notebook designed for companies, classrooms and research labs.”
Jupyter Hub enriches the notebook functionality and centralizes data and code for a group of users. It’s customizable, scalable, and suitable for small and large teams, academic courses, and large-scale infrastructure.
- Voilà

“Voilà helps you communicate insights, by transforming a Jupyter Notebook into a stand-alone web application you can share.”
Voilà is useful for presenting your results to someone without sharing the code, which can enhance the reading experience. It’s fast, highly extensible, and flexible, and you can share results in no time.
Looking at the capabilities Jupyter has to offer, it can be the right tool for creating your POC. It’s a one-stop solution to create ML-powered applications, as it offers a notebook, IDE, a dashboard tool, and it’s an open-source tool.
- Jupyter can be installed on your local systems, so unlike cloud services, you don’t always have to depend on internet connectivity.
- Once you have installed your Python packages, you won’t have to install them again.
- It supports Python, R, Ruby, Scala, Go and Julia.
- Different capabilities that you can use on-demand.
- Notebooks can be saved into many different formats – HTML, Markdown, PDF, and more.
- Documenting your code is easier in Jupyter with the cell-based approach.
- Jupyter notebooks often fail due to memory errors.
- No IDE support makes this tool difficult to use when you’re collaborating with a team.
- Running asynchronous tasks is difficult.
Jupyter’s benefits outweigh its drawbacks, which is why most data scientists, analysts and ML practitioners prefer using Jupyter.
7. Cloud AutoML
“Cloud AutoML trains high-quality custom machine learning models with minimal effort and machine learning expertise.”
With Cloud AutoML, developers with limited machine learning expertise can train high-quality models specific to their business needs, building custom machine learning models in minutes with an easy-to-use GUI.
Google has segregated its AutoML capabilities into four different categories:
- AI Platform

Using AI Platform with AutoML, users can prepare data, build models, validate and fine-tune them, and deploy them at scale.
- Sight

Using the AutoML Vision REST APIs, developers can build tools for object detection and classify images with custom labels. Developers can also use AutoML Video Intelligence tools to annotate videos and improve the video experience for customers.
- Language

This category covers AutoML Translation and AutoML Natural Language. Developers can use both by integrating the APIs into their projects, for example for sentiment analysis, or for embedding translation into an app or website.
- Structured data
Using AutoML Tables, developers can build and deploy state-of-the-art machine learning models on structured data.
These AutoML REST APIs can be easily integrated into your projects, so you won’t have to write your own object detection or translation models. This makes Cloud AutoML one of the best tools for creating POCs related to computer vision, NLP, and more.
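Integrating such an API into a POC typically means building an authenticated JSON request. The sketch below shows the general shape using only the standard library; the endpoint URL and payload fields are illustrative assumptions, not the exact Cloud AutoML request schema – in a real project you would use Google’s client libraries and their documented request format.

```python
# Sketch of building an authenticated JSON prediction request.
# The endpoint URL, token, and payload fields are illustrative placeholders.
import json
import urllib.request

def build_predict_request(endpoint: str, token: str, text: str) -> urllib.request.Request:
    """Assemble (but do not send) a POST request for a hosted prediction API."""
    payload = json.dumps({"payload": {"textSnippet": {"content": text}}}).encode()
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_predict_request(
    "https://example.com/v1/models/demo:predict",  # placeholder endpoint
    "dummy-token",
    "Great product, works as advertised.",
)
print(req.get_method(), req.full_url)
```

Keeping the request-building logic in one small function like this makes it easy to swap the placeholder endpoint for a real one once credentials and quotas are sorted out.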
- Like most cloud services, CloudML also comes with APIs for vision, speech and more, as well as pre-trained models for general purpose solutions, like fraud detection, inventory management, or call centers. This can be useful when you’re building a POC.
- It accelerates the development of new business applications, as you’ll have to do minimal coding to create ML models.
- Flexibility, quickly switch models on-demand.
- You will have access to Google’s distributed network, and will be able to use their GPUs for better processing.
- Tensorflow, PyTorch, Keras, XGBoost, Scikit-learn, and more frameworks are available on CloudML.
- It’s expensive.
- The UI can be a bit confusing for some people, especially if you’re new to the tool.
Throughout this article, we explored why building POCs for data science and machine learning is harder than for conventional software. We also described the best tools for building your own POC.
Some of these tools are free to use, others offer premium services at premium prices. While reviewing them, you might have noticed how notebooks differ from tools that come with pre-built models or an option to deploy the approved POC at scale. Choose any of these tools per your requirements, and get started with your ML-powered POCs. Good luck!