Companies are having difficulties with delivering and productionizing AI projects. This is painful and disappointing, and there are plenty of different solutions to problems like this. One of the major solutions is feature engineering.
You see, feature engineering is involved in three of the five general pillars of any AI / ML project:
To get your data sorted out, analyze it, and get all necessary insights from it, you need to perform proper feature engineering. If you don’t have good feature engineering in the front, you won’t get much value out of the back.
Why is feature engineering so important?
In the above chart, you can see that almost 82% of all the work done by data scientists is building, cleaning, organizing, and collecting data. This tells us why feature engineering is the most important aspect of machine learning — it takes up a lot of time, and it has a big impact.
What is feature engineering?
In the image, red dots are negative data points and blue dots are positive data points. Can we use logistic regression to separate the classes? Logistic regression learns a linear decision boundary, and no straight line separates the two classes here, so it can’t be used on the raw features. So, we square the features to make the points separable, and you can see the result on the right.
Given a point <xi1, xi2>, feature engineering transforms it into <xi1’, xi2’>. The engineered features are f1’ = f1² and f2’ = f2². The line drawn across the transformed data points is a hyperplane, and it separates positive and negative points. Finally, thanks to our feature transformation, we can do logistic regression with the obtained dataset.
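To make the squaring trick concrete, here’s a minimal Python sketch. The toy data (negatives inside a small disc, positives on a surrounding ring), the helper names, and the hand-picked line f1’ + f2’ = 2.25 are all assumptions for illustration, not the data behind the figure.

```python
import math
import random


def make_rings(n_per_class=100, seed=0):
    """Toy data (assumed): label 0 inside radius 1, label 1 between radius 2 and 3."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (r_min, r_max) in ((0, (0.0, 1.0)), (1, (2.0, 3.0))):
        for _ in range(n_per_class):
            r = rng.uniform(r_min, r_max)
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(theta), r * math.sin(theta)))
            labels.append(label)
    return points, labels


def square_features(points):
    """Feature transform: (f1, f2) -> (f1^2, f2^2)."""
    return [(f1 ** 2, f2 ** 2) for f1, f2 in points]


def linear_rule(point, w=(1.0, 1.0), b=-2.25):
    """Predict 1 when w . x + b > 0, which is a single straight line in 2D."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else 0


def accuracy(points, labels):
    """Share of points that the hand-picked line classifies correctly."""
    hits = sum(linear_rule(p) == y for p, y in zip(points, labels))
    return hits / len(labels)


points, labels = make_rings()
accuracy_raw = accuracy(points, labels)                       # the line on raw features
accuracy_squared = accuracy(square_features(points), labels)  # same line on squared features
print(accuracy_raw, accuracy_squared)
```

On the squared features, f1’ + f2’ equals the squared radius, so one straight line splits the two classes. In practice you would fit logistic regression on the squared features instead of picking the line by hand.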
So, in simple terms, feature engineering maps the data from the space of f1 and f2 to the space of f1’ and f2’, where the data is linearly separable. If you ask me, I’d say feature engineering is where the art of machine learning happens. If I were to pick the top techniques for applied machine learning, I’d say they’re:
- Feature Engineering,
- Bias-variance tradeoff,
- Data analysis and visualization.
Which transform type should you apply?
One of the big questions of feature engineering is how do you know which transform to apply?
There are tons of data types: categorical, images, graphs, time series; the list goes on. Researchers have spent decades finding the best ways to convert features into numbers. We’re going to explore a few generalized feature engineering techniques.
A very popular method to represent time series data is Fourier decomposition. The Fourier transform is studied extensively in physics, applied mathematics, signal processing, and other areas of science.
Let’s go back to school for a second, and talk about wave properties. Waves have:
- Amplitude – Amplitude is the maximum distance that particles move from their resting positions when a wave passes through.
- Time period – A time period (denoted by ‘T’ ) is the time it takes for one complete cycle of vibration to pass a given point.
- Frequency – Wave frequency is the number of waves that pass a fixed point in a given amount of time.
Any repeating time series can be expressed as a sum of sine waves, each with its own frequency and amplitude. In the graph below, we’ve converted time series data into the frequency domain using the Fourier transform. Amplitude is on the y-axis, and frequency is on the x-axis.
In a machine learning model, all you want is a feature vector. Given time series data, how do you get the feature vector? Compute the Fourier transform, and we get different frequencies and corresponding amplitudes like (f1,a1), (f2, a2)…
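Here’s a rough sketch of how the (frequency, amplitude) pairs could become a feature vector. It uses a naive discrete Fourier transform from the standard library to stay self-contained; the function names, the toy signal, and the choice to keep only the two strongest frequencies are assumptions for illustration.

```python
import cmath
import math


def dft_amplitudes(signal):
    """Return (frequency, amplitude) pairs for frequencies 1 .. n/2 - 1.

    Frequency is counted in cycles per window; naive O(n^2) DFT for clarity.
    """
    n = len(signal)
    pairs = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        pairs.append((k, 2.0 * abs(coeff) / n))
    return pairs


def fourier_features(signal, top_k=2):
    """Flatten the top_k strongest (frequency, amplitude) pairs into one vector."""
    strongest = sorted(dft_amplitudes(signal), key=lambda p: p[1], reverse=True)[:top_k]
    return [value for pair in strongest for value in pair]


# Hypothetical series: two sine waves with 3 and 7 cycles per window.
n = 64
signal = [2.0 * math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n)
          for t in range(n)]
features = fourier_features(signal)
print(features)  # approximately [3, 2.0, 7, 0.5]
```

For real workloads a fast Fourier transform such as `numpy.fft.rfft` is the usual choice; the naive version above only shows the idea.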
The Fourier transform can be applied wherever there is a repeating pattern or, simply put, it can be used to check whether there is any pattern at all. And finding patterns is exactly what you, as a data scientist, are trying to do.
Converting a time series to the frequency domain is useful for repeating patterns (daily sales on e-commerce sites, heart rate, etc.). These features are domain-specific, but machine learning engineers don’t need deep domain knowledge themselves; they can collaborate with domain experts to design the features. For example, when working on heartbeat data, ML engineers will collaborate with doctors and signal processing experts to design special features.
There are a few rudimentary featurization techniques in image processing. Lots of image content can be processed using machine learning, like faces, objects in object detection, X-rays, scans, even vehicles in autonomous driving.
Plus, there are many types of images, and over 30 years of research have gone into image featurization. When it comes to histograms, there are two main types:
- Color histograms,
- Edge histograms.
Every image is made of pixels. Each pixel has three values, Red, Green, and Blue (RGB), each ranging from 0 to 255.
In the color histogram, we take all the red values for each pixel. If the image contains n rows and m columns of pixels, we’ll get n * m red values — for every pixel there will be a red value. Now plot the histogram of all the values ranging from 0 to 255. A histogram basically tells you how often the value occurs.
Each histogram can be converted into a vector, as shown above. For example, let’s assume the value 2 accounts for a fraction of 0.3 of the pixels; then the vector entry for value 2 is 0.3. Each histogram covers the values from 0 to 255, and with one histogram per color channel, we have converted the image into 3 vectors.
Let us assume our task is to detect the sky in a given picture. If the sky is present in an image, there will be a lot of blue pixels, so the blue vector will have more weight at high values than the red vector. Intuitively, it makes sense to represent the image with color histograms. We can concatenate the three vectors to form xi, with the corresponding label as yi.
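The color histogram featurization described above can be sketched in a few lines of Python. The nested-list image format, the function names, and the tiny 2x2 example image are assumptions for illustration.

```python
def channel_histogram(pixels, channel):
    """Normalized 256-bin histogram of one color channel (0=R, 1=G, 2=B)."""
    counts = [0] * 256
    for pixel in pixels:
        counts[pixel[channel]] += 1
    total = len(pixels)
    return [count / total for count in counts]


def color_histogram_features(image):
    """Concatenate the R, G, and B histograms into one 768-value vector."""
    pixels = [pixel for row in image for pixel in row]
    features = []
    for channel in range(3):
        features.extend(channel_histogram(pixels, channel))
    return features


# Hypothetical 2x2 image: three sky-blue pixels and one white pixel.
sky_blue, white = (50, 100, 250), (255, 255, 255)
image = [[sky_blue, sky_blue],
         [sky_blue, white]]
x_i = color_histogram_features(image)
print(len(x_i))            # 768
print(x_i[2 * 256 + 250])  # 0.75 (share of pixels with blue value 250)
```

Normalizing by the number of pixels keeps images of different sizes comparable, which is one reasonable design choice among several.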
This technique has its flaws. For example, you can’t detect object shapes with color histograms.
So, what about edge histograms? They’re quite interesting. For example:
Assume an image with a red circle. Image processing offers many algorithms for detecting edges. If it’s a red circle on a white background, the edge is the border between these two colors. Each edge has a certain angle at which the color difference occurs. If we break up the image into a grid, we can record the edge angle in each region. In our circle image, if we divide the image into 4 parts, we’ll get angles of 45, 135, 225, and 315 degrees. When a region contains no edge, we give it a value of -1.
For each region, we’ll get an edge-value/edge-angle, and we can plot the histogram with this. Now, we have a histogram that we can convert into a vector, just the way we converted the color histogram.
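Below is a simplified sketch of an edge histogram. It is not one of the standard edge detectors: it assumes a grayscale image stored as a nested list, summarizes each grid cell by the direction of its summed intensity gradient, and uses -1 for cells without an edge. The function names, the threshold, and the 8 angle bins are illustrative choices.

```python
import math


def cell_edge_angle(image, r0, r1, c0, c1, threshold=1.0):
    """Direction (degrees, 0-360) of the summed gradient in a cell, or -1 if no edge."""
    rows, cols = len(image), len(image[0])
    gx_sum = gy_sum = 0.0
    for r in range(r0, r1):
        for c in range(c0, c1):
            # Central differences, clamped at the image border.
            gx_sum += image[r][min(c + 1, cols - 1)] - image[r][max(c - 1, 0)]
            gy_sum += image[min(r + 1, rows - 1)][c] - image[max(r - 1, 0)][c]
    if math.hypot(gx_sum, gy_sum) < threshold:
        return -1
    return math.degrees(math.atan2(gy_sum, gx_sum)) % 360.0


def edge_histogram(image, grid=2, n_bins=8):
    """Count cells per angle bin; the last entry counts cells with no edge."""
    rows, cols = len(image), len(image[0])
    hist = [0] * (n_bins + 1)
    for i in range(grid):
        for j in range(grid):
            angle = cell_edge_angle(image,
                                    i * rows // grid, (i + 1) * rows // grid,
                                    j * cols // grid, (j + 1) * cols // grid)
            if angle < 0:
                hist[n_bins] += 1
            else:
                hist[int(angle // (360.0 / n_bins)) % n_bins] += 1
    return hist


# Hypothetical 8x8 images: a dark-to-bright vertical step, and a flat gray image.
step_image = [[0] * 4 + [255] * 4 for _ in range(8)]
flat_image = [[128] * 8 for _ in range(8)]
step_hist = edge_histogram(step_image)
flat_hist = edge_histogram(flat_image)
print(step_hist)  # [4, 0, 0, 0, 0, 0, 0, 0, 0]
print(flat_hist)  # [0, 0, 0, 0, 0, 0, 0, 0, 4]
```

Summing gradients is crude (opposite edges inside one cell can cancel out), so treat this as a way to see the shape of the feature vector, not as a production detector.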
Relational data and featurization
Relational data is data stored in tables in databases. For example, a table with customer data like this:
Data stored in this format is called relational data. Typically, relational data is stored in relational databases; Oracle, MySQL, and PostgreSQL are popular choices.
Let’s say our task is to predict customer purchases within the next seven days. So, based on the image above, you’re given a customer ID and a product ID, and the response variable is 1 or 0. 1 means the customer will buy the product, and 0 means the customer will not buy the product.
We need to turn each customer ID and product ID pair into a feature vector. This vector will be Xi and the response variable will be Yi. Now, we need to use domain knowledge or common sense:
- A good feature would be the number of times a customer visits a particular product page within 24 hours. If these visits are recorded in the Customer visit table and the count is high, there’s a good chance that they might buy the product.
- Another interesting feature would be the customer visiting any product in the same category. For example, the customer wants a RAM stick and visits many different companies’ RAM pages.
- Other interesting features could be the ZIP code where people live, and the standard of living in that area.
All the features above are task-specific. In general, in relational data, we use SQL to make features from the data and domain knowledge.
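As a sketch of the first two features above, here’s how they might be computed with SQL from Python’s built-in `sqlite3` module. The `customer_visits` table, its columns, the sample rows, and the `visit_features` helper are all hypothetical; a real schema will differ.

```python
import sqlite3

# Hypothetical schema: one row per product page visit.
FEATURE_QUERY = """
SELECT
    COALESCE(SUM(CASE WHEN product_id = :product_id THEN 1 ELSE 0 END), 0),
    COALESCE(SUM(CASE WHEN category = :category THEN 1 ELSE 0 END), 0)
FROM customer_visits
WHERE customer_id = :customer_id
  AND visited_at >= datetime(:now, '-24 hours')
  AND visited_at <= :now
"""


def visit_features(conn, customer_id, product_id, category, now):
    """Return [product visits, same-category visits] in the 24 hours before `now`."""
    params = {"customer_id": customer_id, "product_id": product_id,
              "category": category, "now": now}
    return list(conn.execute(FEATURE_QUERY, params).fetchone())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_visits ("
             "customer_id INTEGER, product_id TEXT, category TEXT, visited_at TEXT)")
conn.executemany(
    "INSERT INTO customer_visits VALUES (?, ?, ?, ?)",
    [
        (1, "ram_a", "ram", "2021-01-10 09:00:00"),
        (1, "ram_a", "ram", "2021-01-10 10:30:00"),
        (1, "ram_b", "ram", "2021-01-10 11:00:00"),
        (1, "ram_a", "ram", "2021-01-08 08:00:00"),  # older than 24 hours
        (2, "ssd_a", "ssd", "2021-01-10 11:30:00"),
    ],
)

x_i = visit_features(conn, 1, "ram_a", "ram", now="2021-01-10 12:00:00")
print(x_i)  # [2, 3]
```

Passing `now` explicitly keeps the features reproducible when you build a training set from historical data.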
Graph data and featurization
Featurization in graph data is very domain-specific. Let’s take a simple graph, like this one of a basic social network:
Here, each user Ui is a vertex, and an edge UiUj connects two vertices. Let’s say that an edge corresponds to friendship. So, U1 and U2 are friends, U1 and U7 are friends, but U3 and U7 are not friends.
Let’s assume that our business problem is to recommend new friends to each user. One way to pose this problem is as a prediction task: given a pair of users Ui and Uj, the label is 1 if they’re friends and 0 if they’re not.
How do we recommend a friend for U4 in the above graph? First, take a look at mutual friends. For user U4, U1 and U2 are already friends. U7 is not a friend of U4 but could be.
Similarly, U3 and U4 could also be connected. Our task is to predict the probability of U3 and U7 being friends with U4. Generally, what you do is find common friends. For U7 in the above graph, there’s only one common friend, and for U3 we have two mutual friends (U1 and U2).
Based on this, we can create a feature F1 that counts the mutual friends of a pair. By common sense, the bigger the number of mutual friends, the higher the probability that the pair could be friends.
The second interesting feature can be the path between vertices. Path means all the ways we can reach U3 from U4. Any sequence of vertices from one user to another user is considered a path.
We can design features like this which are called graph-theoretic/graph-based features. This is strictly problem-specific. Whenever you encounter a graph-based problem, you consult a graph-domain expert and curate the features appropriately.
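A minimal sketch of the two graph features discussed here, the mutual-friend count and a path-based feature (shortest path length). The edge list is an assumption reconstructed from the friendships mentioned in the text, since the figure itself may contain more users and edges; the function names are illustrative.

```python
from collections import deque

# Assumed edges, consistent with the friendships described in the text.
FRIENDSHIPS = [("U1", "U2"), ("U1", "U3"), ("U1", "U4"),
               ("U1", "U7"), ("U2", "U3"), ("U2", "U4")]


def build_graph(edges):
    """Undirected adjacency map: user -> set of friends."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def mutual_friends(graph, a, b):
    """Feature F1: number of friends that users a and b share."""
    return len(graph[a] & graph[b])


def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path (breadth-first search), or -1 if none."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        user, distance = queue.popleft()
        if user == target:
            return distance
        for friend in graph[user] - seen:
            seen.add(friend)
            queue.append((friend, distance + 1))
    return -1


graph = build_graph(FRIENDSHIPS)
for candidate in ("U3", "U7"):
    print(candidate,
          mutual_friends(graph, "U4", candidate),
          shortest_path_length(graph, "U4", candidate))
# U3 2 2
# U7 1 2
```

With these two numbers per pair, U3 looks like a stronger recommendation for U4 than U7, which matches the mutual-friends intuition.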
Suppose you have a single numerical feature, X. There are a bunch of mathematical transformations we can apply, like log(X), the square of X, the cube of X, the square root of X, or a polynomial function of X.
Similarly, we can apply trigonometric functions, like sin(X), cos(X), tan(X). The big question is: what is the appropriate transform?
It’s very problem-specific. For example, if our single feature X follows a power-law distribution, then an appropriate transformation is the log function, because applying a log function to a power-law distribution compresses its long tail and brings it roughly closer to a Gaussian distribution.
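To see the effect of a log transform on a heavy-tailed feature, here’s a small sketch that compares skewness before and after. The Pareto samples stand in for a power-law feature and are an assumption, as are the helper names. How close the result gets to a Gaussian depends on the data, so check the transformed histogram instead of taking it for granted.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Hypothetical power-law feature: Pareto samples (always >= 1, so log is safe).
rng = random.Random(0)
f1 = [rng.paretovariate(2.0) for _ in range(5000)]
log_f1 = [math.log(v) for v in f1]

skew_before = skewness(f1)
skew_after = skewness(log_f1)
print(skew_before, skew_after)  # the log-transformed feature is much less skewed
```

If the feature can be zero or negative, a plain log fails; a shifted variant such as log(1 + X) is a common workaround for non-negative data.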
Some specific featurization works well only for specific types of models. Suppose we have a feature f1, and when we plot its PDF (probability density function), we get a power-law distribution. What if we want to apply logistic regression on top of it?
Logistic regression can be derived from Gaussian Naive Bayes (GNB) with a Bernoulli distribution over yi. Gaussian Naive Bayes assumes that its features are Gaussian-distributed, so from this point of view logistic regression works best when the features are roughly Gaussian too.
The feature F1 we have is power-law distributed. Since we’re using logistic regression, it makes sense to apply a log transform and turn F1 into Log(F1), which brings the feature closer to a Gaussian distribution. This choice is very much model-specific: we make it because we’re using logistic regression.
Orthogonality is a mathematical term; here it loosely means that the features are uncorrelated with each other. The crux is that the more different/orthogonal the features, the better the model and its performance.
Assume we have features f1,f2,f3 and we need to predict Y. It can be a classification or regression problem.
f1, f2, f3 → Y
Imagine that f1, f2, and f3 are correlated with Y and that’s how it should be. If all the features are correlated to Y, then these features are useful.
If f1 is highly correlated with f2 and f3, the features carry overlapping information, so the overall effect of combining them will be smaller.
On the other hand, if the same features f1, f2, and f3 are correlated with Y but not among themselves, then the overall impact of combining the features is much better than in the previous case.
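One quick way to check this in practice is to look at pairwise correlations. The sketch below builds synthetic features (an assumption for illustration): f2 is nearly a copy of f1, while f3 is independent of f1, and Y depends on f1 and f3.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n = 2000
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [v + rng.gauss(0, 0.1) for v in f1]  # nearly redundant with f1
f3 = [rng.gauss(0, 1) for _ in range(n)]  # independent of f1
y = [a + c + rng.gauss(0, 0.5) for a, c in zip(f1, f3)]

print("corr(f1, f2):", pearson(f1, f2))  # close to 1: overlapping information
print("corr(f1, f3):", pearson(f1, f3))  # close to 0: roughly orthogonal
print("corr(f3, y): ", pearson(f3, y))   # clearly positive: f3 is useful
```

Here f2 adds little on top of f1, while f3 is both useful and roughly orthogonal to f1. Pearson correlation only captures linear relationships, so it’s a first check and not a full test of redundancy.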
Conclusion – real-world applications vs competitions
Real-world machine learning problems are very complicated. They include several stages, each of which is very important and requires attention.
Take the example of an anti-spam system. Just the basic steps are a lot:
- Before any machine learning happens, you need to understand the problem from a business point of view. What do you want to do? For what? How can it help your users?
- Next, you need to formalize the task. What is the definition of spam? What exactly is to be predicted?
- Collect data. You should ask yourself, what data can we use? How to mine examples of spam and non-spam?
- Clean and pre-process your data.
After that, you finally move on to the model building. To do this, you need to answer the questions, which class of model is appropriate for this particular task? How to measure performance? How to select the best model? Next, you check the effectiveness of the model in a real scenario to make sure that it works as expected, and there was no bias introduced by the learning process. Does the model block spam? How often does it block non-spam emails? If everything is fine, the next step is to deploy the model. In other words — make it available to users.
However, the process doesn’t end here. You need to monitor the model performance and re-train it on new data. In addition, you need to periodically revise your understanding of the problem and go through the cycle again and again. In contrast, in competitions, we have a much simpler situation:
In competitions, formalization and evaluation are already done. All the data is collected and the target metrics are fixed. Therefore, you mainly focus on pre-processing the data, picking models, and selecting the best ones. But sometimes you need to understand the business problem to get insights or generate a new feature. Organizers might also let you use external data. In such cases, data collection becomes a crucial part of the solution.
Things to care about:
|Aspect||Real life||Competition|
|Choice of target metric||Yes||No|
|Target metric value||Yes||Yes|
With all that being said, featurization is very specific to a particular domain. Of course, no one can be an expert in all the domains and featurization is an art that takes practice. To learn more about featurization, I strongly recommend content from Kaggle. For every competition, you’ll find an interview with the winners and the code they submitted.
For example, if we go to this page – Mercedes-Benz Greener Manufacturing challenge – we can find all the details, like the person that solved it, their background, and much more. In addition, you’ll also find a nice technical discussion. Winning projects from Kaggle competitions are definitely worth checking out if you want to get better at feature engineering.
Thank you for reading!