Feature Stores: Components of a Data Science Factory [Guide]
Ineffective and expensive feature engineering practices often plague companies that work with large amounts of data, and keep them from running a sophisticated machine learning operation. A lot of time is spent fetching data for ML purposes, yet it often remains unclear whether the data used to train a model is consistent with the data the model sees when it is served.
As a result of this slowed-down process and inability to reproduce results, project stakeholders may lose trust in positive ML outcomes. How can we avoid this situation?
The Feature Store is part of the answer. It’s a crucial part of the data science infrastructure, meant to establish a stable pipeline for end users.
What is a Feature Store?
A Feature Store is a service that ingests large volumes of data, computes features, and stores them. With a Feature Store, machine learning pipelines and online applications have easy access to data.
Feature Stores are typically implemented as a dual database, designed to serve data both in real time and in batches:
- Online feature stores serve online applications with data at low latency. Examples include MySQL Cluster, Redis, and Cassandra.
- Offline feature stores are scale-out SQL databases that provide data for developing AI models and make feature governance possible for explainability and transparency. Examples include Hive, BigQuery, and Parquet files.
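To make the online/offline split concrete, here is a minimal in-memory sketch. The class `ToyFeatureStore`, its method names, and the feature name `trips_7d` are hypothetical and only illustrate the idea: every write goes to an append-only history (offline), while a separate map keeps just the latest value per entity (online).

```python
from datetime import datetime


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a dual-database feature store."""

    def __init__(self):
        # Offline store: full history of every write, used for training.
        self.offline = []
        # Online store: latest value per (entity, feature), used for serving.
        self.online = {}

    def write(self, entity_id, feature, value, ts):
        self.offline.append((entity_id, feature, value, ts))
        key = (entity_id, feature)
        # Keep only the most recent value in the online store.
        if key not in self.online or ts >= self.online[key][1]:
            self.online[key] = (value, ts)

    def get_online(self, entity_id, feature):
        """Key lookup of the latest value, as an online application would do."""
        return self.online[(entity_id, feature)][0]

    def get_history(self, feature):
        """Batch read of all historical rows for one feature."""
        return [row for row in self.offline if row[1] == feature]


store = ToyFeatureStore()
store.write("user_1", "trips_7d", 3, datetime(2021, 1, 1))
store.write("user_1", "trips_7d", 5, datetime(2021, 1, 8))
```

A real deployment would back `online` with a key-value database and `offline` with a warehouse or data lake; the dictionary and list here only mimic their access patterns.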
Feature Store vs Data Lake vs Data Warehouse
At an abstract level, Feature Stores offer a subset of the functionalities of a Data Lake. Feature Stores are specialized in storing features for machine learning applications, whereas Data Lakes are a centralized repository for data that goes beyond features and is used for analytical purposes as well. Data warehouses, on the other hand, provide a relational database with a schema, which business analysts query in SQL to generate reports, dashboards and more.
Issues associated with Feature Engineering
- Dynamic nature of feature definitions: A particular feature might end up having different definitions across multiple teams within the organization. If it is not properly documented, it gets increasingly difficult to maintain the logic behind the feature and the information it conveys. Feature Stores address this problem by maintaining a single canonical definition for each feature. This makes life easier for the end users of the data, and keeps the results derived from that data consistent.
- Redundancy of features: When data has to be sourced in raw form, data scientists spend a lot of time re-extracting features that might have been already extracted by others in the team or by pipelines currently using the data. Since Feature Stores provide a single source of truth, data scientists can spend less time on feature engineering, and more time on experimenting and building.
- Gap between experimentation and production environments: Products often use a handful of programming languages and frameworks, which could be different from the tools used to experiment with a machine learning model. Feature logic then has to be re-implemented for production, and this gap can cause inconsistencies that get overlooked and ultimately worsen the model/product on the customer’s end. In other words, the expected behavior/performance of the model in development should more or less be reproducible when deployed in the product. With Feature Stores, this re-implementation is unnecessary, as they maintain consistency between experimentation and production.
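The consistency argument above can be illustrated with a small sketch in which one feature definition feeds both the training path and the serving path. The feature `days_since_signup` and the helper names are hypothetical; the point is only that neither path re-implements the logic.

```python
from datetime import date


def days_since_signup(signup: date, as_of: date) -> int:
    """Single definition of the feature, shared by every consumer."""
    return (as_of - signup).days


def build_training_rows(users, as_of):
    # Offline path: compute the feature for a batch of historical users.
    return [
        {"user": u["id"], "days_since_signup": days_since_signup(u["signup"], as_of)}
        for u in users
    ]


def build_serving_vector(user, as_of):
    # Online path: the same function is called for a single request.
    return [days_since_signup(user["signup"], as_of)]


users = [{"id": "a", "signup": date(2021, 1, 1)}]
rows = build_training_rows(users, date(2021, 1, 31))
vector = build_serving_vector(users[0], date(2021, 1, 31))
```

Because both paths call the same function, a change to the definition reaches training and serving at the same time.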
Benefits of a Feature Store in ML pipelines
The output from Feature Stores is implementation-agnostic. No matter which algorithm or framework we use, the application/model will get data in a consistent format. Another major benefit of using a Feature Store is saving time that would otherwise be spent computing features.
Components of a Feature Store
- Feature Registry: The Feature Registry provides a central interface that data consumers, such as data scientists, use to maintain a list of features along with their definitions and metadata. This central repository can be updated to meet your needs.
- Operational Monitoring: Machine learning models may perform well during initial stages, but you must still monitor them for correctness and reliable results over time. The relationship between the independent and dependent variables in a machine learning model can always degrade, primarily because incoming data changes in complex ways, so predictions become less stable and less accurate with time. This phenomenon is called Model Drift and can be classified into two types:
- Concept Drift: Statistical properties of the target variable change over time.
- Data Drift: Statistical properties of the predictor variables change over time.
The Feature Store maintains data quality and correctness. It provides an interface to internal and external applications/tools to ensure proper model performance in production.
- Transform: In machine learning applications, data transformation pipelines absorb raw data and transform it into usable features. Feature Stores are responsible for managing and orchestrating them. There are three main types of transformations:
- Stream for data with velocity (for example real-time logs),
- Batch for stationary data,
- On-demand for data that cannot be pre-computed.
- Storage: Features that are not immediately required are kept offline, usually in warehouses such as Snowflake, Hive, or Redshift. Online features, on the other hand, are required in real time and stored in low-latency databases such as MongoDB, Cassandra, or Elasticsearch.
- Serving: The logic behind feature extraction and processing is abstracted, which makes Feature Stores very attractive. While data scientists access the snapshot of data (point-in-time) for experimentation purposes, Feature Stores ensure that these features are constantly updated in real-time, and readily available to the applications that need them.
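As an illustration of the Data Drift check mentioned under Operational Monitoring, the sketch below computes the Population Stability Index (PSI) between a training sample and a serving sample of one feature. PSI is one common choice, not something the components above prescribe; the bin edges and sample values are made up, and the frequently quoted rule of thumb that a PSI above roughly 0.2 signals a meaningful shift is a convention rather than a guarantee.

```python
import math


def psi(reference, current, edges):
    """Population Stability Index between two samples over fixed bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
serving_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [4, 7]  # hypothetical bin boundaries

score_same = psi(training_values, training_values, edges)
score_shifted = psi(training_values, serving_values, edges)
```

Identical samples give a score of zero, while the shifted serving sample produces a clearly larger score that a monitoring job could alert on.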
Feature Stores and MLOps
What is MLOps?
MLOps is about applying DevOps principles to building, testing and deploying machine learning pipelines. With MLOps practices, teams can deploy better models, more frequently. Challenges of MLOps include:
- data versioning,
- managing specialised hardware (GPUs, etc.),
- managing data governance and compliance for models.
MLOps vs DevOps
Traditionally, developers use Git to version control code over time, which is necessary for automation and continuous integration (CI). This makes it easy to automatically reproduce an environment.
Every commit to Git triggers the automated creation of packages that can be deployed using information in version control. Jenkins is used alongside version control software to build, test and deploy code, so that it behaves in a controlled and predictable way. The steps involved with Jenkins are:
- Provisioning of Virtual Machines (VMs)/containers.
- Fetching code onto these machines.
- Compiling the code.
- Running tests.
- Packaging binaries.
- Deploying binaries.
The most important part of MLOps is versioning data. You can’t do it with Git, as it doesn’t scale to large volumes of data. A simple machine learning pipeline consists of the following:
- Validation of incoming data.
- Computation of features.
- Generation of training and testing data.
- Training of the model.
- Validation of the model.
- Model deployment.
- Monitoring the model in production.
This can get even more complex when you add hyperparameter tuning, model explainability and distributed training into the picture.
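The pipeline steps listed above can be sketched as plain functions chained together. Everything here is a toy assumption: the data, the squared-input feature, the one-parameter model, and the 0.5 error threshold used as a deployment gate. Monitoring in production is left out.

```python
def validate(rows):
    # Drop rows with missing values.
    return [r for r in rows if r["x"] is not None and r["y"] is not None]


def compute_features(rows):
    # Hypothetical feature: the square of the raw input.
    return [{"x_sq": r["x"] ** 2, "y": r["y"]} for r in rows]


def split(rows, test_fraction=0.25):
    # Keep the last rows as held-out test data.
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train(rows):
    # Least-squares slope through the origin: y ~ w * x_sq.
    num = sum(r["x_sq"] * r["y"] for r in rows)
    den = sum(r["x_sq"] ** 2 for r in rows)
    return num / den


def evaluate(w, rows):
    # Mean absolute error on held-out data.
    return sum(abs(w * r["x_sq"] - r["y"]) for r in rows) / len(rows)


raw = [{"x": 1, "y": 2}, {"x": 2, "y": 8}, {"x": None, "y": 1},
       {"x": 3, "y": 18}, {"x": 4, "y": 32}]
clean = validate(raw)
train_rows, test_rows = split(compute_features(clean))
model = train(train_rows)
error = evaluate(model, test_rows)
deployed = model if error < 0.5 else None  # deploy only if validation passes
```

In a real setup each function would be a separate, independently rerunnable pipeline stage, which is where the orchestration frameworks discussed next come in.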
Orchestration frameworks help in automatic workflow execution, model retraining, data passing between components, and workflow triggering based on events. Some of these frameworks are:
- TensorFlow Extended (TFX) – supports Airflow, Beam, Kubeflow pipelines
- Hopsworks – supports Airflow
- MLflow – supports Spark
- Kubeflow – supports Kubeflow pipelines
TFX, MLflow and Hopsworks support distributed processing with Beam and Spark, which lets execution scale out on clusters to handle large amounts of data.
Machine Learning pipelines
DevOps CI/CD is mostly triggered by source code updates. MLOps and DataOps CI/CD pipelines may be triggered not just by source code updates, but also data updates and data processing:
- DataOps mostly automates the testing and deployment of data pipelines,
- MLOps automates the process of training, validating and deploying models.
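One simple way to detect both kinds of trigger is to fingerprint the code and the data and compare them against the last successful run. This is a sketch under that assumption; `should_retrain` and the `last_run` record are hypothetical names, and real systems usually react to events from version control or data ingestion instead of polling hashes.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Content hash used to detect changes."""
    return hashlib.sha256(payload).hexdigest()


def should_retrain(last_run, code: bytes, data: bytes):
    """Return the reasons a new pipeline run is needed (empty list if none)."""
    reasons = []
    if fingerprint(code) != last_run["code"]:
        reasons.append("code")
    if fingerprint(data) != last_run["data"]:
        reasons.append("data")
    return reasons


last_run = {"code": fingerprint(b"def train(): ..."), "data": fingerprint(b"1,2,3")}
unchanged = should_retrain(last_run, b"def train(): ...", b"1,2,3")
data_changed = should_retrain(last_run, b"def train(): ...", b"1,2,3,4")
```

A source-only CI system would miss the second case, because the code is byte-for-byte identical and only the data has changed.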
For end-to-end machine learning pipelines, you need very deliberate feature engineering, which can take up the majority of your bandwidth. A Feature Store helps by splitting the pipeline into two stages that meet at the store:
- A feature pipeline that ingests data, validates it, and transforms it into consumable features.
- A model pipeline in which machine learning algorithms consume these features and get trained, validated and pushed into production.
The issue with machine learning pipelines is their stateful nature, whereas a good data pipeline should be stateless and idempotent. Before deploying a new model (in the validation stage), we need a lot of information about how well it’s performing, what assumptions we’re making, the impact of the model, and so on. Usually, developers end up re-writing code over and over again to define inputs and outputs properly.
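To show what idempotent means for a feature pipeline, the sketch below contrasts a keyed upsert with a plain append when the same batch is retried. The key layout (entity, feature, event time) and the function names are assumptions chosen for illustration.

```python
def upsert_features(store, rows):
    """Idempotent write: the key decides the slot, so replays overwrite."""
    for row in rows:
        key = (row["entity"], row["feature"], row["event_time"])
        store[key] = row["value"]
    return store


def append_features(log, rows):
    """Non-idempotent write: each replay adds duplicate rows."""
    log.extend(rows)
    return log


batch = [
    {"entity": "u1", "feature": "clicks_1h",
     "event_time": "2021-01-01T10:00", "value": 4},
]

store, log = {}, []
for _ in range(3):  # simulate the same batch being retried three times
    upsert_features(store, batch)
    append_features(log, batch)
```

After three retries the keyed store still holds one row, while the append-only log holds three copies that downstream training would count as separate observations.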
Hopsworks offers an unobtrusive metadata model, where pipelines read from and write to HDFS and interact with the feature store using the Hopsworks API. This way we can store metadata, artifacts, model provenance and more, without re-writing code as required by TensorFlow Extended (TFX) or MLflow.
Some industry best practices, along with relevant tools that help achieve them, are as follows:
- Unit test and continuous integration with Jenkins.
- Data validation using TFX or Deequ, so that features have expected values.
- Test for uniqueness, missingness and distinctiveness using Deequ.
- Validate data distributions using TFX or Deequ.
- Check pairwise relationships between features, and between each feature and the target variable, using Deequ.
- Custom tests to measure cost of each feature.
- Test for Personally Identifiable Information (PII) leaks.
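The missingness, distinctness and uniqueness tests in the list above can be sketched in plain Python. This is not the Deequ or TFX API; the function names, the exact metric definitions, the 0.9 completeness threshold and the allowed range are all assumptions made for illustration.

```python
from collections import Counter


def completeness(values):
    """Fraction of values that are not missing."""
    return sum(v is not None for v in values) / len(values)


def distinctness(values):
    """Fraction of non-missing values that are distinct."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present)


def uniqueness(values):
    """Fraction of non-missing values that occur exactly once."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return sum(1 for c in counts.values() if c == 1) / len(present)


def check_feature(values, min_completeness=0.9, allowed_range=(0, 120)):
    """Return a list of failed checks for one feature column."""
    failures = []
    if completeness(values) < min_completeness:
        failures.append("completeness")
    low, high = allowed_range
    if any(v is not None and not low <= v <= high for v in values):
        failures.append("range")
    return failures


ages = [25, 31, 31, None, 47, 230]
failures = check_feature(ages)
```

Running such checks before ingestion lets a pipeline reject a bad batch instead of silently writing unexpected values into the feature store.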
Feature Store architectures
Uber’s Michelangelo Machine Learning platform
In 2017, Uber introduced Michelangelo as an ML-as-a-service platform to make scaling up AI easy. With an ever-growing customer base and a huge influx of rich data, Uber has deployed Michelangelo across its multiple data centers to run its online applications. The platform was born out of necessity: before Michelangelo, data scientists and engineers had to create separate predictive models and bespoke systems, which was not sustainable at scale.
Michelangelo is built on top of Uber’s infrastructure, combining components built in-house with mature open-source frameworks such as HDFS, Spark, XGBoost, and TensorFlow. It provides a Data Lake for all transactional data. Kafka brokers are deployed to aggregate data from all of Uber’s services, and the data is streamed via the Samza compute engine with Cassandra clusters.
The platform uses Hive/HDFS to store Uber’s transactional and log data. Features needed for online models are precomputed and stored in Cassandra, where they can be read at low latency at prediction time.
For feature selection, transformation, and to ensure that input data is checked for proper format and missingness, Uber developers built their own Domain Specific Language (DSL) as a subset of Scala. With it, end-users can add their own user-defined functions. The same DSL expressions are applied during training and prediction, for guaranteed reproducibility.
Michelangelo supports offline, large-scale distributed training of a variety of machine learning algorithms and deep learning networks, which makes it very scalable. The model type, hyper-parameters, data sources, DSL expressions and compute resource requirements are specified in a model configuration. It also provides hyper-parameter search.
The configured training job runs on a YARN or Mesos cluster, after which performance metrics are calculated and compiled into a report. When running partitioned models, training data is automatically partitioned based on the model configuration, and one model is trained per partition, with the parent model used as a fallback when needed. Information regarding the final model is stored in the model repository for deployment, and the report is saved for future analysis.
Training jobs can be managed via the Michelangelo UI, API, or even through a Jupyter Notebook.
When training is complete, a versioned object containing the following information is stored in Cassandra:
- Author of the model,
- Start and end time of training job,
- Model configuration,
- Reference to training and testing data,
- Feature level statistics,
- Model performance metrics,
- Learned parameters of the model,
- Summary statistics.
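A versioned model object with the fields listed above could be modelled as follows. This is a hypothetical sketch, not Michelangelo’s actual schema: the class names, field names, the in-memory repository, and the placeholder data path are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical versioned record mirroring the fields listed above."""
    version: int
    author: str
    training_start: datetime
    training_end: datetime
    configuration: dict
    data_reference: str
    feature_statistics: dict = field(default_factory=dict)
    performance_metrics: dict = field(default_factory=dict)
    learned_parameters: dict = field(default_factory=dict)
    summary_statistics: dict = field(default_factory=dict)


class ModelRepository:
    """In-memory stand-in for a model repository."""

    def __init__(self):
        self._records = {}

    def save(self, name, **fields):
        # Each save creates a new immutable version instead of overwriting.
        version = len(self._records.get(name, [])) + 1
        record = ModelRecord(version=version, **fields)
        self._records.setdefault(name, []).append(record)
        return record

    def latest(self, name):
        return self._records[name][-1]


repo = ModelRepository()
common = dict(
    author="alice",
    training_start=datetime(2021, 1, 1, 9),
    training_end=datetime(2021, 1, 1, 10),
    configuration={"model_type": "xgboost"},
    data_reference="hdfs://path/to/training_data",  # placeholder path
)
first = repo.save("eta_model", **common)
second = repo.save("eta_model", performance_metrics={"rmse": 1.2}, **common)
```

Keeping every version immutable means an older model can always be inspected or redeployed with the exact configuration and data reference it was trained on.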
Feast: an open-source feature store
Feast is an open-source feature store for machine learning that makes the process of creating, managing, sharing, and serving features easier. In 2019, Gojek introduced it in collaboration with Google Cloud.
It uses BigQuery + GCS + S3 for offline features, and Bigtable + Redis with Apache Beam for online features. Its main components are:
- Feast Core: This is where all the features and their respective definitions coexist.
- Feast Job Service: This component manages data processing jobs that load the data from sources into stores, and jobs that export data used for training.
- Feast Online Serving: This component provides the low-latency access that online features require.
- Feast Python SDK: This is used to manage feature definitions, launch jobs, retrieve training datasets and online features.
- Online Store: It stores the latest features for each entity. It can be populated by either batch ingestion or streaming ingestion jobs for a streaming source.
- Offline Store: Stores batch data used to train AI models.
How does it work?
- The log-streaming data is ingested from applications.
- Stream processing systems like Kafka and Spark are used to convert this data into stream features.
- Both the raw and stream features are then logged into the data lake.
- ETL/ELT transform data in the batch store.
- Features and their definitions are then registered in Feast Core.
- The Feast Job service polls for new and updated features.
- Batch ingestion jobs are short-lived; they load data into the offline and online stores.
- Stream ingestion jobs are long-lived; they read from streaming sources and make the data available to online applications.
- A machine learning pipeline is launched, data is used, all controlled by the SDK.
- According to model configurations, Feast provides point-in-time training data and features.
- The trained model is then served, and the backend requests predictions from the model serving system.
- Model Serving System requests online features from Feast Online Serving.
- Model Serving System makes predictions on online features and returns results.
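The point-in-time retrieval mentioned in the steps above can be sketched without any library: for each labelled event, pick the latest feature value whose timestamp is not after the event. This is a plain-Python illustration, not Feast’s API; the function names, integer timestamps and sample values are assumptions.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Latest feature value with timestamp <= as_of, or None if none exists.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


def build_training_set(labels, feature_history):
    """Join each labelled event with the feature value known at that time."""
    rows = []
    for entity, event_time, label in labels:
        value = point_in_time_value(feature_history.get(entity, []), event_time)
        rows.append({"entity": entity, "feature": value, "label": label})
    return rows


feature_history = {"driver_1": [(1, 0.50), (5, 0.70), (9, 0.90)]}
labels = [("driver_1", 6, 1), ("driver_1", 0, 0)]
training_set = build_training_set(labels, feature_history)
```

The event at time 6 is paired with the value written at time 5, never the later one from time 9, which is what keeps future information from leaking into training data.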
Hopsworks’ Feature Store
Data Engineers are primarily responsible for adding/updating features, which could be computed with SQL queries or even complex graph embeddings, using notebooks, programs written in Python, Java or Scala, and even Hopsworks’ UI. Programs ingest data in the form of Pandas or Spark dataframes.
Feature data is validated before ingestion using the Data Validation API. A UI platform is provided by Hopsworks for establishing data validation rules through which feature statistics can also be viewed. Hopsworks also supports the creation of more than one feature store, because one feature store should not necessarily be accessible to all parts of an enterprise.
Data scientists use Feature Stores to split data into training and testing sets for building machine learning models. Online applications use it to create a feature vector which is later used for inference. In addition to this, users can also query for point-in-time data.
Features are measurable properties, wherein each feature belongs to a Feature Group with an associated key for computation. Data Scientists can generate training and testing data by providing/selecting a set of features, the target file format for the output of features (CSV, TFRecords, Numpy, etc.), and the target storage system (GCS, AWS S3, etc.).
There are two ways to calculate feature groups:
- On-demand: There is built-in support for external DBs that lets you define features on external data sources.
- Cached: Features are pre-computed and stored in the Hopsworks Feature Store itself, which can scale up to petabytes of feature data.
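One way to read the on-demand versus cached distinction is sketched below. This is not the Hopsworks API: the two classes, the dictionary standing in for an external database, and the `avg_order` transform are hypothetical, and the sketch assumes that cached features are stored as pre-computed values while on-demand features are computed from the external source at read time.

```python
class CachedFeatureGroup:
    """Features are pre-computed and stored inside the feature store."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def read(self, key):
        return self._rows[key]


class OnDemandFeatureGroup:
    """Features are computed from an external source at read time."""

    def __init__(self, source, transform):
        self._source = source  # hypothetical handle to an external database
        self._transform = transform

    def read(self, key):
        return self._transform(self._source[key])


def avg_order(record):
    return sum(record["orders"]) / len(record["orders"])


external_db = {"u1": {"orders": [10, 20, 30]}}
cached = CachedFeatureGroup({"u1": avg_order(external_db["u1"])})
on_demand = OnDemandFeatureGroup(external_db, avg_order)

external_db["u1"]["orders"].append(60)  # the external source changes
cached_value = cached.read("u1")
fresh_value = on_demand.read("u1")
```

The cached group keeps returning the stored value until it is recomputed, while the on-demand group reflects the external source on every read at the cost of doing the computation each time.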
Hopsworks provides great documentation on how to use their API and get started with its feature store.
Tecton’s Feature Store
While most organisations have built feature stores for internal use, Tecton has been building its platform to be offered as a service to enterprises. Its founding members were originally at Uber, where they built Michelangelo. Taking inspiration from Uber’s product, Tecton built its own and started offering it as a service. They also contribute to the open-source feature store Feast.
Conclusion
- Feature Stores speed up and stabilise the process of extracting and transforming data, engineering features, and storing them for easy access for both offline and online needs.
- If feature stores are not already ubiquitous, they should be the next must-have for every organisation that aims to build the best AI products without spending its bandwidth on operational work.
- Traditional CI/CD pipelines are not suitable for handling data and machine learning models, which introduces the requirement of MLOps and DataOps.
- Examples of feature stores include Uber’s Michelangelo, Feast, Hopsworks’ Feature Store and Tecton’s Feature Store.