Over the Christmas break, I finally had some time to catch up on all the bookmarked articles, saved podcasts, and starred GitHub repositories. While going over them, I noticed that there was one recurring trend that was present in all of them.
All of the new shiny data product ads had the same theme going: “We Work with Modern Data Stack”
The Modern Data Stack (MDS) has been popular for a couple of years, but only recently has there been convergence on its definition. In short, if you use any of the following, you are likely to have the foundational piece of a Modern Data Stack:
- Redshift (AWS)
- BigQuery (GCP)
- Synapse (Azure)
But hang on a bit: just having modern tools like Snowflake does not mean you’re modern. It all comes down to how you use them. Before we dive into what the philosophical and technical indicators of an MDS are, let’s first talk about the failures of the traditional data stack.
Why is MDS gaining popularity?
For one simple reason: the traditional data stack (TDS) is failing to deliver and keep up with the data demands of any modern organization. To maintain a competitive advantage, organizations need data that they can act on at the right time, and a stack flexible enough to adapt to change. A TDS typically refers to an on-premises Hadoop ecosystem and SQL warehouses that are both logically coupled and complex.
Imagine a TDS like a Christmas light decoration in a box. The lights are needed at the right time for the Christmas party but you discover that some of the bulbs need to be changed. You have to untangle the whole thing to find the broken bulb. After a while, you finally replace the bulb but by the time you are done, the party is over.
However, with the barrage of new tech constantly popping up, it’s understandably difficult to keep up and determine which of the new technologies can bring value to your business. Is this MDS just another buzzword that will fizzle out before a new one is invented?
So, before we dive into what a modern data stack is, let’s first take a look at some of the problems faced by organizations that are still using a TDS.
A typical TDS setup leads to three major problems
1. Long turn-around time to untangle and set up infrastructure
- Companies making use of on-premises infrastructure are responsible for all the costs associated with it, such as the army of engineers required to keep everything maintained and running smoothly.
- Since the setup is so deeply interconnected, what may seem like a minor change might break other parts of the system. Finding the exact logical coupling between the systems requires a lot of work-hours to analyze before any improvements can be made to the existing landscape.
2. Slow response to new information
- As the company grows, so do its data and computational power needs. Scaling out (expanding) on-premises infrastructure is very costly in terms of resources and time.
- Since on-premises infrastructure is difficult to scale, this naturally leads to a limit in how much computational power there is to analyze data. Data pipelines can take hours to complete, a problem that is compounded as the organization grows.
- A TDS requires slow ETL (Extract, Transform, Load) operations before newly ingested data can conform to the rest of the data model. A new data update can take weeks and many hours of refactoring before insights appear. By the time the data is ready, the organization is unable to act in time, resulting in missed opportunities.
3. Expensive journey to insights
- A lot of the report generation is manually done, especially when the data is coming from different sources. The report is manually generated, manually cleaned, and manually transferred to Excel (gasp!). This leads to errors being made, time being taken away from other business-critical tasks, and the inability to scale.
- Analysts are unable to efficiently perform their roles due to the complex landscape. Data engineers are pulled into operational queries that prevent them from doing their actual jobs (like making the data pipelines more scalable!).
Seeing how competitive the business landscape is and the need to adapt to new information quickly, it’s quite clear that the traditional data stack is not an ideal solution. This is where the modern data stack comes in to help your business remain competitive.
What are the benefits of the modern data stack?
1. Move from IT-focused to a business-focused operating model
- With an MDS, your organization regains the freedom to focus on the business side of things instead of being bogged down by IT-related woes.
- Your organization can have leaner data teams and can focus on the higher value data tasks instead of losing time with the administration and performance optimization of the traditional data stack.
- The tools offered by an MDS are designed with greater accessibility in mind (no-code or little code needed), greatly lowering the technical barrier to entry.
- An MDS treats self-service as core functionality, reducing dependencies on data professionals. This means that CMOs can extract campaign analytics themselves and view data teams as enablers rather than bottlenecks.
2. Long-term commitments are replaced with plug-and-play flexibility
- Since infrastructure is no longer on-premises but deployed in the cloud, companies no longer have to worry about hardware/platform maintenance and its associated costs (which results in significant savings).
- Storage & compute are available on-tap, improving the data processing response time through the cloud provider’s elasticity.
- The modern data stack makes use of software-as-a-service (SaaS) platforms, which provide out-of-the-box tools. This means that your team can get to work with minimal setup requirements. (Hence “We Work with Modern Data Stack” as the slogan of every new DataOps/MLOps tool.)
3. Moving beyond once-off analytics to operational BI and AI
- Modern data stacks are much faster to set up and iterate on, eliminating the requirement for large IT teams. This allows non-tech companies to start generating actionable insights within a few hours, instead of the usual days or weeks.
- Data can come from a variety of first and third-party sources. A modern data stack can integrate all of these sources into its data ingestion tool which in turn will work with business intelligence tools.
4. Treating data governance as a first-class citizen
- Governance is built into the process, so problems can be detected and mitigated earlier.
- The tools provided by MDS vendors allow for better data quality, privacy control, and access governance. With the rise of cybersecurity threats, responsible AI, and increasing regulations on data, systems built without data governance in mind are every CIO’s nightmare. Failing to protect data could lead to disastrous consequences for the organization.
- While it’s still a challenge to secure an entire stack, MDS technology providers don’t treat data governance as an afterthought. This results in data governance being part of the process across the entire stack.
What is the modern data stack then?
Quite simply, the modern data stack (MDS) is a set of tools hosted in the cloud that enables highly efficient data integration for an organization. We believe that MDS is the foundation of DataOps and MLOps.
MDS creates clean, trustworthy, and always available data that can empower business users to make self-service discoveries, enabling a truly data-driven culture.
What are the components of an MDS?
The MDS is comprised of multiple layers stacked on top of each other (like a cake) and each layer has its own function.
1. Data ingestion
This is where the data is transported from various sources (databases, server logs, third-party apps, etc) into a storage medium.
2. Data storage
A data warehouse or a data lake (or a lakehouse!) is a (typically cloud-based) solution that is used to store all the collected data sent from the data ingestion tool. Here the data can be accessed and analyzed.
3. Data transformation
Once the raw data has been moved into storage, it will need to be transformed into user-friendly data models. This allows the analysts or data scientists to easily query the data to extract insights, build dashboards or even ML models.
4. Data analytics/ business intelligence
Here the data is analyzed and dashboards are created for users to explore the data. Modern data analytical tools have also been designed with non-technical users in mind. This empowers domain experts to answer business questions without depending on developers and analysts.
5. Data governance
Data catalogues and governance
- These tools allow organizations to keep track of and make sense of their data, which helps with data discoverability, quality, and sharing. Without these tools, the data lake can easily become a data swamp.
Data privacy and access governance
- These tools help an organization to stay legally compliant when it comes to data protection. Problems such as data breaches of sensitive data can be mitigated.
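To make the layers a bit more concrete, here is a minimal sketch of how ingestion, storage, transformation, and analytics hand off to one another. It is an illustration only: Python's built-in sqlite3 stands in for a cloud warehouse, and the table, view, and function names (raw_orders, customer_revenue, ingest, transform, report) as well as the sample rows are made up for this example.

```python
import sqlite3

# Hypothetical raw records, standing in for data pulled from source systems.
RAW_ORDERS = [
    ("o-1", "alice", "2024-01-03", 120.0),
    ("o-2", "bob", "2024-01-03", 80.0),
    ("o-3", "alice", "2024-01-04", 50.0),
]


def ingest(conn, rows):
    """Ingestion + storage: land the raw rows unchanged in the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def transform(conn):
    """Transformation: build an analyst-friendly model on top of the raw table."""
    conn.execute(
        "CREATE VIEW IF NOT EXISTS customer_revenue AS "
        "SELECT customer, COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM raw_orders GROUP BY customer"
    )


def report(conn):
    """Analytics: the kind of query a BI tool would issue against the model."""
    cursor = conn.execute(
        "SELECT customer, orders, revenue FROM customer_revenue ORDER BY customer"
    )
    return cursor.fetchall()


def run_pipeline():
    """Run the layers in order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    ingest(conn, RAW_ORDERS)
    transform(conn)
    return report(conn)


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Note that the raw data is loaded first and modelled afterwards inside the storage layer. In a real stack each function would be a separate tool, but the hand-offs between the layers look much the same.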
Do I need all these different components?
The good news is that, no, you do not need all of them for it to function! Setting up an MDS is like ordering food: you can set it up in a way that matches your needs at the time. For example, you can order a cake but hold the cream. The important thing to note is that despite not having any cream, you still end up with a cake that you can eat.
An MDS setup is modular and designed to be compatible with other components and tools (plug-and-play). This means you can switch components as required by your organization. You can also customize your setup to work with your existing infrastructure instead of deprecating it entirely.
Another advantage of this modular (as opposed to monolithic) design is that you can swap individual components and avoid vendor lock-in. Don’t like a particular tool the vendor has for the data storage layer? Swap to a different vendor that fits your needs better. If the organization is young, it most likely does not need all the components at once as its needs are simpler. As the organization grows, it can switch or add in more components as needed.
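For the more code-minded, the plug-and-play idea can be sketched as a small contract that every storage component agrees to. Everything here is hypothetical and for illustration only: the Storage interface, the two stand-in "vendors" (InMemoryWarehouse and JsonLinesLake), and the load_and_count helper are invented names, not real products or APIs.

```python
import json
from typing import Protocol


class Storage(Protocol):
    """Hypothetical contract that every storage component must satisfy."""

    def write(self, table: str, rows: list[dict]) -> None: ...

    def read(self, table: str) -> list[dict]: ...


class InMemoryWarehouse:
    """Stand-in for one vendor's storage layer: keeps rows as Python dicts."""

    def __init__(self) -> None:
        self._tables: dict[str, list[dict]] = {}

    def write(self, table: str, rows: list[dict]) -> None:
        self._tables.setdefault(table, []).extend(rows)

    def read(self, table: str) -> list[dict]:
        return list(self._tables.get(table, []))


class JsonLinesLake:
    """Stand-in for a different vendor: keeps each row as a JSON string."""

    def __init__(self) -> None:
        self._files: dict[str, list[str]] = {}

    def write(self, table: str, rows: list[dict]) -> None:
        self._files.setdefault(table, []).extend(json.dumps(r) for r in rows)

    def read(self, table: str) -> list[dict]:
        return [json.loads(line) for line in self._files.get(table, [])]


def load_and_count(storage: Storage, rows: list[dict]) -> int:
    """Pipeline code depends only on the contract, not on a specific vendor."""
    storage.write("events", rows)
    return len(storage.read("events"))


if __name__ == "__main__":
    sample = [{"id": 1}, {"id": 2}]
    print(load_and_count(InMemoryWarehouse(), sample))
    print(load_and_count(JsonLinesLake(), sample))
```

Because load_and_count only knows about the contract, switching vendors is a one-line change at the call site. That is the property to look for when evaluating whether a tool is truly plug-and-play.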
Examples of different MDS setups
Not all organizations are the same and not everything is one-size-fits-all. Below are examples of the tools that different types of organizations can use in their MDS.
1. Enterprise MDS with both business intelligence and data science requirements
With many organizations having subscriptions to Microsoft 365, Power BI comes as a natural choice since it is included in the enterprise subscription. As the requirement for real-time reporting becomes increasingly dominant, having structured streaming in combination with Power BI allows seamless integration into existing analytics architecture.
2. Medium sized analytics team with hybrid/multi-cloud ambition
SMEs have a varying degree of needs and tend to mix and match tools and cloud providers. Snowflake is a suitable choice as it is cloud-agnostic and compatible with most ETL tools. The tools listed are expensive compared to the Azure solutions.
3. Data-driven start up
Start-ups have smaller teams with simpler infrastructure needs, so the tools need to be both cost-effective and easy to use. For example, Metabase is a visualization tool that lets users build charts without SQL knowledge and without the help of BI experts.
How difficult is it to set up an MDS?
For organizations embarking on a completely new landscape, it can be incredibly simple as the major cloud providers provide MDS templates (e.g. AWS Lake Formation). But for organizations with an existing traditional data stack, it is not as simple as moving everything to the cloud.
Careful re-architecting will be critical if you are moving from an existing, mature data stack to the cloud. If your new cloud infrastructure is set up in a coupled, monolithic way (for example, a bunch of on-premises virtual machines simply moved to the cloud), you will just be wasting your time.
The next section outlines the important things to watch out for when setting up a new MDS.
Things to watch out for when using an MDS
We have to remember that the MDS is not just meant for professional data scientists, but for anyone who wants to work with data. Since an MDS is modular by design, many organizations tend to find all the best tools and integrate them together. Problem solved, right?
The issue with this approach is that the MDS is now built around the tools and not for the user. While this is fine from an architectural and engineering point of view, it feeds into the most common failure mode: a poor and frustrating user experience.
When implementing an MDS for the first time, the usual approach has been to see what the organization needs and purchase tools accordingly (dashboards, analytics, etc). This, unfortunately, builds up to an MDS that is a disjointed collection of fancy tools; a far cry from the collaborative stack meant for problem-solving.
An MDS with a poorly thought-out user experience will lead to a beautifully engineered data platform with zero adoption from the analysts and scientists it is trying to support.
What’s the difference between a good MDS and a bad one?
It all comes down to one simple concept: the user experience. Just because an organization has the best and most expensive tools does not guarantee there will be harmony. The users of the tools should be able to get the job done without feeling like they’re fighting an uphill battle. Essentially, an organization should build an MDS that is designed around what is best for its users.
Design your Modern Data Stack by keeping in mind the needs and the pain points of the users:
Be empathetic and inclusive
- The MDS needs to be available and inclusive for all users.
- Allow users to foster trust in the data and encourage collaboration.
- Enable users to carry out the jobs they are supposed to (don’t force analysts to write complex transformations).
Plan thoroughly; start simply
- An MDS does not need to have all of the components to function.
- Plan for the components that your organization needs at the time to avoid unnecessary costs and complexity.
- Start off simply: a setup that has only ingestion, transformation, and storage is still an effective MDS.
- Expand and add components accordingly.
Find the right partner
- Every organization is different, which means there is no one-size-fits-all solution. You can’t just adopt the same setup as another organization and expect it to work.
- As such, don’t be shy to reach out to vendors to help you design an MDS that is right for your organization. You can even request a demo.
- There are many companies specializing in helping organizations architect and set up a modern data stack that is right for their context, be it start-ups or large corporates. Speak to people, read what’s out there, and join the array of Slack communities.
Where can I find more information about MDSs?
The landscape is expanding rapidly and evolving by the day! Here are some resources to get up-to-speed with the Modern Data Stack:
Right after World War II, car manufacturers struggled to keep production costs down and experienced many obstacles during the production process, hurting their profits. Later, Toyota created the Just in Time (JIT) production system which eliminated most of the problems and created efficiency without compromising quality. It wasn’t long before other manufacturers realized the benefits and adopted a similar approach.
Back in the tech industry, organizations are realizing that data is becoming increasingly complex and that their traditional data stacks simply cannot cope. A modern data stack is a solution that can help an organization save time, effort, and money. It is faster, more scalable, and more accessible than the traditional data stack. The MDS also helps an organization transition into a modern and data-driven organization, which is critical for creating business solutions. In this day and age, no organization can remain competitive without actionable data.
These benefits alone are enough of a reason for any organization to seriously re-evaluate its current systems. However, it’s important to not get caught up with the tech buzz and modernize for the sake of modernizing. To truly benefit from an MDS, careful planning is needed to implement a good user experience. Design a good MDS, let your employees do their jobs, and the payoff will be invaluable.