Data Science is mainstream, and many organizations in the most competitive markets are beating the competition by being more data-driven.
An increasing number of new software projects involve aspects of data science. This creates a problem. Without standardised methodologies for managing data science projects, teams often rely on ad hoc practices that are not repeatable, not sustainable, and unorganized.
Such teams suffer from low project maturity without continuous improvements, well-defined processes and check-points, or frequent feedback (Saltz, 2015).
In software engineering, we have something called CMM (Capability Maturity Model). It’s an improvement approach/methodology that you can use to develop and refine your process for building software.
CMM has 5 process maturity levels:
- Quantitatively managed
So when I mentioned ad-hoc practices, they’re at the bottom level of CMM – Initial. Lots of unknowns, poor control, and not enough planning, which often leads to disasters.
When to use ad-hoc processes in DS Management?
Loose, ad-hoc processes have a valid place in data science. When you’re starting a project, loose processes give you the freedom to decide how to tackle each problem as it comes up. Which naturally means that there’s a lot of trial and error involved.
Ad-hoc processes might be most appropriate for one-off projects coordinated by individuals and small teams (Saltz & Shamshurin, 2016).
Note: I’ll be using the terms “methodology” and “approach” interchangeably.
Using this loose approach, you can start working on a project quickly, with minimal administrative overhead, and no need to comply with multiple specific procedures.
However, data science is no longer siloed away. It has evolved into a team effort, involving professionals with diverse skill sets beyond data science (Spoelstra, Zhang, & Kumar, 2016).
This calls for standardized project management. Let’s see how mature software project management methodologies can be used for data science. We’ll take a look at:
- Waterfall methodology
- Does waterfall methodology even work for data science project management?
- Agile methodology
- Should we use Agile for data science project management?
- Hybrid methodology
- When should we use a Hybrid approach?
- R&D methodology
- Should we use the R&D approach for data science project management?
Introduced by Winston Royce in 1970, Waterfall is the oldest paradigm for software engineering. Waterfall is a sequential model, divided into pre-defined phases.
Each phase must be completed before the next can begin. There’s no overlap between phases. These phases include: scope, requirements, design, build, test, deployment and maintenance.
Waterfall starts with an initial phase and cascades sequentially in a straight line toward the final phase. You can’t revisit previous phases, and if you need to, it means you planned poorly.
This approach requires significant project governance including reporting, risk management, handoffs and documentation.
“Although the original Waterfall model proposed by W. Royce made provision for ‘feedback loops’, the vast majority of organizations that apply this approach treat it as if it were strictly linear.” (R. Pressman)
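To make the "strictly linear" reading concrete, here is a minimal Python sketch of a project tracker that only allows moving to the next phase in order. The `Phase` and `WaterfallProject` names are hypothetical and exist only for illustration. The phase list is the one given above.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Waterfall phases, in the order listed in the text."""
    SCOPE = 1
    REQUIREMENTS = 2
    DESIGN = 3
    BUILD = 4
    TEST = 5
    DEPLOYMENT = 6
    MAINTENANCE = 7


class WaterfallProject:
    """Hypothetical tracker that enforces strictly sequential phases."""

    def __init__(self) -> None:
        self.current = Phase.SCOPE

    def move_to(self, phase: Phase) -> None:
        # Only the immediate next phase is reachable: no skipping, no going back.
        if phase != self.current + 1:
            raise ValueError(
                f"cannot move from {self.current.name} to {phase.name}: "
                "phases must run in order"
            )
        self.current = phase


project = WaterfallProject()
project.move_to(Phase.REQUIREMENTS)
project.move_to(Phase.DESIGN)

try:
    project.move_to(Phase.REQUIREMENTS)  # trying to revisit an earlier phase
except ValueError as err:
    print(err)
```

The last call fails on purpose: in a strictly linear Waterfall there is no path back to an earlier phase, which is exactly what makes the model a poor fit for experimental work.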
When to use the Waterfall model?
You should use it only when the requirements are very clear and fixed, the product definition is stable, your team knows the technology stack through and through, and the project is short.
Waterfall functions best when the project stages, from communication to deployment, progress in a linear fashion. This often happens when an existing system needs a well-defined improvement, such as adapting it to work with a facial recognition system that has become necessary.
Does Waterfall methodology even work for Data Science Project Management?
No, it doesn’t. Waterfall is most effective when the technology is well-understood, the product market is stable, and requirements are not likely to change during the course of the project (Pressman & Maxim, 2015).
That is the exact opposite of how data science projects work. In data science, there’s a lot of experimentation, requirements change frequently, and much of the tech is still novel.
However, there are some things we can still take away from this approach. For example, the well-structured nature of Waterfall works well for certain phases of a data science project, such as scope, planning, resource management, or validation.
- Scope/Contracts – the typical large-scale data science project still requires control and risk management, especially around consumer data rights, business case measurement, contract, and payment management where third parties are involved.
- Organisational planning and culture – large organisations have enterprise-wide approaches to project management, which tend to be waterfall in nature. Data science projects may have to dovetail into these approaches. These can include RAID management (Risks, Assumptions, Issues and Dependencies), programme governance reporting and budget tracking.
- Organisational resource management – highlighting dependencies and bottlenecks is key to keeping on track with project milestones.
We’ll talk a bit more about this when we reach Hybrid methodologies. For now, let’s move on to Agile methodologies.
Agile methodology was founded on 4 core values and 12 principles (Agile manifesto). This approach is all about iterating and testing throughout the software development life-cycle.
Development and testing activities happen at the same time, which would never happen in the Waterfall model.
When to use Agile methodologies?
Agile was created as a response to the fast-paced, ever-changing IT industry. Requirements for software change all the time, so developers need to be able to adapt quickly.
If you’re doing any software development right now, you’re most likely using Agile. It’s the de facto standard, and it has grown beyond software projects. Business, marketing, design, and other teams in many organizations use Agile project management.
Should we use Agile for Data Science Project Management?
Agile is perfect for data science, and here are some reasons why:
1. Data scientists and business owners agree on high-level requirements (the project backlog) early in the development lifecycle, and keep reviewing them as the project grows.
The business owner can review the model/solution, and make decisions and changes throughout the development process.
Gathering and documenting detailed requirements in a meaningful way is often the hardest part of data science projects. The business owner may not have a detailed view of the necessary data quality, the precise business outcome to be modelled, or how the model outputs will be integrated with decision systems.
2. Agile data science produces evolving models and releases that are very user-focused, thanks to frequent reviews and direction from the business owner.
3. The business owner gains a strong sense of ownership by working directly with the development team throughout the project.
Now that we have seen that agile and data science are a good fit, let’s look at different Agile frameworks – their benefits, tools, pros & cons, and best practices.
Scrum is an Agile collaboration framework that can make your team more flexible. It makes it easier to build complex software applications, by giving you easy solutions for complicated tasks.
Scrum is all about using a set of “software process patterns” that have proven effective for projects with tight timelines, changing requirements, and business criticality. Each of these process patterns defines a set of development activities.
Foundational Scrum Concepts
Scrum divides the larger project (the product backlog) into a series of mini-projects, each with a consistent and fixed length, from one week to one month. Each mini-project cycle, called a sprint, kicks off at a meeting called sprint planning, where the product owner defines and explains the top feature priorities.
The development team estimates what they can deliver by the end of the sprint, and then makes a sprint plan to develop these deliverables. During the sprint, they coordinate closely and develop daily plans at daily standups (up to 15 minutes long). At the end of a sprint, the team demonstrates the deliverables to stakeholders, and gets feedback during sprint review.
These deliverables should be potentially releasable, and meet the agreed upon definition of done. To close a sprint, there’s a sprint retrospective, where the team members plan how to improve their work.
Should we use Scrum for Data Science Project Management?
Scrum helps your team collaborate and deliver incremental value. But it’s not always easy. One challenge is that defining fixed-length sprints might be difficult in a data science context. It’s hard to estimate how long a task will take.
Data science teams often want to have sprints of varying duration, but this is not possible when using Scrum.
Due to these challenges, some teams use Data-Driven Scrum (we’ll see it in the Hybrid methodologies section). DDS has some of the key concepts of Scrum, but also addresses the key challenges of using Scrum in a data science context. Another good alternative is Kanban (sometimes used together with Scrum).
Best Scrum practices
- Prioritize “Spikes”: To allow for research and discovery, teams could create spikes – items in the backlog that provide time for intense research. These spikes sit alongside product increment ideas in the backlog. When brought into the sprint, they’re considered “done” when the specified research objectives are met, or the time limit expires (Mezick, 2017).
- Divide Work “Smaller” and Conquer: Data scientists frequently complain that their work is too ambiguous to estimate how much effort a task may need. A possible solution is to divide the work increments into smaller pieces that are definable and estimable (Mezick, 2017).
- Shorten Sprints: Daniel Mezick says that “the closer something flirts to chaos, the way to figure out what to do is through frequent inspection.” He recommends dividing sprint cycles into shorter time periods to force these more frequent inspections (Mezick, 2017).
- Occasionally Relax “Definition of Done”: To avoid full-blown testing for exploratory work and proofs of concept, where delivery speed may be more important than being “done”, teams could agree to relax the definition of done for certain features. However, teams should ensure that they don’t fall down a slippery slope of accepting lower-quality output for core deliverables.
- Renegotiate Work during the Sprint: Contrary to popular misconception, the sprint plan is not written in stone at sprint planning. Rather, “Scope may be clarified and re-negotiated between the Product Owner and Development Team as more is learned” if it doesn’t “endanger the Sprint Goal” (Sutherland & Schwaber, 2017).
- Build an “Architectural Runway”: Teams that need a new larger-scale architecture may find careful upfront planning more effective than allowing architecture to develop through emergent design. Diverging from the customer-centric focus of Scrum, data science teams may take a concept from the Scaled Agile Framework (SAFe) and dedicate some initial sprints to develop an architecture for themselves (Scaled Agile, Inc, 2017).
- Do the Least Understood Work First: To reduce risk, a data science team can focus early development cycles on exploratory work and proofs of concept to get familiar with the data. Akred recommends that his teams “Frontload the parts you don’t understand and then as we get confident in our ability we start building the things around it that actually turn that thing into a usable system” (Akred, 2015). If the team is unable to prove feasibility within a reasonable number of cycles, it can re-focus on other work and avoid unnecessary losses.
- Integrate with CRISP-DM: To address the data science process, CRISP-DM can be integrated with Scrum to manage data science projects (more on this in the Hybrid section).
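The "spike" practice from the list above has a simple completion rule: a spike is done when its research objectives are met or its time limit expires. Here is a rough Python sketch of that rule. The `Spike` class, its field names, and the example objectives are all made up for illustration and don't come from any Scrum tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Spike:
    """Hypothetical backlog item that reserves time-boxed research effort."""
    title: str
    objectives: set[str]
    time_limit_hours: float
    met: set[str] = field(default_factory=set)
    hours_spent: float = 0.0

    def log_work(self, hours: float, objectives_met: tuple[str, ...] = ()) -> None:
        """Record time spent and any research objectives answered."""
        self.hours_spent += hours
        # Ignore anything that was never an objective of this spike.
        self.met |= set(objectives_met) & self.objectives

    @property
    def done(self) -> bool:
        """Done when every objective is met OR the time box has run out."""
        objectives_met = self.met == self.objectives
        time_expired = self.hours_spent >= self.time_limit_hours
        return objectives_met or time_expired


spike = Spike(
    title="Is churn predictable from support logs?",
    objectives={"data audit", "baseline model"},
    time_limit_hours=8,
)
spike.log_work(3, ("data audit",))
print(spike.done)  # one objective still open, time box not yet used up
spike.log_work(5)
print(spike.done)  # time box used up, so the spike closes either way
```

The time limit is what keeps open-ended research from swallowing the sprint: the spike closes even if the question is still unanswered, and the team decides what to do next at the sprint review.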
Tools for Scrum
The most popular tools used for Scrum in 2020 were:
Pros of Scrum
- Customer Focus: Scrum focuses on delivering customer value.
- Regularity: Rigid time boundaries make teams used to a regular workflow.
- Autonomy: By providing teams with broad autonomy to self-govern, they become happier, more productive, and more engaged.
- Improvement Through Inspection: Teams are called to constantly inspect themselves, especially during retrospectives at the end of each sprint. This can help teams accelerate their performance through each development cycle by learning and adapting from previous cycles.
- Empirical Evidence: Like data science, Scrum is founded on the principle of execution based on what is known. “Empiricism asserts that knowledge comes from experience and making decisions based on what is known. Scrum employs an iterative, incremental approach to optimize predictability and control risk”.
Cons of Scrum
- Time Boxing Challenges: Although it provides numerous benefits, the time boxing feature of Scrum is controversial, especially for data science teams. The time required to implement a solution for most data science problems is ambiguous.
- “Potentially Releasable Increment” Challenges: This challenge is particularly daunting for data science teams, and might not even be necessary. Much of the data science process, especially in the early exploratory phases, may not be intended for release outside the data science team. Moreover, testing requirements may extend beyond what’s reasonable in a sprint, especially if a high degree of accuracy is needed (Jennings, 2017).
- Meeting Overhead: Scrum meetings may require up to four hours per week, plus additional time for backlog refinement and backlog management (Sutherland & Schwaber, 2017). These meetings might be viewed as overhead that should be avoided (Beverly, 2017).
- Difficult to Master: In a controlled experiment, student teams using Scrum in a data science project performed badly, largely because of their inability to understand the methodology and to set up clear sprints (Saltz, Shamshurin, & Crowston, 2017). Clearly, there’s quite a steep learning curve to Scrum.
This is a tough one. It made the list, but it’s not exactly a project management approach.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) was originally designed for data mining. It is an “approach” with six phases that naturally describe the data science lifecycle, and it provides anyone, from novices to experts, with a complete blueprint for conducting a data science/mining project.
CRISP-DM: phases & tasks
- Business understanding: Exploring project objectives and requirements from a business perspective, converting this knowledge into a data science problem definition, and designing a preliminary plan to achieve the objectives.
- Data understanding: Initial data collection, followed by EDA to get familiar with the data, identify data quality problems, discover first insights, and detect interesting subsets that can be used to form hypotheses for a quick experiment.
- Data preparation: We all know that this phase is usually the most time-consuming and boring. It’s no fun having to clean and reformat the data by manually editing a spreadsheet, or writing custom code. But this phase covers all activities to construct the final dataset from the initial raw data. The clean data can help identify outliers, anomalies and patterns that can be usable in the next steps.
- Modeling: This is the core activity of a data science project. It requires writing, training, testing and refining models to analyse the data and derive meaningful business insights. Diverse ML techniques are applied to the data to identify the ML model that best fits the business needs.
- Evaluation: In the previous stage, we ran multiple experiments with different algorithms. Now we get to select the best one that fits the business needs and offers the best accuracy-to-performance ratio.
- Deployment: In this phase, we usually deploy our ML models into a pre-production or test environment first. Then, once we’re satisfied with the live tests, we deploy them to production and move into MLOps (read more about MLOps in this article).
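The six phases above aren't a one-way street. In the usual CRISP-DM reference diagram, some phases loop back to earlier ones. Here is a small Python sketch of the phases and the transitions between them. The `CrispPhase` enum, the `ALLOWED` map and the `is_valid_path` helper are illustrative names of my own, and the transition map follows the arrows as they are commonly drawn in that diagram, not anything prescribed in this article.

```python
from enum import Enum


class CrispPhase(Enum):
    BUSINESS_UNDERSTANDING = "Business understanding"
    DATA_UNDERSTANDING = "Data understanding"
    DATA_PREPARATION = "Data preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


P = CrispPhase

# Assumed transition map, following the arrows of the usual CRISP-DM diagram.
ALLOWED = {
    P.BUSINESS_UNDERSTANDING: {P.DATA_UNDERSTANDING},
    P.DATA_UNDERSTANDING: {P.BUSINESS_UNDERSTANDING, P.DATA_PREPARATION},
    P.DATA_PREPARATION: {P.MODELING},
    P.MODELING: {P.DATA_PREPARATION, P.EVALUATION},
    P.EVALUATION: {P.BUSINESS_UNDERSTANDING, P.DEPLOYMENT},
    P.DEPLOYMENT: set(),
}


def is_valid_path(path: list[CrispPhase]) -> bool:
    """Check that every consecutive pair of phases is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


# A project that goes back from modeling to data preparation before evaluating.
iterative = [
    P.BUSINESS_UNDERSTANDING, P.DATA_UNDERSTANDING, P.DATA_PREPARATION,
    P.MODELING, P.DATA_PREPARATION, P.MODELING, P.EVALUATION, P.DEPLOYMENT,
]
print(is_valid_path(iterative))

# Jumping straight from business understanding to modeling skips the data work.
print(is_valid_path([P.BUSINESS_UNDERSTANDING, P.MODELING]))
```

Written this way, the difference from Waterfall is easy to see: going back from modeling to data preparation, or from evaluation to business understanding, is part of the process and not a sign of poor planning.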
Should we use CRISP-DM for Data Science Project Management?
Let’s look at some data. According to a 2014 poll by KDnuggets, the two main methodologies used for analytics, data mining, or data science projects were:
- CRISP-DM – 43%
- My own (hybrid) – 27.5%
CRISP-DM was the most popular methodology in 2014 and in the previous years, going back as far as 12 years.
Funnily enough, in another poll by datascience-pm.com from August and September 2020, CRISP-DM was still the undefeated champion of the data science community.
That’s it for the data. I’ll also give you my opinion as a data scientist.
I think CRISP is a great starting point if you’re forming a new data science division/team. The different phases naturally describe the data science process, which makes it easier to adopt.
It also encourages interoperable tooling. You can create a completely custom solution for your project by leveraging many different microservices, instead of going for an end-to-end monolith tool – you avoid SPOF (Single Point of Failure), and if a tool fails, you simply plug in a new one.
Although well-designed and having stood the test of time, CRISP-DM fails to cover modern parts of the data science process, which can cause problems. But that’s nothing to worry about, because we can overcome these problems using the best practices below.
Best CRISP-DM practices
Here are five tips to overcome the weaknesses of CRISP-DM:
- Iterate quickly: Don’t fall into a Waterfall trap by working thoroughly across layers of the project. It’s better to think and work asynchronously. Focus on iteratively providing more value. Your first deliverable might not be too useful. That’s okay. Iterate.
- Document enough…but not too much: If you follow CRISP-DM precisely, you might spend more time documenting than doing anything else. Do what’s reasonable and appropriate, but don’t go overboard.
- Don’t forget modern technology: In your project plan, add steps to leverage cloud architectures and modern software practices like git version control and CI/CD pipelines.
- Set expectations: CRISP-DM lacks communication strategies with stakeholders. So be sure to set expectations and communicate with them frequently.
- Combine with a true project management approach: As a more generalized statement from the previous bullet, CRISP-DM is not truly a project management approach. So combine it with a data science coordination framework. Popular Agile approaches include:
- Data-Driven Scrum
Tools for CRISP-DM
As we discussed earlier, this approach encourages interoperable tools across the entire data mining process (the same for data science), so CRISP-DM is tool neutral.
Pros of CRISP-DM
- Generalizable: Although designed for data mining, William Vorhies, one of the creators of CRISP-DM, argues that because all data mining/science projects start with business understanding, have data that must be gathered and cleaned, and apply data science algorithms, “CRISP-DM provides strong guidance for even the most advanced of today’s data science activities” (Vorhies, 2016).
- Common Sense: In one study, teams which were trained and explicitly told to implement CRISP-DM performed better than teams using other approaches (Saltz, Shamshurin, & Crowston, 2017).
- Easy to Use: Like Kanban, CRISP-DM can be implemented without much training, organizational role changes, or controversy.
- Right Start: The initial focus on Business Understanding is helpful to align technical work with business needs, and to steer data scientists away from jumping into a problem without properly understanding business objectives.
- Strong Finish: Its final step, Deployment, addresses important considerations to close out the project, and simplifies the transition to maintenance and operations.
- Flexible: A loose CRISP-DM implementation can be flexible to provide many of the benefits of Agile principles and practices. By accepting that a project starts with significant unknowns, the user can cycle through steps, each time gaining a deeper understanding of the data and the problem. The empirical knowledge learned from previous cycles can then feed into the following cycles.
Cons of CRISP-DM
- Documentation Heavy: Nearly every task has a documentation step.
- Not Modern: Counter to Vorhies’ argument for the continued relevance of CRISP-DM, others argue that CRISP-DM, as a process that predates big data, “might not be suitable for Big Data projects due its four V’s – volume, variety, velocity and veracity” (Saltz & Shamshurin, 2016).
- Not a Project Management Approach: Perhaps most significantly, CRISP-DM is not a true project management methodology because it implicitly assumes that its user is a single person or small, tight-knit team and ignores the teamwork coordination necessary for larger projects (Saltz, Shamshurin, & Connors, 2017).
Kanban (literally “billboard” in Japanese) started as a supply chain and inventory control system for Toyota manufacturing in the 1940s. It minimized work in progress, and matched the supply of automotive parts with the demand. Other industries, including software, have since adopted Kanban because of how useful it is.
Kanban starts with a list of potential features or tasks, similar to the backlog concept of Scrum. They’re placed in the initial To Do column of a Kanban board, which is a visual representation of the workflow.
In a simple three-column (or three-bin) Kanban board, when the team decides to start working on a task, the Kanban card (sticky note) is moved from the To Do column to the Doing column. When the team completes the task, the card is moved to the Done column.
Kanban boards often have additional columns. For example, data science teams might split data work into three columns: data gathering, analysis, and transformation/preparation.
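The board mechanics described above fit in a few lines of Python. This is only a sketch: the `KanbanBoard` class and its method names are made up for illustration, and the per-column WIP (work in progress) limit is my assumption about how a team might enforce the "minimize work in progress" idea.

```python
from __future__ import annotations


class KanbanBoard:
    """Hypothetical in-memory Kanban board with optional per-column WIP limits."""

    def __init__(self, columns: list[str], wip_limits: dict[str, int] | None = None) -> None:
        self.columns: dict[str, list[str]] = {name: [] for name in columns}
        self.wip_limits = wip_limits or {}
        self.first_column = columns[0]

    def add(self, card: str) -> None:
        """New cards always start in the first column (e.g. To Do)."""
        self.columns[self.first_column].append(card)

    def move(self, card: str, src: str, dst: str) -> None:
        """Move a card between columns, refusing if dst is at its WIP limit."""
        if card not in self.columns[src]:
            raise ValueError(f"{card!r} is not in column {src!r}")
        limit = self.wip_limits.get(dst)
        if limit is not None and len(self.columns[dst]) >= limit:
            raise RuntimeError(f"WIP limit of {limit} reached in {dst!r}")
        self.columns[src].remove(card)
        self.columns[dst].append(card)


board = KanbanBoard(["To Do", "Doing", "Done"], wip_limits={"Doing": 1})
board.add("clean customer data")
board.add("train baseline model")
board.move("clean customer data", "To Do", "Doing")

try:
    board.move("train baseline model", "To Do", "Doing")
except RuntimeError as err:
    print(err)  # Doing is full, so the first task has to be finished first

board.move("clean customer data", "Doing", "Done")
board.move("train baseline model", "To Do", "Doing")
print(board.columns)
```

Notice that nothing here involves a deadline or a sprint. The only constraint is how much work can be in flight at once, which is the main difference from Scrum.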
Should we use Kanban for Data Science Project Management?
Kanban can be quite effective for data science. And if you recall the 2020 poll by datascience-pm.com from the previous section, Kanban was the 3rd most popular approach.
Kanban’s flexible processes provide data scientists with greater flexibility to execute their work, without having to hit constant deadlines – on a Kanban team, there are no required time boxes.
Like other Agile approaches, work is divided into small increments, which allows for rapid iterations and continuous delivery. Kanban provides some structure, which is more than what a lot of data science teams currently have (Saltz, Shamshurin & Crowston, 2017).
Kanban best practices
Kanban is only part of the solution for a data science project management approach. Teams that use Kanban need other processes in place, for example by combining it with another framework (e.g. Scrum or CRISP-DM) that encourages effective customer interaction.
According to this article about the best Kanban tools of 2020, here are the top 5:
- Task world
Pros of Kanban
- Highly Visual: The highly visual and simple nature of Kanban boards makes them very effective at quickly communicating work in progress to team members and stakeholders (Brechner, 2015).
- Very Flexible: By pulling in work items one at a time as opposed to Scrum’s batch cycle approach, Kanban provides teams with greater flexibility to shift gears (Rigby, Sutherland, & Takeuchi, 2016). Conflicts about not being able to complete items by the sprint deadline are avoided.
- Lightweight and Adaptable: By not prescribing time boxes, roles, and meetings (Rigby, Sutherland, & Takeuchi, 2016), Kanban has significantly lower management overhead than both Waterfall and Scrum. It gives teams the freedom to adopt their own additional processes.
- Avoids Culture Clash: As a very simple system that does not redefine team roles, Kanban is met with less cultural and organizational resistance than Scrum (Rigby, Sutherland, & Takeuchi, 2016). Kanban is not an invasive process; rather, teams can adopt it seamlessly without an awkward shift to a radically different system.
- Better Coordination: The simplicity, visual nature, lightweight, and flexible structure without stressful deadlines might make Kanban more conducive to teamwork than other approaches.
- Minimize Work in Progress: The WIP (Work In Progress) limits can increase total throughput and reduce investments in uncompleted work because they prevent too much WIP from piling up in a given process (Brechner, 2015).
Cons of Kanban
- Customer Interaction Undefined: As an inward-facing process, Kanban does not directly prescribe outward-facing processes for rich and frequent customer feedback loops. Customers may not feel as committed to the process without the structured cadence of sprint reviews (Akred, 2015).
- Lack of Deadlines: Relative to Scrum, without the motivation to hit the constantly looming deadlines, teams may work on certain tasks for an excessive length of time. A higher level of team discipline is needed to ensure that tasks do not slide longer than needed.
- Kanban Column Definition: How do you define the columns for a data science Kanban board? There doesn’t seem to be a good answer. To set up a data science-specific board, you would either need to use a generic board or attempt to create a board that encompasses all steps of a data science process.
Why not Hybrid models?
Since there isn’t a one-size-fits-all model, teams often create their own hybrid model. A Hybrid model is the combination of two or more methodologies, modified to fit the unique business environment.
When should we use Hybrid?
Hybrid can be used for small, medium and large projects. It’s an effective solution when product delivery relies on both hardware and software operations. But there is another reason to choose Hybrid:
It’s not rare for a customer to be dissatisfied with an unspecified timeframe and budget, or with a lack of planning. Such uncertainty is typical for Agile. In this case, planning, requirements specification, and application design can be done in Waterfall, while Agile is used for software development and testing.
Now, although creating a Hybrid model can be challenging, it can provide great value since you can cherry pick the best qualities from existing models and mix and match them.
I know what you’re thinking, can I just pick the best qualities and that’s it? Of course not, there’s a catch. I’m gonna reveal it in the next section.
Now let’s take a look at some examples. There are two common hybrid approaches for data science:
The Bimodal approach, also known as Waterfall-Agile, combines elements of Waterfall and Scrum. It’s challenging, but there are specific circumstances, like projects involving medical equipment data processing, where such an approach works best.
What does Bimodal mean? There are two modes for developing and releasing software. Mode 1 is rigid and predictable, but safe. Mode 2 is agile and fast, but risky.
- Mode 1 is traditional; thus, it works perfectly in well-understood and predictable areas. According to Gartner, it focuses on exploiting what is known while transforming the legacy environment into a state fit for a digital world.
- Mode 2 involves rapid application development. It is exploratory, nonlinear, and optimized for solving new problems. Mode 2 is especially useful for working on projects that need to be finished as quickly as possible.
Can we use Bimodal for Data Science Project Management?
There are mixed feelings about combining Agile and non-Agile approaches. But Eric Stolterman, Senior Executive Associate Dean at Indiana University, believes this combination is good. He says that crystalline processes such as Waterfall and liquid processes such as Scrum should co-exist.
Mark Schiffman, Senior IT Project Manager at KSM Consulting, agrees. He describes an ideal data science project management approach as something “like Scrum with a Waterfall wrapper around it”. He suggests that the Waterfall view is better for customer-focused aspects, such as requirements gathering to “keep them one step ahead of development” and that the Scrum view is better for development (Schiffman, 2017).
Carol Choksy, Associate Chair of Information and Library Science at Indiana University, similarly believes that data projects require more intense upfront planning, which is best handled using traditional planning principles from the PMI Project Management Body of Knowledge. Once the initial requirements are scoped, Agile is appropriate (Choksy, 2017).
It’s tempting to pick and choose only the best aspects from various models to form your own, but such Hybrid approaches have been criticized for failing to provide both the structural benefits of Waterfall and the flexibility benefits of Agile (Sutherland, 2014). Agile die-hards (also known as Agilists) label such approaches negatively, calling them “wagile”, “fragile” or “scrumfall” to discredit their use. So what are the best practices you can use to make it work?
Best practices for Bimodal approaches
Organizations struggling with bimodal projects should adopt a more outcome-centered approach to project management, according to industry analysts from Gartner. This will help them manage the different requirements of “slow” (Mode 1) and “fast” (Mode 2) IT most effectively.
- Use A Simple Approach To Determine Which Mode Makes Sense:
Determining which mode to use on a project often has more to do with company culture than anything else.
- Define The Intended Business Outcome As The Measure Of Success:
It’s easy to determine the business outcome when you use Mode 1, since everything is well understood and predictable. That’s not the case for Mode 2, which happens to be exactly how data science projects work. Therefore, it’s important to get clear on what success should look like before starting the project.
- Clearly Separate Portfolio And Project Governance:
The final area where many organizations struggle is maintaining a consistent approach toward governing the project after it has begun.
To better understand this, let’s break down portfolio governance and project governance.
Project governance is the set of policies, regulations, functions, processes, procedures and responsibilities that define the establishment, management and control of projects, programmes or portfolios.
Portfolio governance aims to answer the question of how organizations should oversee portfolio management, and is mainly concerned with areas related to portfolio activities. Effective portfolio governance ensures that a project portfolio is aligned to an organization’s objectives, is sustainable, and can be delivered efficiently.
Portfolio governance also guides investment analysis to:
- Identify threats and opportunities,
- Assess change, impacts and dependencies,
- Achieve performance targets,
- Select, schedule and prioritize activities.
Tools for Bimodal
There’s no clear set of tools for Bimodal methodologies. Through experimentation, you need to create a set that works for you. In doing so, you need to consider tools for each of the 2 modes.
- Mode 1 tools: These are BI (Business Intelligence) tools that aren’t great for ad-hoc processes, visualization, or data discovery. They’re generally not Agile, with the time-to-value from data capture, metadata creation and content delivery often measured in months.
- Mode 2 tools: These have more or less everything you need for data analysis contained in a single desktop environment. They’re Agile, flexible and focus on visual presentations that are beautiful and intuitive.
The problem is that Mode 2 tools can’t do everything. Things like pixel perfect reports, complex modeling, and bursting are not in their wheelhouse—nor their roadmaps.
These insufficiencies leave organizations with the conundrum of either sacrificing important functionality or striking a balance by adopting a Bimodal approach.
Pros of Bimodal
There are plenty of benefits to combining two IT modes:
- Speed. By defining and managing one IT area to focus on delivering new solutions, organizations can deliver rapidly to meet business needs.
- Innovation. Because Mode 2 teams aren’t focused on maintaining security and handling daily issues, they can stay focused on wider problems that require innovation to solve.
- Agility. The goal for many organizations is to disrupt a certain industry – by defining which parts of IT focus on these disruptions, they can get there faster. Those in Mode 2 IT become adept at Agile practices, so there’s less risk and overhead, and the effort is smoother as time goes on.
- Reduces “Shadow IT”. When users get the solutions they need quickly, they are much less likely to use unauthorized or unproven applications and software – they aren’t bypassing IT.
Cons of Bimodal
- The separation can be divisive. By explicitly separating people into Mode 1 and Mode 2 groups, teams may battle for attention, resources, power, and influence. This can create a mentality of “us vs. them” within the larger IT sphere.
- The separation can be too neat. Defining two IT modes in this way can make it seem that the modes won’t, or shouldn’t, rely on each other. For many enterprises, the reality is that an innovative, well-functioning application or software solution, the goal of Mode 2, often relies on well-oiled legacy systems that are inherent in Mode 1.
- The separation can be confusing. Splitting teams simply for the sake of “innovation” often leads to confusion about roles and processes. This confusion can manifest as resistance to change, which is common when employees are told about changes that don’t make sense to them.
- The separation doesn’t guarantee innovation. Simply defining one team as innovative doesn’t mean it will just happen – if it did, everyone would be innovators. In fact, some enterprises find that innovation comes from the blending of skills and tools, not from intentionally drawn lines.
If you combined Scrum and CRISP-DM, you would get something that looks like Microsoft’s Team Data Science Process. Launched in 2016, TDSP is “an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently.” (Microsoft, 2020).
TDSP’s project lifecycle is similar to CRISP-DM’s and includes five iterative stages:
- Business Understanding: define objectives and identify data sources
- Data Acquisition and Understanding: ingest data and determine if it can answer the question (effectively combines Data Understanding and Data Preparation from CRISP-DM)
- Modeling: feature engineering and model training (combines Modeling and Evaluation)
- Deployment: deploy into a production environment
- Customer Acceptance: customer validation of whether the system meets business needs (a phase not explicitly covered by CRISP-DM)
Microsoft explains that “TDSP helps improve team collaboration and learning.” So now, one question is left.
Should we use Microsoft TDSP for Data Science Project Management?
Yeah! By combining modern practices with the data science life cycle, TDSP is very comprehensive and made of four major components:
- A data science life cycle definition,
- A standardized project structure,
- Recommended infrastructure and resources,
- Recommended tools and utilities.
It comes as no surprise that this approach often leverages Microsoft Azure, but it doesn’t have to. You can have a different tech stack and still use TDSP.
The TDSP lifecycle is modeled as a sequence of iterated steps that provide guidance on the tasks needed to use predictive models.
You deploy the predictive models in the production environment that you plan to use to build your intelligent application. The goal of this process life cycle is to continue to move a data-science project toward a clear engagement end point.
Data science is an exercise in research and discovery. If you couple that with an approach that gives you a well-defined set of artifacts to streamline communication, you can avoid a lot of misunderstandings.
For each stage of TDSP, Microsoft provides the following information:
- Goals: Specific objectives.
- How to do it: Outline of the specific tasks, and guidance on how to complete them.
- Artifacts: The deliverables and the support to produce them.
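To make the three items above more concrete, a team could keep a small checklist per stage. This is only a minimal sketch and not part of TDSP itself: the `Stage` class, the `missing_artifacts` helper, and the example tasks and artifact names are hypothetical. Only the stage name and its goals come from the lifecycle listed earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One lifecycle stage with its goals, tasks, and expected artifacts."""
    name: str
    goals: list[str]
    tasks: list[str]
    artifacts: list[str]
    delivered: set[str] = field(default_factory=set)

    def missing_artifacts(self) -> list[str]:
        # Artifacts that are expected for this stage but not yet delivered.
        return [a for a in self.artifacts if a not in self.delivered]


# Hypothetical example content; adapt the tasks and artifacts to your project.
business_understanding = Stage(
    name="Business Understanding",
    goals=["Define objectives", "Identify data sources"],
    tasks=["Interview stakeholders", "List candidate data sets"],
    artifacts=["project charter", "data source list"],
)
business_understanding.delivered.add("project charter")

print(business_understanding.missing_artifacts())
```

A checklist like this gives everyone the same view of what a stage still owes before the project moves on, which is the communication benefit described above.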
TDSP best practices
Researching TDSP best practices, I didn’t find much. Later on, I realised that TDSP includes best practices and structures from Microsoft and other industry leaders right out-of-the-box, and this helps teams move toward successful implementation of their data science initiatives. The goal is to help companies fully realize the benefits of their analytics program.
Note: For a detailed description of TDSP best practices, check the reference section at the end of this article. Under “Microsoft TDSP” you will find all my sources.
TDSP is a modern approach, so tools for it include the latest and greatest. Let’s take a look!
Tools for TDSP
TDSP recommends infrastructure, resources, tools, and utilities that leverage modern cloud-based systems and practices like:
- GitHub: Increase collaboration, automate your code-to-cloud workflows and help secure your code with advanced capabilities.
- Azure Pipelines: Implement CI/CD to continuously build, test and deploy to any platform and any cloud.
- Azure Boards: Plan, track and discuss work across your teams using Kanban boards, backlogs, team dashboards and custom reporting.
- Azure Monitor: Get full observability into your applications, infrastructure and network.
- Visual Studio: Use the integrated development environment (IDE) designed for creating powerful, scalable applications for Azure.
- Azure Kubernetes Service (AKS): Ship containerised apps faster and operate them more easily using a fully managed Kubernetes service.
Pros of TDSP
- Comprehensive: More than just a process – complete with re-usable templates on GitHub, role definitions, and more.
- Optional Inclusion of Scrum: TDSP can, optionally, be used in conjunction with Scrum (where a sprint goes through all the phases).
- Maintained: Microsoft seems to update its guide and repository every few months.
Cons of TDSP
- Some Inconsistency: Microsoft seems to sometimes forget to extend some of its updates to all of its documentation.
- Steep learning curve: Some teams find the comprehensive framework complicated to learn and with too much structure (e.g. detailed management roles and specific document templates).
- Microsoft Specific Aspects: While much of the framework is independent of the Microsoft technical stack, there are other parts of the framework, especially the infrastructure, that specifically mention Microsoft products.
As a modern approach combining CRISP-DM elements with Agile approaches, Domino’s Data Science Lifecycle is conceptually similar to Microsoft’s TDSP. It has a more extensive process flow, but fewer supporting resources.
Domino is a project framework with six phases:
- Ideation: Defines the problem, scopes the project, and has a go/no go decision point
- Data Acquisition and Prep: Identifies existing data sets, explores the need to acquire new ones, and prepares the data sets for modeling
- Research and Development: Hypothesis testing and modeling
- Validation: Business and technical validation
- Delivery: Deployment, A/B testing, and user acceptance testing
- Monitoring: Systems and model monitoring
This approach seems to have the right ingredients: it takes after CRISP-DM and shares some concepts with TDSP. So it must be a good choice, right? Not necessarily. In the previous section about Bimodal approaches, we saw that “it’s tempting to pick and choose only the best aspects from various models to form your own, but such hybrid approaches have been criticized for failing to provide both the structural benefits of waterfall and the flexibility benefits of agile (Sutherland, 2014).”
This is to say that looks can be deceiving, so we still need to answer the main question below.
Should we use Domino Lifecycle for Data Science Project Management?
Luckily, the answer here is yes!
The Domino Lifecycle is ideal for big teams that prefer to build their own common-sense approach (especially in terms of defining project phases), and aren’t looking for a complex approach.
Adopting and adhering to a single project framework can help address many underlying reasons for failure in data science work, but managers must be careful when scaling if they want the framework to stay useful.
Best practices for Domino Lifecycle
Domino can be used to manage teams of 100+ data scientists and 300+ simultaneous projects. Here is how:
- Measure everything, including yourself. Ironically, data scientists live in the world of measurement yet we rarely measure our own work. Tracking patterns in aggregate workflows helps create modular templates, disseminate best practices from high-performing teams, and guide investment to internal tooling and people to alleviate bottlenecks.
For example, by examining their complete body of work over multiple years, one large tech company realized they only have 15-20 canonical problems to solve and then planned to just apply templates where appropriate. Another organization does quarterly reviews of the aggregate state of hundreds of projects, and realized they were consistently blocked in ETL, so they re-allocated their budget to increase data engineering hires.
- Focus on reducing time to iterate. Many organizations consider model deployment to be a moonshot, when it really should be laps around a racetrack. Minimal obstacles (without sacrificing rigorous validation) to test real results is another great predictor of data science success.
For example, leading tech companies deploy new models in minutes, whereas large financial services companies can take up to 18 months.
- Socialize aggregate portfolio metrics. Even if it’s not precise, it’s critical to socialize the impact of the whole portfolio of data science projects. Doing so addresses data scientists’ concerns about impact, and helps address executive level concerns about investing in data science.
When you communicate with stakeholders, they might be shocked to learn how many projects are actually in progress. It’s better if they know all that’s going on.
Importantly, many successful data science managers don’t claim the credit for themselves, but present it as a collective achievement of all the stakeholders.
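The “measure everything” and “aggregate portfolio metrics” practices above can be sketched in a few lines. This is a minimal illustration under assumptions: the `project_log` records, the project names, and the `blocked_share_by_phase` helper are hypothetical, and the numbers are made up. Only the phase names come from the Domino lifecycle listed earlier.

```python
from collections import defaultdict

# Hypothetical project log: (project, phase, days spent, days blocked).
project_log = [
    ("churn-model", "Data Acquisition and Prep", 30, 12),
    ("churn-model", "Research and Development", 20, 2),
    ("pricing", "Data Acquisition and Prep", 25, 15),
    ("pricing", "Validation", 10, 1),
    ("forecasting", "Research and Development", 40, 5),
]


def blocked_share_by_phase(log):
    """Return the fraction of time spent blocked, per phase, across all projects."""
    spent = defaultdict(int)
    blocked = defaultdict(int)
    for _project, phase, days, days_blocked in log:
        spent[phase] += days
        blocked[phase] += days_blocked
    return {phase: blocked[phase] / spent[phase] for phase in spent}


shares = blocked_share_by_phase(project_log)
# The phase with the largest blocked share is the candidate bottleneck.
bottleneck = max(shares, key=shares.get)
print(bottleneck)
```

With real data, a review like this is how the organization in the example above could spot that its projects were consistently blocked in data preparation.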
Tools for Domino Lifecycle
Like CRISP-DM, this approach also encourages interoperable tools, so it’s tool neutral.
Pros of Domino Lifecycle
- Based on “what works”: Domino built this process based on learnings from over 20 data science teams.
- Broad team definition: The team is not just a technical team but also involves business stakeholders and a product manager. Every relevant stakeholder (technical or business) participates in pre-project ideation.
- Flexible: Practices can be mixed and matched with other approaches.
- No fixed cadence: As Mac Steele, the former Director of Product at Domino observed, data science doesn’t “magically happen in two-week blocks.”
Cons of Domino Lifecycle
- Less comprehensive: Compared to its cousin, TDSP, Domino’s process doesn’t have reproducible templates and detailed definitions.
- Not updated: Domino defined this process in a one-off guide and hasn’t updated it since.
- Team Coordination not defined: While the process suggests doing many iterations (through their phases), it’s not clear how the team should decide what is “in” an iteration, and how to structure the dialog with the business stakeholder.
Data Driven Scrum (DDS)
Scrum tends to be unpopular among data scientists. One key reason is that Scrum defines fixed-time sprints that aim to deliver potentially shippable increments. Unfortunately, this can short-circuit the experimental nature of data science.
On the other hand, DDS (Data Driven Scrum) was designed with data science in mind. Specifically, DDS supports lean iterative exploratory data science analysis, while acknowledging that iterations will vary in length due to the phase of the project (collecting data vs creating a machine learning analysis).
DDS combines concepts from Scrum and Kanban. You might’ve heard about another approach which didn’t make my list, Scrumban. What’s the difference between Scrumban and DDS if they combine the same concepts? Scrumban is more like Kanban within a Scrum Framework. DDS doesn’t implement Scrum’s fixed length sprints, but Scrumban does, which introduces several challenges for data science.
Should we use DDS for Data Science Project Management?
Yes! Data Driven Scrum (DDS), besides being based on a familiar and well known Agile framework, also counters the main problem with fixed-time sprints by removing some stuff from Scrum:
- Iterations are variable-length, capability-focused, and they can overlap.
- Instead of delivering potentially shippable backlog items, the iterations focus on data science native concepts (e.g. experiments, questions to be answered). Each of these items is broken down into tasks to create, observe, and analyze.
- Backlog item selection occurs in a more continuous manner.
DDS is pretty much streamlined Scrum, with a small addition of a Kanban board and WIP (Work In Progress) limits to provide a visual way to communicate workflow and identify bottlenecks.
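To show how WIP limits make bottlenecks visible, here is a minimal sketch of a board that refuses to accept more work in a column than its limit allows. The `KanbanBoard` class, the column names, and the limits are assumptions chosen for illustration; DDS does not prescribe this code or these numbers.

```python
class WipLimitExceeded(Exception):
    """Raised when a column is already at its work-in-progress limit."""


class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits maps column name -> maximum number of items (None = unlimited).
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in wip_limits}

    def add(self, column, item):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column!r} is at its WIP limit of {limit}")
        self.columns[column].append(item)

    def move(self, item, source, target):
        if item not in self.columns[source]:
            raise ValueError(f"{item!r} is not in column {source!r}")
        # Add to the target first so a rejected move leaves the board unchanged.
        self.add(target, item)
        self.columns[source].remove(item)


# Hypothetical columns and limits for a small team.
board = KanbanBoard({"backlog": None, "create": 2, "observe": 2, "analyze": 1, "done": None})
board.add("backlog", "Does feature X improve recall?")
board.add("backlog", "Is churn seasonal?")
board.move("Does feature X improve recall?", "backlog", "analyze")

try:
    board.move("Is churn seasonal?", "backlog", "analyze")
except WipLimitExceeded as error:
    print(error)
```

The rejected move is the useful signal: the team has to finish the item in the full column before pulling in more work.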
This is a unique approach, and one of its greatest strengths is also its biggest weakness. In the wise words of Jean Vanier:
“Growth begins when we begin to accept our own weakness.”
Nothing is perfect, so what are the best practices to make DDS work for us?
Best DDS practices
The following are 3 best practices if you’re going to use DDS in your organization:
- Allow capability-based iterations – there is no one-size-fits-all here, sometimes it makes sense to have a one-day iteration, other times a three-week iteration (due to how long it takes to acquire/clean data, or how long it takes for an exploratory analysis). The goal should be to allow logical chunks of work to be released in a coherent fashion.
- Decouple meetings from an iteration – this is a double-edged sword. If done right it can yield great results, but done badly it could destroy team performance. Meetings at the end of an iteration are fundamental for learning and planning improvements. But since an iteration could be very short (e.g. one day for a specific exploratory analysis), DDS meetings (such as a retrospective to improve the process) should be based on a logical time-based window, not linked to each iteration.
- Only require high-level item estimation – in many situations, defining an explicit timeline for an exploratory analysis is difficult, so one should not need to generate accurate detailed task estimations in order to use the framework. But, high-level “T-Shirt” level of effort estimates can be helpful for prioritizing the potential tasks to be done.
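As a rough illustration of the last practice, T-shirt sizes can be turned into a simple ordering of the backlog. Everything specific here is an assumption: the `TSHIRT_EFFORT` weights, the `value` scores, and the `prioritize` helper are hypothetical, and DDS does not define a scoring formula.

```python
# Hypothetical relative effort weights for T-shirt sizes.
TSHIRT_EFFORT = {"S": 1, "M": 2, "L": 4, "XL": 8}

# Hypothetical backlog: each question has a rough business value and a size.
backlog = [
    {"question": "Is churn seasonal?", "value": 5, "size": "M"},
    {"question": "Does feature X improve recall?", "value": 8, "size": "XL"},
    {"question": "Are there duplicate customer records?", "value": 3, "size": "S"},
]


def prioritize(items):
    """Sort items by value per unit of rough effort, highest first."""
    return sorted(
        items,
        key=lambda item: item["value"] / TSHIRT_EFFORT[item["size"]],
        reverse=True,
    )


ranked = [item["question"] for item in prioritize(backlog)]
print(ranked)
```

The point is that coarse sizes are enough to rank candidate work; nobody has to commit to a detailed timeline for an exploratory analysis.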
Pros of DDS
- Capability-based iterations: Acknowledges the benefits of defined iterations but skips the inflexibility of hard deadlines to deliver potentially shippable increments.
- Fits with Scrum organizations: The similarities with Scrum extend beyond just the process name which makes DDS attractive to Scrum-friendly organizations.
- Flexible to various life cycles: Not all data science projects follow the same CRISP-DM-like life cycle. By avoiding a life cycle definition, DDS can be adapted to different types of data science projects.
- Decoupling iterations from reviews and retrospectives: Divorcing reviews and retrospectives from the completion of an iteration enables short, frequent iterations while still maintaining a regular schedule for these ceremonies.
Cons of DDS
- Not comprehensive: The lack of a defined life cycle can be a drawback for teams looking for defined steps, like those found in CRISP-DM, TDSP, or the Domino Data Science Lifecycle.
- Decoupling iterations from reviews and retrospectives (alternate view): This could make reviews and retrospectives seem stale when they occur.
Research & Development
Broadly speaking, Research and Development is the general set of approaches for investigating, prototyping, and producing innovative products.
A data science project can be viewed as research where the output transitions into an engineering project. A Research and Development approach divides the overall project into two broad pieces:
- a data science research phase,
- followed by an engineering/development phase.
Each of these phases would be managed using a different methodology, typically a loosely-structured or even ad hoc approach to the data science phase, and an Agile approach for the engineering phase.
What is R&D?
First we have to understand the fundamental differences in method between research and development:
- Research – During this stage, many approaches and analyses are tried quickly and discarded. Some ideas are partially explored and then abandoned. Experiment tracking tools like NeptuneAI or MLflow, combined with a Jupyter or Zeppelin notebook, are used as a history of experiments. Time is not spent writing error recovery code or writing subroutines, because the code might never be used.
- Development – During this stage, there are now requirements, produced as a result of the research stage. However, the code in the researcher’s notebook is generally not production quality. Reengineering the researcher’s code is frequently required to make this code a good fit for a production environment.
Should we use the R&D approach for Data Science Project Management?
Data science and ML are experimental in nature. It’s only natural that the best approach would be R&D, since it keeps the doors open to innovation and avoids funneling creativity through a standardized series of steps.
On a personal note, I think this needs to be said more: any endeavour that requires creativity suffers when it’s highly standardized and regulated, because creativity is messy!
Research and Development have brought data science this far, and if we want to continue pushing the boundaries of what’s possible, we should keep R&D alive.
I have two interesting case studies that lend more weight to this approach: Google Brain and DemandJump. Both use a largely ad hoc approach for the data science phase, and an Agile approach for the development phase.
Best R&D practices
The following methods have shown promise when coupled with the R&D approach:
- Encourage your data scientists to understand the deployment phase and your engineers to understand the research phase. Developing an end-to-end understanding is invaluable.
- Encourage your data scientists to consider the production requirements and technical complexity of deployment when considering model designs in the research phase. For example, if the requirement is for real-time inferences, is the model lightweight enough to support that?
- Embedding data engineers with the data scientists can be successful. Data engineers work to extract and prepare the data needed by data scientists. This allows the data scientists to focus on their highest value, while ensuring that the data engineer is aware of the compromises being made during the model building process, as compared to production data.
- Job rotations are another method to encourage this understanding. For example, a data scientist could become a data engineer for approximately 3 months.
- Functional tests must be designed by the data scientists to enforce the same outcomes—or statistically defined similar outcomes—meaning the production code should match development code on the same data. A representative dataset is selected by the data scientist for testing. For testing deterministic functions such as feature engineering, the production results must equal the functions written by the data scientists. The production team will then implement the tests in the production environment and include the tests in any production QA.
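The functional test described in the last practice can be sketched as a parity check between the researcher’s code and the production rewrite. This is a minimal sketch under assumptions: `research_features`, `production_features`, the feature names, and the sample rows are all hypothetical. A small floating-point tolerance is used so that mathematically equivalent implementations still count as equal.

```python
import math


def research_features(row):
    """Feature engineering as written by the data scientist (hypothetical)."""
    return {
        "log_amount": math.log1p(row["amount"]),
        "is_weekend": int(row["day_of_week"] in (5, 6)),
    }


def production_features(row):
    """Re-engineered production version of the same logic (hypothetical)."""
    return {
        "log_amount": math.log(1.0 + row["amount"]),
        "is_weekend": 1 if row["day_of_week"] >= 5 else 0,
    }


# Representative dataset selected by the data scientist (made-up rows).
representative_rows = [
    {"amount": 0.0, "day_of_week": 0},
    {"amount": 19.99, "day_of_week": 5},
    {"amount": 1250.0, "day_of_week": 6},
]


def assert_feature_parity(rows, tolerance=1e-9):
    """Fail if production features differ from research features on any row."""
    for row in rows:
        expected = research_features(row)
        actual = production_features(row)
        assert expected.keys() == actual.keys(), f"feature names differ for {row}"
        for name in expected:
            assert math.isclose(
                expected[name], actual[name], rel_tol=tolerance, abs_tol=tolerance
            ), f"{name} differs for {row}: {expected[name]} != {actual[name]}"


assert_feature_parity(representative_rows)
print("production features match research features")
```

A check like this can then be handed to the production team and run as part of production QA, as the practice above suggests.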
Tools for R&D
This approach is experimental in nature as the name suggests, so it also encourages the use of interoperable tools and is not limited to a specific set of tools. But some of the common tools for running and tracking experiments in the industry are:
- Jupyter notebook – used to run experiments,
- Zeppelin notebook – used to run experiments,
- GitHub – code versioning and repository,
- NeptuneAI – tracks different experiments and is easily coupled with other tools and frameworks,
- MLflow – tracks different experiments and manages both model and data,
- PyTorch, and so on.
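To show the kind of information an experiment tracker records, here is a minimal stand-in built only on the Python standard library. It is not the NeptuneAI or MLflow API: the `log_run` and `best_run` helpers, the file name, and the example parameters and accuracy values are hypothetical, and a dedicated tool would normally replace this sketch.

```python
import json
import platform
import tempfile
import time
from pathlib import Path


def log_run(log_path, params, metrics, notes=""):
    """Append one experiment run to a JSON Lines file and return the record."""
    record = {
        "timestamp": time.time(),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
        "notes": notes,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged run with the highest value for the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Made-up runs, written to a temporary directory for the example.
log_file = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_run(log_file, {"model": "logreg", "C": 1.0}, {"accuracy": 0.81})
log_run(log_file, {"model": "random_forest", "n_estimators": 200}, {"accuracy": 0.86})

print(best_run(log_file, "accuracy")["params"])
```

Even a log this simple answers the question the research stage keeps raising: which setup produced the best result, and with which parameters.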
Pros of R&D
- Fits data science life cycle: Data science work needs a looser structure than deployment / engineering work. So manage each phase accordingly.
- Fits research backgrounds: Many data scientists come from research backgrounds and will find this approach natural.
Cons of R&D
- Phase transitions: Projects often fail at hard life cycle transition points. Coordination is needed with engineering, IT, and the business at the start of the project – not as an afterthought.
- Ad hoc data science: The initial data science phase could fall victim to the shortcomings of ad hoc processes.
Comparison table
My favorite approach
Writing this article was very interesting to me. I had been using Scrumban to create custom software and data science solutions for my customers, with Trello, GitHub, PyTorch and AWS as my tools of choice, but I wanted to explore what other approaches are out there. It was an interesting challenge, and I learned a lot.
I’m big on interoperable tools because they give you the flexibility to plug and play any new tool or replace an existing one. For that reason, and many others mentioned previously, I fell in love with not just one but three approaches, namely:
Thanks for coming with me on this journey! Here are my final thoughts:
I think it all comes down to what works best for your organization, based on the culture and the people. The most popular approach might be a poor fit for your organization and a fantastic fit for another one.
So, speaking as a friend, engineer, geek and data scientist, I think you should use the data presented here about different approaches, do your own research, and experiment with approaches so you find out which one best suits your organization. Don’t try to fit your company into the most popular approach, but adapt the approach to your needs.
With that said, I organized a huge list of reference links below, for you to dig deeper into any of the topics we discussed. Thank you, and good luck in your projects!
Waterfall vs Agile
Research & Development