A Behind The Scene Story Of Why Scrum Doesn't Work For Data Science/ML Teams

Cristian Dordea

Feb 16, 2024

A firsthand account of why Scrum is not compatible with Data Science/ML work and why a new agile delivery framework is needed.

Highlights:

An inside look at Data Science/ML delivery, why Scrum is not compatible with Data Science/ML work, and the need for a new agile delivery framework.
As always, see AI for Business News You May Have Missed & the latest AI Training & Certifications at the end of our newsletter.

Intro

Two years ago, I had the opportunity to work closely with a Data Science/ML team at one of the biggest companies in the media and entertainment industry. In working with them, I noticed their type of work was different from that of all my other data teams, which resulted in different kinds of problems. I had not seen these specific problems before in my career, even though I had been helping teams solve agile delivery challenges for about 10 years by that time. And since I’m all about taking on new challenges and solving problems, this sparked my interest.

The Data Science/ML team was working towards learning more about our consumers across all of our services, in order to provide a more personalized selection of content to our users and ultimately increase user retention. They were doing this through really interesting methods such as behavioral analytics, user segmentation, content recommendations, and user personalization. As they were working on advanced analytics and all kinds of predictive models, they started to receive more and more requests.

As this was happening, I noticed they handled requests coming into their team in a very ad-hoc way. This was a small team that was working on its own, without being plugged into the more mature agile delivery process of the other data engineering and analyst teams. This worked fine at the beginning, but as they provided more value, more executives became interested in their capabilities and what they had to offer. This resulted in a lot more requests and expectations of faster turnaround, which required a more mature process that would help the team prioritize their work and keep stakeholders informed on delivery progress.

At this point, I started to be more involved. In my role as the Director of Data Delivery & Program Management for the data group, I had two main goals:

Lead and evolve the agile strategy and methodologies for delivery teams for the purpose of enhancing efficiency, collaboration, innovation, and continuous improvement
Scale and drive the adoption of agile methodologies across multiple delivery teams

At that time, my agile delivery team and I, in partnership with the product and engineering leads, had finished standardizing and scaling agile successfully across the entire data group (of approx. 60-80 people) by implementing the scaled agile framework. At the program level, we were planning and executing in product increments of 3 months. At the team level, most teams were using the Scrum framework, with some operational teams using just Kanban. This model worked well for all data engineering and data analyst teams, but the data science/ML team was not included, since they were a smaller team and mostly doing experimentation-type work.

The World of Data Science/ML teams and their challenges

As the team got bigger and their work increased, so did the complexity of managing expectations and the work itself. At first, I suggested what any other delivery lead or agile coach would, which is to start using Scrum as all the other data engineering teams were doing. Boy, did that not work out as expected. When diving deeper into their work, their workflow, and their challenges, I noticed they were all different from those of software development teams or even traditional data teams.

The development process in ML is all about experimenting. First, the team needs to figure out which data points and events are best suited for feature and model creation. The approach to ML involves a lot of mixing and matching of different features, algorithms, and so on to see what works. A significant part of this experimentation is feature engineering, the process of selecting, modifying, or creating features to improve model performance. This process is more art than science, requiring intuition, domain knowledge, and iterative testing to identify the most compelling features. Choosing a suitable algorithm is another critical aspect. Each algorithm has its strengths and flaws depending on the nature of the data and the problem being solved. To help me better understand their data science/ML flow of work, the director of the Data Science team provided a great explanation of their workflow here.
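
To make the mix-and-match idea concrete, here is a minimal Python sketch of such an experiment loop. It is not the team's actual code: the toy dataset, the feature names (watch_minutes, sessions, noise), and the two deliberately simple "algorithms" are all assumptions made up for illustration. The loop tries every feature subset with every algorithm and ranks the combinations by accuracy on held-out data.

```python
import random
from itertools import combinations

# Hypothetical feature names, invented for this illustration.
FEATURES = ("watch_minutes", "sessions", "noise")


def make_toy_data(n=200, seed=0):
    """Build a made-up dataset: a user counts as 'retained' (label 1)
    when watch_minutes + sessions exceeds 10. 'noise' is irrelevant."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = {name: rng.uniform(0, 10) for name in FEATURES}
        rows.append((x, 1 if x["watch_minutes"] + x["sessions"] > 10 else 0))
    return rows


def fit_majority(train, features):
    """Baseline: always predict the most common label in the training set."""
    ones = sum(label for _, label in train)
    majority = 1 if 2 * ones >= len(train) else 0
    return lambda x: majority


def fit_sum_threshold(train, features):
    """Predict 1 when the sum of the chosen features exceeds a threshold,
    picking the threshold that maximizes training accuracy."""
    def score(x):
        return sum(x[f] for f in features)

    best_t, best_acc = None, -1.0
    for t in sorted(score(x) for x, _ in train):
        acc = sum((score(x) > t) == (label == 1) for x, label in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: 1 if score(x) > best_t else 0


ALGORITHMS = {"majority": fit_majority, "sum_threshold": fit_sum_threshold}


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def run_experiments(rows, feature_names, algorithms, train_fraction=0.75):
    """Try every non-empty feature subset with every algorithm and
    return (holdout_accuracy, algorithm_name, features), best first."""
    split = int(len(rows) * train_fraction)
    train, holdout = rows[:split], rows[split:]
    results = []
    for k in range(1, len(feature_names) + 1):
        for features in combinations(feature_names, k):
            for name, fit in algorithms.items():
                model = fit(train, features)
                results.append((accuracy(model, holdout), name, features))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    # Print the three best feature/algorithm combinations.
    for acc, name, features in run_experiments(make_toy_data(), FEATURES, ALGORITHMS)[:3]:
        print(f"{acc:.2f}  {name:<14} {features}")
```

Even in this tiny example, nobody knows which combination wins until all of them have been run, and that is exactly the kind of uncertainty that makes upfront estimation so hard.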

Chatting more with the Director of the Data Science team helped me better understand their unique challenges. With all this said, how were all these differences and challenges manifesting themselves when the team tried to operate within the Scrum framework?

Main challenges with using Scrum by a Data Science/ML team

The main difficulty was not knowing what could be done in a 1-week or 2-week sprint because estimating the user stories and tasks was highly unreliable. This was mainly due to big unknowns in data preparation, model development, and training. The team was not able to correctly estimate how long things would take until they actually started exploring the data and experimenting with model creation. You might say that software development or more traditional data engineering work has the same challenges. To explain the difference, I wrote a more detailed post on How Machine Learning systems stand out from traditional software systems.

The inconsistency and unknowns were so vast that it was impossible to keep a consistent time-based sprint cadence. As a result, the team struggled to finish its sprint commitments within the sprint. The only consistent thing was our inability to match our work and goals to the sprint cadence. The variability was just too high.

From Scrum to Kanban

Once I noticed this, we pivoted the team from Scrum to just Kanban. This eliminated the need to artificially force the team into a fixed timebox and follow a framework that was not compatible with their workflow. The change also eliminated task estimation, which was no longer needed since there were no sprints. This relieved some of the team’s stress and increased the team’s agility. However, using just Kanban came with its own challenges. It wasn’t long until we had to figure out answers to questions such as, “How do you manage timelines and stakeholders’ expectations with just Kanban?” At the end of the day, we were a team in a Fortune 500 company and still had to answer to executives.

How to Scale

To close these gaps, we integrated the team into the rest of the scaled agile processes alongside the other data engineering teams. The difference was that the Data Science team was executing using Kanban instead of Scrum like all the other data teams.

The Data Science team started to take part in the Product Increment (PI) Planning we did with the larger data group. If you are new to scaled agile, you should check out this post. The Data Science team broke down their own work into high-level capabilities, which together formed a high-level goal of delivering a larger predictive model that delivered value. The commitment was made at the PI level, which in our case was 3 months. They didn’t wait until the end of the 3 months to deliver the model. Their goal was to have multiple iterations of the model before the end of the PI, so they could test it and get feedback. By the end of the PI, they would have a working model. Executing using Kanban allowed them to release capabilities as they were done.

This approach worked better for the Data Science team, and our stakeholders were satisfied as well. Even though they were not able to get scheduled commitments on a locked-in weekly or bi-weekly basis, they knew that by the end of the PI, a primary goal would be finished. There was also an agreement that the data science team would strive to complete smaller capabilities, like predictive models or analyses, sprinkled throughout the 3-month PI. This way, the Data Science team was able to stay agile and get feedback on smaller capabilities within the PI as well as at the end of the 3-month PI.

This satisfied all groups: the data science teams, middle management as well as C-level executives.

Most people were happy with this outcome, but I felt there must be a better way to handle these Data Science/ML scenarios. Doing a brief search at that time, I realized there was no established delivery or agile framework for this type of work. Other consultants and agile coaches in the industry I talked with confirmed they did not have a better approach either. That’s when I started researching a better way to handle Data Science/ML work.

A better Agile delivery framework

Fast forward 2 years, and I have found a better Agile delivery framework, in which I recently got certified. This new agile framework is specifically for Data Science/ML projects. It’s called Data Driven Scrum, and in our next newsletter issue, I will dive deep into how this framework is different from Scrum and why it works better for Data Science/ML projects.

Generative AI for Business News You May Have Missed

Google announces AI Cyber Defense Initiative to enhance global cybersecurity
Google LLC today announced a new AI Cyber Defense Initiative and proposed a new policy and technology agenda aimed at harnessing the power of artificial intelligence to bolster cybersecurity defenses globally. (read more)
OpenAI’s Sora joins text-to-video AI content generation race
OpenAI today announced Sora, a new text-to-video model that can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. (read more)
Nvidia unveils first look at Eos, its latest data center-scale supercomputer
Nvidia Corp. today provided the first public look at the architecture powering Eos, the company’s latest data center-scale supercomputer designed for accelerating artificial intelligence development, described as an “AI factory.” (read more)
Microsoft, OpenAI release new research on state-backed hackers’ use of AI models
Microsoft Corp. and OpenAI have revealed that several state-backed hacking groups are using artificial intelligence large language models to support their cyberattack campaigns. (read more)
Weak guidance weighs on Datadog’s stock after strong quarter
Datadog Inc. this morning disclosed that it had an expectation-topping fourth quarter, but weak full-year guidance sent the company’s stock tumbling over 4%. (read more)
Colossyan raises $22M to grow its AI-powered corporate training video production platform
Generative artificial intelligence video-based corporate training startup Colossyan Inc. said today it has closed on a $22 million round of funding. (read more)
OpenAI rolls out ChatGPT memory to select users
OpenAI has begun rolling out memory capabilities to a select number of ChatGPT users this week. Memory will allow the conversational agent to recall details from previous chats in order to provide more personalised and contextually relevant responses. (read more)
UK announces over £100M to support ‘agile’ AI regulation
The UK government has announced over £100 million in new funding to support an “agile” approach to AI regulation. This includes £10 million to prepare and upskill regulators to address the risks and opportunities of AI across sectors like telecoms, healthcare, and education. (read more)

AI Training & Certifications

Serverless LLM apps with Amazon Bedrock:
In this course, you’ll learn how to prompt and customize your LLM responses using Amazon Bedrock and deploy a large language model-based application into production using serverless technology.
Building AI Applications with Vector Databases:
In this 1-hour, beginner-friendly course, you’ll harness the versatility of vector databases to build a wide range of applications using minimal coding!
Andrew Ng, founder of DeepLearning.AI, launches “AI for Everyone” course:
“AI for Everyone”, a non-technical course, will help you understand AI technologies and spot opportunities to apply AI to problems in your own organization.
Generative AI for Executives by AWS
This class shows you how to mitigate hallucinations, data leakage, and jailbreaks. Incorporating these ideas into your development process will make your apps safer and higher quality.
Introduction to Artificial Intelligence (AI) by IBM on Coursera
In this course, you will learn what Artificial Intelligence (AI) is, explore use cases and applications of AI, and understand AI concepts and terms like machine learning, deep learning, and neural networks.