April 15, 2025
Author: Phil Henrickson
I’ve been working with a number of organizations at varying levels of data science (DS) maturity over the last five years. Many of these organizations are just at the start of their journey into using machine learning (ML) and artificial intelligence (AI) to solve problems. These are the lessons I’ve learned about working in data science.
“How do we get started with data science?”
I’m asked this question a lot. Lately, people have started saying “AI” instead of data science, but it’s the same question.
The how isn’t nearly as important as you might think. What matters is that you identify a problem and start somewhere in trying to solve it. I’ve worked with organizations that will talk about the potential use cases they have for AI/ML, the pain that comes from their current processes, and yet they’ll back away from ever starting a project because “we’re just not ready yet; our data’s not ready”.
Here’s the thing: the data will never be perfect, for any project, ever. You will always need to clean, transform, and monitor your data. It’s part of the process. There’s a recurring sentiment in the industry that ‘data scientists spend 80% of their time cleaning data and only 20% creating insights!’, as if this were evidence of a problem. In practice, data scientists spend a lot of time cleaning data because getting to know the data is typically the hardest, fuzziest, and most important part of a project.
Instead of waiting for the data to be perfect, which it never is, you just have to start somewhere. Get into the data as quickly as you can. You will learn more about your data and your problem in one day of working in the weeds with it than in the weeks (or months) you spend creating a project proposal. The mentality you need to embrace: make a rough first attempt, measure how wrong you are, adjust, and repeat.
This process should sound familiar; it’s the whole premise of machine learning.
Data science projects can often feel like a changing of the guard. There’s the old way of doing things, and now someone is proposing a change. This can easily lead to ruffled feathers and resentment between the business and IT, or between the old guard and the new analysts. You’ll often hear data scientists talk about how they presented their findings to a manager or an executive who later rejected them because “they just didn’t understand” or “they’re fine with the way things are”. There are two things to keep in mind about this process.
First, if you’re going to spend time on a data science project, the purpose and value of the project needs to be evident and agreed upon from the start by all parties. There is little sense in pouring time into a project if the stakeholders aren’t interested. A project needs to be focused on solving a problem or creating new capabilities, both of which require buy-in from stakeholders. Don’t spend time working on a solution in search of a problem.
Second, if the stakeholders do not understand the project or its value, you as the data scientist have failed. Communication cannot be an afterthought. Good communication is as important to a project’s success as good code. In my experience, the best way to get stakeholders on board with a project is to get them involved from the very beginning. Involve them in scoping discussions. Ask them for ideas. Solicit their feedback at each iteration. Make them feel involved and included in the new way of doing things, because they are. Don’t spring the end product on them all at once and extol how much better it is than what they’re doing.
In practice, most data science projects have the technical group developing the solution/model and the non-technical group that will be using/interacting with it. Collaboration and trust between these two groups is utterly essential to a successful project. Trust can take months to build and can be lost in a minute. The solution you’ve been working on is only valuable if it is used.
When working on a project, any project, you should quickly ask: what is the simplest approach to solving this problem? Set a performance baseline from that simplest approach and compare everything else to it. There’s a temptation among data scientists, especially junior ones, to throw the most complex models at every problem when simple approaches would work just as well. Bleeding-edge algorithms are cool, but your life will be a whole lot easier if you can get comparable performance from a linear model.
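A minimal sketch of what baseline-first evaluation can look like in practice: fit the simplest reasonable model, record its score, and require anything fancier to clearly beat it. The data, features, and model choices here are purely illustrative, not from any particular project.

```python
# Baseline-first evaluation: compare a complex model against a plain
# linear model on the same held-out data. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: a plain linear model.
baseline = LinearRegression().fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Challenger: something more complex.
challenger = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
challenger_mae = mean_absolute_error(y_test, challenger.predict(X_test))

print(f"baseline MAE:   {baseline_mae:.3f}")
print(f"challenger MAE: {challenger_mae:.3f}")
# If the complex model can't clearly beat the baseline, ship the baseline.
```

On data like this, where the true relationship is close to linear, the linear baseline holds its own against the boosted trees, which is exactly the point: the complex model has to earn its complexity.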
In the same way, organizations make things easier for themselves by starting with simple projects. A common pitfall I’ve seen, especially for organizations that see themselves as being behind in the AI/ML landscape, is to try to make a massive leap forward with one big project. This leads to a project that is massive in scope, has ever-changing and poorly-defined requirements, and involves many different teams attempting to work together for the first time. This is almost always a recipe for failure, regardless of the domain. It usually leads to the conclusion that data science isn’t worth the investment, when the failure had nothing to do with data science.
Start small and try to answer the question, “is there value to using AI/ML for this problem?” as quickly as you can. If the answer turns out to be no, punt and move on to another project, or find a way to simplify the problem.
Instead of trying to train a hierarchical model to generate minute-by-minute forecasts for every product in your inventory, first see how well you can forecast total sales a month out.
Instead of trying to use a classifier to tag every possible emotion expressed in your customer feedback surveys, see if you can use survey responses to predict whether customers will buy from you again. Or, better yet, just add a question to the survey asking whether they’d buy from you again.
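The scaled-down survey problem above can be sketched in a few lines: a single yes/no outcome predicted from a couple of survey fields with a logistic regression, instead of a multi-label emotion tagger. The feature names and data here are invented for illustration.

```python
# "Start small" version of the survey problem: predict a single yes/no
# outcome (will they buy again?) from two survey fields.
# All data is synthetic; the features are hypothetical examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 400
satisfaction = rng.integers(1, 6, size=n)     # 1-5 survey rating
support_tickets = rng.poisson(1.0, size=n)    # recent support tickets
# Synthetic ground truth: happy customers with few tickets tend to return.
logits = 1.2 * (satisfaction - 3) - 0.8 * support_tickets
will_repurchase = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X = np.column_stack([satisfaction, support_tickets])
model = LogisticRegression().fit(X, will_repurchase)
accuracy = accuracy_score(will_repurchase, model.predict(X))
print(f"training accuracy: {accuracy:.2f}")
```

Something this simple answers the "is there value here?" question in an afternoon; only if the answer is yes does it make sense to invest in anything more elaborate.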
Invariably, like a moth drawn to a flame, conversations in data science lead to discussions about the stack: the suite of technologies an organization happens to be working in with its data. This is a subject which draws a tremendous amount of attention and bandwidth: Which vendors should we be using? What should we build and what should we buy? Which languages/cloud providers/modeling frameworks/LLMs/[insert any-new-tech-trend] should we be using?
To any of these questions, I can give the answer, “it depends”, which is both correct and largely unhelpful.
Ultimately, I don’t think the answer to getting value from data science lies in the technology you use. All of the technology in the world does not matter if you do not understand your problem and the data and methodology that would help you solve it.
There is no easy button; there is no tool you can buy that will start solving all of your problems. The value in (data) science usually doesn’t come from algorithms, tools, or platforms. It comes from the creativity, dedication, and passion of people who ask questions and care about finding the right answer.
For an organization, the journey into data science isn’t just a matter of hiring someone with machine learning on their resume. It’s not about buying into a vendor with a URL ending in .ai. The journey into data science is about finding the right blend of engineering, science, software development, communication, and project management, where people and process are just as important as any of the technologies you use.