Lessons learned in my first year as a data scientist

It’s been a year since I started my first job as a data scientist. In that time, I’ve learned a lot, but most of that learning hasn’t been the type I expected. I’ve certainly learned some things about new technologies and techniques, but much of what I’ve learned has been about how to actually make my skills useful to others in the company.

For context, I work in a healthcare company where the data science group acts kind of like a consulting firm - we take on projects to create predictive models and provide other analytical support. For that reason, a lot of the lessons I’ve learned have more to do with the challenges of operationalizing data science projects in the business than with the modeling itself.

Feature engineering is important

One of the big lessons from the year is that feature engineering is the most important part of building a predictive model. This might not be a surprise to anyone with a lot of machine learning experience, and I had certainly heard it before I started. But if I’m honest, back then it was hard for me to come up with examples of when or why it was important. Seeing repeated examples of this in practice has really made me understand it on a deeper level. The kind of model you use doesn’t matter if your features suck, and being able to understand what kinds of features will be useful is a skill that requires abstract thinking and intuitive math. This is what makes machine learning hard.
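A toy sketch can make this concrete. It is not from any real project: the data is synthetic, the "model" is just the best single-threshold rule on one feature, and names like `best_threshold_accuracy` and `radius_sq` are made up for illustration. The label depends on distance from the origin, so no threshold on a raw coordinate does much better than guessing the majority class, while the same simple rule on one engineered feature separates the classes.

```python
import random


def best_threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on one feature.

    Tries every observed value as a threshold, in both directions
    (predict 1 below the threshold, or predict 1 above it).
    """
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        hits = sum((v <= t) == bool(y) for v, y in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)
    return best


# Synthetic data (an assumption for illustration): points in a square,
# labelled 1 when they fall inside a circle around the origin.
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
labels = [1 if x * x + y * y < 0.5 else 0 for x, y in points]

raw_x = [x for x, _ in points]
raw_y = [y for _, y in points]
radius_sq = [x * x + y * y for x, y in points]  # the engineered feature

acc_raw_x = best_threshold_accuracy(raw_x, labels)
acc_raw_y = best_threshold_accuracy(raw_y, labels)
acc_engineered = best_threshold_accuracy(radius_sq, labels)

print("raw x:", acc_raw_x)
print("raw y:", acc_raw_y)
print("x^2 + y^2:", acc_engineered)
```

The model never changed between the three runs. Only the feature did.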

Getting the best performance is not important

When I was initially learning data science and machine learning, there was a lot of emphasis on performance metrics. The truth is, for almost every project, you can get 90% of the actual value of the project quickly and the last 10% will take you ten times longer. For most projects, it just isn’t worth it. On Kaggle you want to perfect the model as much as possible, because a 0.01 increase in your performance metric of choice can make or break you. In the real world, tiny increases rarely matter (though obviously this depends on the industry and project).

The hard part of projects isn’t the machine learning, but everything else

More important than knowing how to build the best predictive model is knowing how to translate a business problem into a problem a predictive model can help with, and how to translate a predictive model into something actionable. When people from other departments hear we can build predictive models, they often come up with some idea of something they think would be interesting to predict. Often, their ideas don’t make sense. Explaining why that’s the case, walking through the technical issues involved, and coming up with an alternative that will work - that’s the important part. It’s also important that once a model is built, the output is actually usable. It’s easy to predict things, but often the prediction is only useful if it comes with some additional insight. Tools like lime have been useful and are part of the solution, but they are not a silver bullet: sometimes your features are not actionable or easy to understand for the end user, so more thought is needed than just sticking lime on the end.
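To show what "a prediction plus some insight" might look like, here is a minimal sketch. It is not lime and does not use lime’s API. It is a much simpler stand-in that only works for a linear score, where each feature’s contribution is its weight times how far the value sits from a baseline. The feature names, weights, baseline, and patient values are all hypothetical.

```python
def explain_linear_score(weights, baseline, row):
    """Rank features by how much each one moves a linear score.

    contribution = weight * (value - baseline value), largest magnitude first.
    """
    contributions = {
        name: weights[name] * (row[name] - baseline[name]) for name in weights
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical model weights and a hypothetical "typical patient" baseline.
weights = {"prior_admissions": 0.8, "age_decades": 0.3, "missed_appointments": 0.5}
baseline = {"prior_admissions": 1.0, "age_decades": 5.0, "missed_appointments": 0.5}

# One hypothetical patient to explain.
patient = {"prior_admissions": 3.0, "age_decades": 6.0, "missed_appointments": 0.5}

ranked = explain_linear_score(weights, baseline, patient)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.2f}")
```

Even with a ranking like this, the caveat above still applies: telling someone a score is driven by prior admissions is only helpful if they can do something with that information.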

Ask the right questions BEFORE you start a project

Finally, probably the most important lesson I’ve learned the hard way this year: Make sure you know up front where the data is going to come from and how the project is going to be used before putting too much effort into it or committing to any kind of deadline. I had one project described to me that I was excited to tackle and promised a quick turn-around on, only to learn that writing the algorithm was the easy part. I was given a sample of data and had assumed that meant there was a process in place for me to acquire the data. In reality, there was no such process, and it took months to figure out how to get access to the necessary data. I ended up looking pretty bad for going way over my initial time estimate, even though most of that time was spent waiting for people to respond to emails so I could track down and get access to the necessary data. Figuring out data pipelines is really hard in a big company!

I also had to pick up a few projects that were started before me when another data scientist left. One was a dashboard put together to help navigate and visualize some text data. I was amazed when I talked to the end-user to check what else they wanted done with the project, and discovered all of the changes they wanted involved removing features. The dashboard was massively over-engineered, with some of its major functionality either removed or left in but never used. Worse, the dashboard was never meant to be in long-term use and would be retired in a year, so there was no chance these features would find eventual use. This meant time was wasted twice - once creating those features, and again going back and removing them to simplify the dashboard.

This and other experiences really taught me the importance of having a close feedback loop with the end user of whatever project I’m working on. A biweekly, half-hour meeting can end up saving a lot of time.