Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces

Data science has been sweeping the tech world. With a large variety of powerful, free, open-source tools and now the computing power to utilize them to their full potential, data science is more accessible than ever and has become America’s hottest job. One problem: there’s no consensus on what data scientists really do in a professional setting.

There has been a rise in romantic thought pieces lately (especially on Medium) about how data scientists are wizards and can solve any problem (with bonus points if it cites AI). If you follow publications like Towards Data Science, you’ll notice persistent tropes in the more code-oriented posts: Python is the king programming language for data science, use scikit-learn/XGBoost and logistic regression for predicting categorical variable(s), use pandas for processing tabular data, use NLTK/word2vec for processing text data, use TensorFlow/Keras/convolutional neural networks for processing image data, use k-means for clustering data, split the processed dataset into training and test datasets for model training, tweak hyperparameters/model features until results on the test dataset are good, etc.

These tropes aren’t inappropriate or misleading, but the analysis often doesn’t quantify the insight/value of the results. Modeling is just one small part (and often the easiest part) of a very complex system.

Data-oriented MOOCs (Massive Open Online Courses) like Andrew Ng’s Coursera course on Machine Learning and fast.ai’s course on Deep Learning are good academic introductions to the theory and terminology behind data science and other related fields. Although MOOCs have many practice problems for prospective data scientists to solve, they don’t make you an expert in the field capable of handling messier real-world problems, nor do they claim to.

Modern data science isn’t about burying your head in a Jupyter Notebook and staring at the screen watching training loss numbers trickle down (although it’s definitely fun!). There’s a lot more to it, some of which I’ve learned firsthand working as a Data Scientist at BuzzFeed for over a year. To borrow a statistical term, MOOCs and thought pieces overfit to a certain style of data science that is not robust to the vast uncertainties of the real world.

The Cost/Benefit Tradeoffs of Data Science

Data science often follows the Pareto principle: 80% of the work takes 20% of the effort. Thought pieces demonstrate that you can just toss data indiscriminately into scikit-learn or a deep learning framework and get neat-looking results. The value of a data scientist, however, lies in knowing when, and if, to invest in further development of a model.

Kaggle competitions are a popular and often-recommended way to get exposure to real-world data science problems. Many teams of statisticians compete to create the best model for a given dataset (where “best” usually means minimizing the predictive loss/error of the model), with prizes for the highest-performing models. Kaggle also encourages clever modeling techniques such as grid search of thousands of model hyperparameter combinations and ensembling disparate models to create a megamodel which results in slightly better predictive performance, but just might give the edge to win.

However, there are a few important differences between modeling in a Kaggle competition and modeling in a data science team. Kaggle competitions last for weeks, whereas a professional data scientist may need to spend that time on other things. Ensembling gigantic machine learning models makes predictions very slow and the models themselves very large, both of which may cause difficulty deploying them into production (e.g. the Netflix Prize movie recommendation models famously “did not seem to justify the engineering effort needed to bring them into a production environment”). And most importantly, there may not be a significant practical performance difference between a 1st place Kaggle model that takes days/weeks to optimize and a simple scikit-learn/XGBoost baseline that can be built in a few hours.

Counterintuitively, it may be better to trade performance for speed/memory with a weaker-but-faster model; in business cases, speed and scalability are important implementation constraints. But even with scikit-learn, the model is still a black box, and the data scientist has little idea how it makes its decisions. One final option is to go back to basics altogether with a “boring” linear/logistic regression model, where the predictive performance may be even weaker and the model must satisfy several statistical assumptions, but the model feature coefficients and their statistical significance are easy to interpret, which helps explain the importance of each input feature (if any) and supports actionable, informed decisions for the business. Being a data scientist requires making educated judgments about these tradeoffs.
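To make the interpretability point concrete, here is a minimal sketch of a one-feature ordinary least squares fit using only the Python standard library. The helper name `fit_simple_ols` and the headline/click numbers are made up for illustration; they are not from any real dataset or library.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2: share of the variance in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical data, invented for this sketch: headline length vs. clicks.
headline_words = [4, 6, 8, 10, 12]
clicks = [112, 128, 151, 169, 190]

slope, intercept, r_squared = fit_simple_ols(headline_words, clicks)
print(f"Each extra word is associated with {slope:.2f} more clicks "
      f"(intercept {intercept:.1f}, R^2 {r_squared:.3f})")
```

The slope reads directly as "change in the outcome per unit of the feature", which is the kind of statement a stakeholder can act on. A real analysis would also check the regression assumptions and the significance of each coefficient before drawing conclusions.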

Data Scientists Still Use Business Intelligence Tools

A hobbyist data scientist without a budget may opt to build their own workflows and data pipelines using free tools. However, professional data scientists have a finite amount of free time (as do all engineers), so there’s a massive opportunity cost when reinventing the wheel unnecessarily. Enterprise BI tools such as Tableau, Looker, and Mode Analytics help retrieve and present data with easy-to-digest dashboards for anyone in the company. They’re never cheap, but they’re much cheaper to the company than having a data scientist spend valuable time to develop and maintain similar tooling over time.

If a stakeholder wants a data report ASAP, there’s no problem falling back to using SQL to query a data warehouse and output results into an Excel spreadsheet (plus pretty data visualizations!) that can be sent quickly in an email. Part of being a data scientist is working out which tools are most appropriate at what time.
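As a rough sketch of that quick-report workflow, the snippet below runs an aggregate SQL query and writes the result as CSV, which Excel opens directly. An in-memory SQLite database stands in for the data warehouse, and the `pageviews` table and its rows are hypothetical.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the real warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (article TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("quiz", 1200), ("listicle", 800), ("quiz", 300)],
)

# Let the database do the aggregation instead of application code.
cursor = conn.execute(
    "SELECT article, SUM(views) AS total_views "
    "FROM pageviews GROUP BY article ORDER BY total_views DESC"
)
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()
conn.close()

# Write the report as CSV; to get a file to attach to an email, replace
# the StringIO buffer with open("report.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
report_csv = buffer.getvalue()
print(report_csv)
```

Against a production warehouse only the connection line would change, since most Python database drivers expose a similar cursor interface.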

Some might argue that using BI tools and SQL is not a responsibility for data scientists, but instead for Business Analysts or Data Analysts. That’s a No True Scotsman way of looking at it; there’s a lot of overlap in data science with other analytical fields, and there’s nothing wrong with that.

Data Scientists Are Software Engineers Too

Although MOOCs encourage self-study, data science is a collaborative process. And not just with other data scientists on a team, but with other software engineers in the company. Version control tools like Git are often used by data scientists to upload their portfolio projects publicly to GitHub, but they have many other important features for use in a company-wide collaborative environment, such as branching a repository, making pull requests, and resolving merge conflicts. Beyond that are modern development QA practices, such as test environments, consistent code style, and code reviews. The full process varies strongly by company: Airbnb has a good thought piece about how they utilize their Knowledge Base for data science collaboration using Git.

One of the very hard and surprisingly underdiscussed aspects of data science is DevOps, and how to actually get a statistical model into production. Docker containers, for example, are newer technology that’s hard to learn, but have many data science and DevOps benefits by mitigating Python dependency hell and ensuring a consistent environment for model deployment and execution. And once the model is in production, data scientists, data engineers, and dedicated DevOps personnel need to work together to figure out if the model has the expected output, if the model is performing with expected speed/memory overhead, how often to retrain the model on fresh data (plus the scheduling/data pipelining necessary to do so), and how to efficiently route predictions out of the system to the user.

Data Science Can’t Solve Everything

Data science experiments (even those utilizing magical AI) are allowed to fail, and not just in the fail-to-reject-the-null-hypothesis sense. Thought pieces typically discuss successful projects, which leads to a survivorship bias. Even with massive amounts of input data, it’s likely for a model to fail to converge and offer zero insight, or for an experiment to fail to offer statistically significant results (common with A/B testing).
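For a sense of how easily an A/B test comes back inconclusive, here is a minimal sketch of a two-sided, two-proportion z-test using only the standard library. The function name and the conversion counts are hypothetical, and the 0.05 threshold is just the conventional choice.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: variant B converts at 11.5% vs. 10.0% for A.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=115, n_b=1000)
significant = p_value < 0.05
print(f"z = {z:.2f}, p = {p_value:.2f}, significant at 0.05: {significant}")
```

With a thousand users per arm, a 15% relative lift still does not clear the 0.05 threshold in this made-up case. That is a legitimate result to report, not something to rerun until it does.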

real world data science is an R² of 0.10 #GoogleNext18 pic.twitter.com/qNsno2dscR
— Max Woolf (@minimaxir) July 24, 2018

The difficulty of real-world data science is recognizing if a given problem can be solved, how much of your valuable time to spend iterating to maybe solve it, how to report to stakeholders if it can’t be solved, and what the next steps are if that’s the case.

Don’t p-hack!

Data Science and Ethics

During the rise of the “data science/AI is magic!” era, massive algorithmic and statistical failures suggest that data science might not always make the world a better place. Amazon built a resume-reading model which accidentally learned to be sexist. Facebook overestimated performance metrics on their videos, causing complete business pivots for media organizations in vain, indirectly leading to hundreds of layoffs. YouTube’s recommended video algorithms drove children towards shocking and disturbing content. And these companies have some of the best data talent in the entire world.

The qualitative output of a model or data analysis is just as important as the quantitative performance, if not more. Allowing dangerous model output to hit production and impact millions of consumers is a failure of QA at all levels. In fairness, these companies usually fix these issues, but only after journalists point them out. The problem with blindly chasing a performance metric (as in a Kaggle competition) is that it ignores collateral, unexpected effects.

Don’t be data-driven. Be data-informed. Metrics should never be in charge because they have no moral compass.
— Kim Goodwin (@kimgoodwin) October 15, 2018

Maybe recommending shocking videos is what maximizes clickthrough rate or ad revenue per the models according to a business dashboard. Unfortunately, if the data justifies it and the business stakeholders encourage it, the company may accept the consequences of a flawed algorithm if they don’t outweigh the benefits. It’s important for data scientists to be aware that they may be party to that.

Conclusion

I realize the irony of using a data science thought piece to argue against data science thought pieces. In fairness, some Medium thought pieces do apply data science in very unique ways or touch on very obscure-but-impactful aspects of frameworks, and I enjoy reading those. The field is still very broadly defined, and your experiences may differ from this post, especially if you’re working for a more research-based institution. Unfortunately, I don’t have any new advice for getting a data science job, which is still very difficult.

The popular idea that being a data scientist is a 40-hours-a-week Kaggle competition is incorrect. There’s a lot more to it that’s not as sexy which, in my opinion, is the more interesting aspect of the data science field as a whole.

Max Woolf (@minimaxir) is a Data Scientist at BuzzFeed in San Francisco who works with AI/ML tools and open source projects. Max’s projects are funded by his Patreon.
