Understanding the Realities of Data Science: Insights for Professionals
Data science is often perceived through an overly optimistic lens, shaped largely by major tech companies and various online narratives. As someone with a background in software engineering, I frequently find myself navigating the relationship between data scientists and engineers, and the friction in that dynamic is undeniable.
In this article, I will share my reflections as a data scientist, outlining key realizations and humorous insights along the way. Here’s what you can expect to take away:
- Five key understandings about data science,
- Fifteen lessons for building an effective data science team,
- A sprinkle of my distinctive humor.
Understanding 1: The Importance of a Hypothesis
Data science encompasses much more than just model tuning or adjusting parameters. It fundamentally revolves around the scientific method, which we learned in school. Data scientists analyze datasets alongside business challenges, formulating experimental plans to achieve objectives. However, it's easy to fall into the trap of merely applying various models in search of one that performs adequately. This approach not only lacks elegance but can lead to misguided conclusions.
Consider the dramatic rise in Bitcoin's price to over $50,000 in February 2024. Was that spike something a purely technical analysis could explain, or was deeper market psychology at play? If we fail to start with a hypothesis, we risk confusing correlation with causation.
> Lesson 1: Begin with a hypothesis to avoid modeling mere correlation instead of causation.
> Lesson 2: Avoid retrofitting models without a hypothesis, which can lead to survivorship bias.
> Lesson 3: Ensure that your colleagues provide documentation on model rationale before deployment.
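To make the hypothesis-first point concrete, here is a minimal sketch, assuming a simple binary prediction task: the claim, the metric, and the acceptance threshold are written down before any model is fit. Every name here (the Hypothesis class, the scores) is illustrative rather than taken from a real project.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """A falsifiable claim, written down before any model is trained."""
    claim: str
    metric: str        # metric used to judge the claim
    min_uplift: float  # smallest improvement over the baseline we will accept

# The hypothesis comes first; the modeling comes second.
h = Hypothesis(
    claim="Adding on-chain volume features improves next-day direction prediction",
    metric="roc_auc",
    min_uplift=0.02,
)

def supports(hypothesis: Hypothesis, baseline_score: float, candidate_score: float) -> bool:
    """Accept the candidate only if it backs the pre-registered hypothesis."""
    return (candidate_score - baseline_score) >= hypothesis.min_uplift

# Scores should come from a proper cross-validated experiment, not one lucky split.
print(supports(h, baseline_score=0.61, candidate_score=0.64))  # True
```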
Understanding 2: The Need for Cleanliness
Jupyter Notebooks evoke mixed feelings in me. While they're excellent for experimentation, they often end up being cluttered and disorganized. Unfortunately, these messy notebooks can become the immediate deliverables for projects at work. Who cleans up this chaos? Often, engineers are left to interpret our work without sufficient guidance.
Imagine if you modified a log-transform function in a model training notebook but didn't document it. Would the engineers know to update the production data pipeline accordingly? If you revisit that notebook in two months, how frustrating would it be to decipher your past self's work?
> Lesson 4: Document your code thoroughly, especially if you expect others to use it later.
> Lesson 5: Be considerate of your colleagues. Document changes in training pipelines to ensure clarity.
> Lesson 6: Keep an inventory of necessary documentation and artifacts from your data science team.
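One way to act on Lessons 4 and 5 is to move the transform out of the notebook into a small shared module that both the training code and the production pipeline import, and to document changes where the code lives. This is only a sketch under that assumption; the file name features.py and the function below are hypothetical.

```python
# features.py -- imported by both the training notebook and the production pipeline.
import numpy as np

def transform_amount(amount: np.ndarray) -> np.ndarray:
    """Log-transform the transaction-amount feature.

    Changelog:
        2024-02: switched from log(x) to log1p(x) so zero amounts no longer
        produce -inf. The production data pipeline must apply the same change.
    """
    return np.log1p(amount)
```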
Understanding 3: Embrace DataOps and MLOps
Your models may be performing well, but if the training processes are confined to your laptop, you're limiting their potential. Even the ancient Romans recognized the value of infrastructure, building aqueducts for progress.
Data scientists should familiarize themselves with MLOps and data engineering concepts. Instead of dismissing them as plumbing tasks, treat them as essential components of your workflow. Knowledge in these areas sharpens your ability to specify upstream and downstream requirements and to automate model and dataset benchmarking.
> Lesson 7: Constantly automate your processes to avoid becoming a bottleneck.
> Lesson 8: Understand related domains to enhance your expertise.
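As a sketch of what automated benchmarking might look like, assuming a frozen evaluation set and serialized model artifacts, the script below scores a baseline and a candidate with the same metric and writes a report any scheduler or CI job could publish. The file names and metric choice are assumptions, not a prescription.

```python
# benchmark.py -- intended to run automatically (e.g. nightly or on every merge request).
import json

import joblib
import pandas as pd
from sklearn.metrics import f1_score

# A frozen, versioned evaluation set keeps benchmark results comparable over time.
eval_df = pd.read_parquet("frozen_eval_set.parquet")
X, y = eval_df.drop(columns=["label"]), eval_df["label"]

results = {}
for name in ["baseline", "candidate"]:  # hypothetical serialized model artifacts
    model = joblib.load(f"models/{name}.joblib")
    results[name] = float(f1_score(y, model.predict(X)))

# Publish the report where the whole team can see it.
with open("benchmark_report.json", "w") as f:
    json.dump(results, f, indent=2)
print(results)
```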
Understanding 4: Avoid Isolated Development
Perfectionism can often hinder progress. No model is flawless; a model is only valuable insofar as it serves a purpose. A model that doesn't receive real-world traffic is essentially useless.
What happens when stakeholders inquire about your model's status only to discover it's not in production? Engineers may end up shouldering the blame for not implementing it on time. However, the onus is on us to ensure our models are accessible.
When planning experiments, clearly define what success looks like. This way, if you achieve your targets, you can confidently push your model to production.
> Lesson 9: Developing in isolation wastes resources and creates friction between teams.
> Lesson 10: Prevent your team from getting trapped in endless experimentation without tangible outputs.
> Lesson 11: Establish clear goals for experiments and automate the release decision-making process (see the sketch after these lessons).
> Lesson 12: If uncertain about your model's performance, consider alternative release strategies rather than hiding it.
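Lessons 11 and 12 can be partly automated with a small release gate: the thresholds are agreed upon when the experiment is planned, and the decision between promoting, shadow-deploying, or rejecting the candidate follows mechanically. This is a minimal sketch; the threshold values and metric are purely illustrative.

```python
# release_gate.py -- turns the pre-agreed success criteria into a release decision.
PROMOTE_THRESHOLD = 0.75  # agreed with stakeholders before the experiment started
SHADOW_THRESHOLD = 0.70   # good enough to observe on live traffic without serving it

def release_decision(candidate_f1: float, baseline_f1: float) -> str:
    if candidate_f1 >= PROMOTE_THRESHOLD and candidate_f1 > baseline_f1:
        return "promote"        # serve real traffic
    if candidate_f1 >= SHADOW_THRESHOLD:
        return "shadow_deploy"  # an alternative release strategy instead of hiding it
    return "reject"             # back to experimentation, with a documented reason

print(release_decision(candidate_f1=0.72, baseline_f1=0.71))  # shadow_deploy
```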
Understanding 5: The Limits of Automation in Data Science
While advancements like LLMs and AutoML are impressive, data science is not entirely automatable yet. Metrics and results still require the scrutiny of data product owners before deployment.
Moreover, someone has to judge whether a target, such as a 10% improvement in F1 score, is sustainable over time. Without data scientists, engineers may find themselves ill-equipped to innovate in the ML space.
Managed solutions can facilitate rapid prototyping, while AutoML aids in decision-making. However, it's crucial to recognize the limitations of these tools and the unique insights that skilled data scientists provide.
> Lesson 13: Diversify your tech team with a blend of managed solutions, AutoML, and LLMs to reduce reliance on a dedicated data science function.
> Lesson 14: Full-stack, data-literate engineers are rare. Be cautious of claims that data science can be fully automated.
> Lesson 15: A mix of managed solutions, AutoML, and LLMs can make insights accessible, but delivering high-value, original insights remains vital.
If you've read this far, I hope my reflections resonate with you. This is a departure from my usual technical discussions, but if you enjoy this format, let me know! I welcome your thoughts—let's foster constructive dialogue.
Until next time, this is Louis.
> Note: The views expressed in this article are my own and do not reflect the opinions of anyone else.