Python Became the Language of Data in 2017

For years, the data science community had two dominant languages: R, beloved by statisticians and academics, and Python, the general-purpose language that increasingly had good data tools. By 2017, the balance had shifted decisively toward Python, and the reasons are instructive.

The library ecosystem was the decisive factor. Pandas had made data manipulation in Python as expressive as R's data frames. NumPy and SciPy provided the scientific computing foundation. Matplotlib and Seaborn gave visualisation. Scikit-learn gave machine learning. TensorFlow and Keras gave deep learning. Together, these libraries covered essentially everything a data scientist needed.

R had excellent statistical tools and a passionate community, but the Python ecosystem was broader and deeper in areas beyond pure statistics. Machine learning frameworks overwhelmingly supported Python. Web APIs were easier to work with in Python. Deploying models to production was more straightforward in Python, because the model could run in the same language as the services around it. For teams that needed to take models from analysis to production, Python's versatility was decisive.

The Jupyter notebook deserves special mention. It had existed for years as the IPython Notebook, but the Jupyter rebranding and the improvements made in 2015 and 2016 turned interactive, reproducible data analysis into a genuinely good experience. Writing code, seeing results, adding narrative text, and sharing the whole thing as a document: this workflow matched how data scientists actually thought about their work. Jupyter worked with Python natively and felt like Python's natural environment.

The community effect compounded things. More machine learning tutorials were written in Python. More Stack Overflow answers assumed Python. More job descriptions asked for Python. This created a self-reinforcing cycle: Python was where the resources were, so people learned Python, which created more resources.

For engineers who had been writing Python for years, the data science ecosystem was accessible in a way that R was not. You could start doing data analysis without learning a new language. You could put your analysis into production using the same deployment patterns as your other Python code. The boundary between "data scientist" and "software engineer" started to blur for teams using Python throughout their stack.

By the end of 2017, telling a data science hire that you primarily used R was a mild red flag. That was not because R was bad; it remained excellent for certain statistical tasks. But the general-purpose data science stack was Python, and that preference has only solidified in the years since.
