Python is not a great language for data science. Part 1: The experience
It may be a good language for data science, but it’s not a great one.
Yes, I’m ready to touch the hot stove. Let the language wars begin.
Actually, the first thing I’ll say is this: Use the tool you’re familiar with. If that’s Python, great, use it. And also, use the best tool for the job. If that’s Python, great, use it. And also, it’s Ok to use a tool for one task just because you’re already using it for all sorts of other tasks and therefore you happen to have it at hand. If you’re hammering nails all day it’s Ok if you’re also using your hammer to open a bottle of beer or scratch your back. Similarly, if you’re programming in Python all day it’s Ok if you’re also using it to fit mixed linear models. If it works for you, great! Keep going. But if you’re struggling, if things seem more difficult than they ought to be, this article series may be for you.

I think people way over-index on Python as the language for data science. It has limitations that I think are quite noteworthy. There are many data-science tasks I’d much rather do in R than in Python.1 I believe the reason Python is so widely used in data science is a historical accident, plus it being sort-of Ok at most things, rather than an expression of its inherent suitability for data-science work.
At the same time, I think Python is pretty good for deep learning.2 There’s a reason PyTorch is the industry standard. When I’m talking about data science here, I’m specifically excluding deep learning. I’m talking about all the other stuff: data wrangling, exploratory data analysis, visualization, statistical modeling, etc. And, as I said in my opening paragraphs, I understand that if you’re already working in Python all day for a good reason (e.g., training AI models) you may also want to do all the rest in Python. I’m doing this myself, in the deep-learning classes I teach. This doesn’t mean I can’t be frustrated by how cumbersome data science often is in the Python world.
Observations from the trenches
Let’s begin with my lived experience, without providing any explanation for what may be the cause of it. I have been running a research lab in computational biology for over two decades. During this time I have worked with around thirty graduate students and postdocs, all very competent and accomplished computational scientists. The policy in my lab is that everybody is free to use whatever programming language and tools they want to use. I don’t tell people what to do. And more often than not, people choose Python as their programming language of choice.
So here is a typical experience I have with students who use Python. A student comes to my office and shows me some result. I say “This is great, but could you quickly plot the data in this other way?” or “Could you quickly calculate this quantity I just made up and let me know what it looks like when you plot it?” or similar. Usually, the request I make is for something that I know I could do in R in just a few minutes. Examples include converting boxplots into violins or vice versa, turning a line plot into a heatmap, plotting a density estimate instead of a histogram, performing a computation on ranked data values instead of raw data values, and so on. Without fail, the response from the students who use Python is: “This will take me a bit. Let me sit down at my desk and figure it out and then I’ll be back.” Now let me be absolutely clear: These are strong students. The issue is not that my students don’t know their tools. It very much seems to me to be a problem of the tools themselves. They appear to be sufficiently cumbersome or confusing that requests that I think should be trivial frequently are not.3
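To make one of these requests concrete, here is what “the same computation on ranked data values instead of raw data values” might look like in pandas. The numbers are toy data I made up purely for illustration:

```python
import pandas as pd

# Toy data, made up purely for illustration
df = pd.DataFrame({"x": [3.1, 1.2, 5.0, 2.7],
                   "y": [10.0, 4.0, 12.0, 6.0]})

# Correlation on the raw values (Pearson)
raw_corr = df["x"].corr(df["y"])

# The same computation on ranked values (equivalent to Spearman correlation)
rank_corr = df["x"].rank().corr(df["y"].rank())
```

For a pure computation like this, swapping in `.rank()` is a one-token change; whether the analogous change to a plot is equally quick depends heavily on the plotting library, which is the crux of this section.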
No matter the cause of this experience, I have to conclude that there is something fundamentally broken with how data analysis works in Python. It may be a problem with the language itself, or merely a limitation of the available software libraries, or a combination thereof, but whatever it is, its effects are real and I see them routinely. In fact, I have another example, in case you’re tempted to counter, “It’s a skill issue; get better students.” Last fall, I co-taught a class on AI models for biology with an experienced data scientist who does all his work in Python. He knows NumPy and pandas and matplotlib like the back of his hand. In the class, I covered all the theory, and he covered the in-class exercises in Python. So I got to see an expert in Python working through a range of examples. And my reaction to the code examples frequently was, “Why does it have to be so complicated?” So many times, I felt that things that would be just a few lines of simple R code turned out to be quite a bit longer and fairly convoluted. I definitely could not have written that code without extensive studying and completely rewiring my brain in terms of what programming patterns to use. It felt very alien, but not in the form of “wow, this is so alien but also so elegant” but rather “wow, this is so alien and weird and cumbersome.” And again, I don’t think this is because my colleague is not very good at what he’s doing. He is extremely good. The problem appears to be in the fundamental architecture of the tools.
Some general thoughts about what makes a good language for data science
Let me step back for a moment and go over some basic considerations for choosing a language for data science. When I say data science, I mean dissecting and summarizing data, finding patterns, fitting models, and making visualizations. In brief, it’s the kind of stuff scientists and other researchers4 do when they are analyzing their data. This activity is distinct from data engineering or application development, even if the application does a data-heavy workload.
Data science as I define it here involves a lot of interactive exploration of data and quick one-off analyses or experiments. Therefore, any language suitable for data science has to be interpreted and usable in an interactive shell or in a notebook format. This also means performance considerations are secondary. When you want to do a quick linear regression on some data you’re working with, you don’t care whether the task is going to take 50 milliseconds or 500 milliseconds. You care about whether you can open up a shell, type a few lines of code, and get the result in a minute or two, versus having to set up a new project, write all the boilerplate to make the compiler happy, and then spend more time compiling your code than running it.
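The kind of throwaway regression I have in mind can be sketched in a few lines of NumPy. The data here is synthetic and purely illustrative; in practice you would fit whatever data you already have loaded:

```python
import numpy as np

# Synthetic data, purely illustrative: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(seed=0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=50)

# Least-squares fit of a degree-1 polynomial: returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
```

Typed at an interactive prompt, this takes well under a minute, and that turnaround time, not the milliseconds the fit itself takes, is what matters for exploratory work.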
If we accept that being able to work interactively and with low startup cost is a critical feature of a language for data science, we immediately arrive at scripting languages such as Python, or data-science-specific languages such as R or Matlab or Mathematica. There’s also Julia, but honestly I don’t know enough about it to write about it coherently. For all I know it’s the best possible data science language out there. But I note that some people who have used it extensively have doubts. Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave. This leaves us with R and Python as the realistic choices to consider.5
Before continuing, let me provide a few more thoughts about performance. Performance usually trades off with other features of a language. In simplistic terms, performance comes at the cost of either extra overhead for the programmer (as in Rust) or increased risk of obscure bugs (as in C) or both. For data science applications, I consider a high risk of obscure bugs or incorrect results as not acceptable, and I also think convenience for the programmer is more important than raw performance. Computers are fast and thinking hurts. I’d rather spend less mental energy on telling the computer what to do and wait a little longer for the results. So the easier a language makes my job for me, the better. If I am really performance-limited in some analysis, I can always rewrite that particular part of the analysis in Rust, once I know exactly what I’m doing and what computations I need.
Separating the logic from the logistics
A critical component of not making my job harder than it needs to be is separating the logic of the analysis from the logistics. What I mean by this is I want to be able to specify at a conceptual level how the data should be analyzed and what the outcome of the computation should be, and I don’t want to have to think about the logistics of how the computation is performed. As a general rule, if I have to think about data types, numerical indices, or loops, or if I have to manually disassemble and reassemble datasets, chances are I’m bogged down in logistics.6
To provide a concrete example, consider the dataset of penguins from the Palmer Archipelago. There are three different penguin species in the dataset, and the penguins live on three different islands. Assume I want to calculate the mean and standard deviation of penguin weight for every combination of penguin species and island, excluding any cases where the body weight of a penguin is not known. An ideal data science language would allow me to express this computation in these terms, and it would require approximately as much code as it took me to write this sentence in the English language. And indeed this is possible, both in R and in Python.
Here is the relevant code in R, using the tidyverse approach:
library(tidyverse)
library(palmerpenguins)
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  )

And here is the equivalent code in Python, using the pandas package:
import pandas as pd
from palmerpenguins import load_penguins
penguins = load_penguins()
(penguins
  .dropna(subset=['body_mass_g'])
  .groupby(['species', 'island'])
  .agg(
    body_weight_mean=('body_mass_g', 'mean'),
    body_weight_sd=('body_mass_g', 'std')
  )
  .reset_index()
)

These two examples are quite similar. At this level of complexity of the analysis, Python does fine. I would consider the R code to be slightly easier to read (notice how many quotes and brackets the Python code needs), but the differences are minor. In both cases, we take the penguins dataset, remove the penguins for which body weight is missing, then specify that we want to perform the computation separately on every combination of penguin species and island, and then calculate the means and standard deviations.
Contrast this with equivalent code that is full of logistics, where I’m using only basic Python language features and no special data wrangling package:
from palmerpenguins import load_penguins
import math
penguins = load_penguins()
# Convert DataFrame to list of dictionaries
penguins_list = penguins.to_dict('records')
# Filter out rows where body_mass_g is missing
filtered = [row for row in penguins_list if not math.isnan(row['body_mass_g'])]
# Group by species and island
groups = {}
for row in filtered:
    key = (row['species'], row['island'])
    if key not in groups:
        groups[key] = []
    groups[key].append(row['body_mass_g'])

# Calculate mean and standard deviation for each group
results = []
for (species, island), values in groups.items():
    n = len(values)
    # Calculate mean
    mean = sum(values) / n
    # Calculate sample standard deviation
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std_dev = math.sqrt(variance)
    results.append({
        'species': species,
        'island': island,
        'body_weight_mean': mean,
        'body_weight_sd': std_dev
    })

# Sort results to match the order used by pandas
results.sort(key=lambda x: (x['species'], x['island']))

# Print results
for result in results:
    print(f"{result['species']:10} {result['island']:10} "
          f"Mean: {result['body_weight_mean']:7.2f} g, "
          f"SD: {result['body_weight_sd']:6.2f} g")

This code is much longer, it contains numerous loops, and it explicitly pulls the dataset apart and then puts it back together again. Regardless of language choice, I hope you can see that the version without logistics is superior to the version that gets bogged down in logistical details.7
I will end things here for now. This post is long enough. In future installments, I’ll go over specific issues that make data analysis more complicated in Python than in R. In brief, I believe there are several reasons why Python code often devolves into dealing with data logistics. As much as the programmer may try to avoid logistics and stick to high-level conceptual programming patterns, either the language itself or the available libraries get in the way and tend to thwart those efforts. I will go into details soon. Stay tuned.
1. In terms of languages that are commonly used for data science, I’m only familiar with R and Python, so those are the languages I’ll compare here. There may be some other language you are familiar with that solves all the issues I’m raising. Maybe it’s Julia, or Ruby, or Haskell. Great. If you like it, use it.
2. At least in the way that deep learning is practiced today. In my opinion, the fact that PyTorch (or TensorFlow) code requires us to explicitly manipulate tensors and think about dimensions and what data is stored where suggests that there’s a level of abstraction we haven’t figured out yet. In other data analysis tasks, we no longer have to do these things.
3. The plotting examples I list here are non-issues for students who use plotnine, which I’m now encouraging everybody in my lab to do. But for students who use matplotlib or seaborn, which seem to be much more common choices in the Python community, I’ve never seen a student who could actually, on the fly, modify a plot in a meaningful manner.
4. I’m writing “researchers” in addition to “scientists” because people such as economists or journalists also often do data science, and I don’t think we’d call either type of person a scientist. I think “researcher” is a more general term that can apply to anybody who researches something, regardless of whether it’s science or not.
5. Once upon a time there was Perl, but thankfully everybody agreed Perl was not a great language for anything. Python’s success is in no small part due to being better than Perl at most everything that Perl was good at.
6. This is my main criticism of current deep-learning code that I alluded to in Footnote 2. It’s all logistics. Where is the deep-learning framework that abstracts away all the logistics and allows me to express only the logic of the information flow through the network?
7. Doing the same experiment with only base-R functionality feels like cheating. We can express the entire operation in a single function call:

aggregate(body_mass_g ~ species + island, penguins, \(x) c(mean = mean(x), sd = sd(x)))

This example highlights how powerful R is for data analysis. It also explains one of the main criticisms leveled at the tidyverse by the base-R community: that the tidyverse is overly verbose and is just reinventing concepts that have been available in R since the dawn of time.


