Python is not a great language for data science. Part 2: Language features
It may be a good language for data science, but it’s not a great one.
This is Part 2 of my series on the limitations of Python as a language for data science. You can find Part 1 here. Please read it first if you haven’t done so yet. It provides important context.
I normally find it tedious to discuss suitability of different programming languages for different tasks. All languages we use are Turing complete, and we can solve any problem with any language. And, more importantly, the suitability of a language for a given task is usually more determined by the available software libraries and ecosystem infrastructure than the language itself. Modern programming languages are quite malleable, and you can write efficient and elegant libraries for almost any computing task in almost any language.

At the same time, there are genuine differences between languages, and these differences are frequently expressed in the types of libraries that get written or the types of programming patterns that are commonly used. The differences can be due to specific features of the language, or they can be rooted in how the community thinks about programming and how it tends to approach certain tasks.
To give an example of each case, consider first non-standard evaluation. Python doesn’t have non-standard evaluation, and that’s a genuine limitation of the language, one that leads to convoluted programming interfaces in libraries such as pandas or Polars. On the other hand, consider closures. Python has them, but they are not widely used by Python programmers. The Python community generally leans towards implementing objects instead of closures, whereas the R community does the opposite. This leads to different coding styles that may or may not be advantageous in specific scenarios.1
Here, I want to focus specifically on actual limitations of the language. I will cover community conventions in a later article. The core problems I see with Python as a language for data science are call-by-reference semantics, lack of built-in concepts of missing values, lack of built-in vectorization, and lack of non-standard evaluation. There’s also the issue of Python syntax, but I won’t get into it here. Suffice to say it takes a certain lack of empathy for your fellow human to design a language where whitespace bugs are a thing.
Call-by-reference semantics
Python passes mutable objects by reference (strictly speaking, by object reference, sometimes called call by sharing). This means that when you hand a mutable object to a function, the function can change the object however it wants. You can never be sure that the object hasn’t changed after the function call. What are mutable objects? They are essentially all the non-trivial data structures you are likely to use to store your data, including lists, dictionaries, and any custom classes you may be working with.
To demonstrate this feature, consider the following code example, which attempts to implement a function that takes a list of characters, replaces the first and last with an underscore, and then concatenates all the characters into a string. To a naive Python programmer, the implementation may seem entirely reasonable, but it has the unexpected side effect that it changes the original list that was provided as input.
def mask_ends_and_join(x):
    x[0] = '_'
    x[-1] = '_'
    return ''.join(x)

abc = ['A', 'B', 'C']
print(mask_ends_and_join(abc))
## _B_
print(abc)  # the list has unexpectedly changed
## ['_', 'B', '_']

To demonstrate that an interactive scripting language with dynamic typing doesn’t have to behave in this manner, consider the equivalent in R:
mask_ends_and_join <- function(x) {
  x[1] <- '_'
  x[length(x)] <- '_'
  paste0(x, collapse = '')
}
abc <- c('A', 'B', 'C')
print(mask_ends_and_join(abc))
## [1] "_B_"
print(abc) # the original vector of letters is unchanged
## [1] "A" "B" "C"
I think the latter is much safer behavior. I want my programming language to protect me from silly mistakes such as accidentally modifying variables in the calling environment. I don’t want the language to create trap doors left and right. In fact, I consider call by reference one of the biggest flaws in the Python language. This goes way beyond just data science, because mandatory call by reference creates an entire class of obscure bugs that can be difficult to locate and resolve. Many beginning Python programmers fall into this trap. They write a function like mask_ends_and_join(), experience unexpected side effects, and are left confused, feeling that nothing makes sense. Experienced Python programmers know to make a copy before modifying the list, but the language itself provides absolutely no protection against the programmer forgetting to do so.2
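As a minimal sketch of that defensive-copy idiom, here is the fix an experienced Python programmer would apply to the function above; copying via list() is just one of several equivalent options.

```
def mask_ends_and_join(x):
    x = list(x)  # work on a copy so the caller's list stays untouched
    x[0] = '_'
    x[-1] = '_'
    return ''.join(x)

abc = ['A', 'B', 'C']
print(mask_ends_and_join(abc))
## _B_
print(abc)  # the original list is unchanged
## ['A', 'B', 'C']
```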
In my opinion, this single language feature disqualifies Python for most serious programming projects. How can you build anything that matters in a language with such a gaping safety hole? You may wonder why the language behaves this way in the first place. I consider it the result of a premature optimization. In the 1990s, when Python was first conceived, computers were slow and had little memory, and call by reference for objects was a reasonable strategy for building a scripting language with good performance. But in 2025, I would not want to see this as the default approach to function calling. R uses copy on write, which works great and provides correctness guarantees that Python simply can’t match.3 Alternatively, you could use a strongly typed language that precisely distinguishes between mutable and immutable references, but then you’ve likely left the space of easy-to-use scripting languages suitable for interactive data exploration.
Lack of built-in missing values
Missing values are a fact of life in data science. It’s rare that a dataset does not have any missing values. Yet it’s surprisingly cumbersome to deal with missing values in Python. Python has the None keyword, but it is not well suited to representing missing data values. This is because None has its own type, NoneType, so it can’t stand in for a missing number, a missing boolean, or a missing string. It is simply a generic object signaling the absence of a value. Critically, you can’t do standard computations with None. For example, this code throws an error:
x = [1, 2, None, 4, 5]
[i > 3 for i in x]
## Traceback (most recent call last):
## File "<stdin>", line 1, in <module>
## TypeError: '>' not supported between instances of 'NoneType' and 'int'

The desired behavior, in my opinion, would have been to not error out and instead produce this result: [False, False, None, True, True].
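For illustration, here is a minimal sketch of what explicit None propagation looks like when written by hand; gt_or_none is a hypothetical helper, not part of any library.

```
def gt_or_none(value, threshold):
    # propagate the missing value instead of raising a TypeError
    return None if value is None else value > threshold

x = [1, 2, None, 4, 5]
print([gt_or_none(i, 3) for i in x])
## [False, False, None, True, True]
```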
Because there is no standard way of expressing missing data values in Python, every data-analysis package defines its own missing value. NumPy uses nan, pandas uses NA, and Polars uses null. And these packages are also not consistent in how they perform computations with missing values. Here is what NumPy does:
import numpy as np
x = np.array([1, 2, np.nan, 4, 5])
x > 3
## array([False, False, False, True, True])

And here is what pandas does:
import pandas as pd
x = pd.Series([1, 2, pd.NA, 4, 5])
x > 3
## 0 False
## 1 False
## 2 False
## 3 True
## 4 True
## dtype: bool

And here is what Polars does:4
import polars as pl
x = pl.Series([1, 2, None, 4, 5])
x > 3
## shape: (5,)
## Series: '' [bool]
## [
## false
## false
## null
## true
## true
## ]

In these three cases, in my opinion only Polars handles missing values correctly. Missing values should poison downstream computations, so that you don’t accidentally compute on missing data and get incorrect results. Neither NumPy nor pandas does this. But don’t get your hopes up for Polars. It also doesn’t consistently poison computations with missing values. For example, it simply ignores them when computing sums or means, with no option to alter this behavior.5
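A quick check of that aggregation behavior, assuming current Polars defaults (aggregations skip nulls):

```
import polars as pl

x = pl.Series([1, 2, None, 4, 5])
print(x.sum())   # the null is silently skipped
## 12
print(x.mean())  # mean over the four non-null values only
## 3.0
```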
Lack of built-in vectorization
Vectorization is the ability to perform computations on an entire array of data values at once, rather than one value at a time. It is a common feature in early languages used for scientific computation, such as Fortran or Matlab. It is also the default approach to data manipulation in R.
Today, vectorization is often seen as anachronistic. Few modern languages have support for it at the level of the language itself. One notable exception is Julia, a relatively young language developed specifically for data science. Also, ironically, all of deep learning is built on vectorization. (A tensor is a modern version of a vectorized data type.)
The reason vectorization is frequently not considered critical in modern languages is that the feature can be provided via libraries, using the various extension mechanisms all modern languages possess. And indeed, vectorization in Python is provided through libraries such as NumPy, pandas, or Polars. While this works, I have come to believe that it is not a good strategy for a data-science language. It has a tendency to lead to a bewildering array of different implementations of vector-valued data types. In Python, we have (at a minimum) native lists, which are not vectorized, as well as NumPy arrays, pandas series, and Polars series, all vectorized, and all using slightly different conventions and APIs. The outcome is code that is not composable. Downstream libraries make assumptions about which vectorization framework to use, and they typically cannot work directly with data coming from other frameworks.6 As a working data scientist, you routinely find yourself converting one datatype into another, just to be able to do the exact analysis you want to do.
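To make the conversion shuffle concrete, here is a sketch of the kind of round-trip I mean; the variable names are made up, and the conversion methods shown are the ones I believe each library provides.

```
import pandas as pd
import polars as pl

values_pd = pd.Series([1.0, 2.0, 3.0])  # the data arrives as a pandas Series
values_np = values_pd.to_numpy()        # ...but this function wants a NumPy array
values_pl = pl.from_pandas(values_pd)   # ...and that library wants a Polars Series
values_list = values_pl.to_list()       # ...and somewhere a plain list is required
```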
And even if you make extensive use of a vectorized library, chances are you are also using built-in Python lists, because there’s always some place where a function wants a regular list as input or provides one as a return value. And then you’re stuck having to manipulate those lists. You could convert them into NumPy arrays, do some vectorized manipulations, and convert back, but in practice you’re probably not going to do this. Instead, you’re going to write a list comprehension. So now you’re using two entirely different coding styles at the same time, depending on the data type you’re using to store your vector-valued data.
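A sketch of the two coding styles side by side, for a trivial manipulation (the variable names are made up):

```
import numpy as np

lengths = [180.0, 195.5, 210.0]  # a plain Python list, e.g. a return value

# iterative style forced by the built-in list
scaled_list = [x / 10 for x in lengths]

# vectorized style, available only after converting to a NumPy array
scaled_array = np.array(lengths) / 10
```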
Let’s ponder list comprehensions for a bit longer. They are inherently a functional programming pattern, but the way they are implemented in Python makes them appear as if they were imperative programming. By using the for keyword and emphasizing iteration over a range of values, they constantly nudge you to think in iterative terms even though conceptually they’re closer to a map() than to a for loop. To be clear, I have no objection to list comprehensions. They are a useful feature, in particular when you’re manipulating built-in Python lists that have no vectorization. But they are one more example of Python constantly nudging you to think about the logistics of your data analysis. When you’re writing list comprehensions all day, you’re likely also going to write for loops in other parts of your code, and then you’re back juggling indices and explicitly handling logistics instead of thinking high-level about the logic of data flow in your code.
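The equivalence I have in mind, spelled out in a small sketch:

```
chars = ['a', 'b', 'c']

upper_loopish = [c.upper() for c in chars]      # reads like iteration
upper_functional = list(map(str.upper, chars))  # the same operation as a map()

assert upper_loopish == upper_functional == ['A', 'B', 'C']
```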
Lack of non-standard evaluation
Non-standard evaluation is probably the most important feature for data science that Python lacks. It is a core feature of the R language and the main reason why tidyverse code can be so elegant and concise, and why R could develop its formula interface for specifying statistical models.
What is non-standard evaluation? In brief, it’s the ability to perform computations on the language itself. An R function can capture R code that is provided as an argument and execute it at a later stage in a different environment. This is a critical feature in data analysis. You often want to perform computations involving the various columns in a data frame, or use code to express the exact relationship between different variables in a statistical model. In R, you can express these computations in native R code, for example code that looks as if the columns in a data frame were regular R variables available for computation in your current environment. Combined with vectorization, this makes for extremely concise code.
To demonstrate non-standard evaluation in action, I’ll provide a simple example using the penguins dataset. Let’s calculate a new variable bill_ratio which is the ratio of bill length to bill depth of the penguins, and then sort the resulting data frame in ascending order by island name and in descending order by bill ratio. In R, it looks like this:
library(tidyverse)
library(palmerpenguins)
penguins |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  arrange(island, desc(bill_ratio))

There are two places here where non-standard evaluation comes into play. First, inside mutate(), the calculation of the bill ratio is standard R code that is executed inside the input data frame, with the data columns being available as ordinary R variables. Second, inside arrange(), we use desc(), which flips the sort order for that column from ascending to descending. The desc() function is a bit magical, but for numerical columns you can think of it as simply multiplying the data values by -1.
When we do the same analysis in Python, we don’t have non-standard evaluation available, and so we have to use various workarounds. The pandas package relies on lambda functions:
import pandas as pd
from palmerpenguins import load_penguins
penguins = load_penguins()
(penguins
    .assign(
        bill_ratio=lambda df: df['bill_length_mm'] / df['bill_depth_mm']
    )
    .sort_values(
        ['island', 'bill_ratio'],
        ascending=[True, False]
    )
)

I think it’s obvious that non-standard evaluation helps a lot to keep the code simple and readable.7 Now let’s go a step further. Assume I want to sort by the cosine of the bill ratio. Yes, it’s a made-up example, but it’s exactly the type of question I might ask a student, as described in Part 1 of this series. Instead of descending order, use cosine order. How hard can it be?
With non-standard evaluation, the required modification is trivial and totally obvious. Instead of desc() we write cos(). Done.
penguins |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  arrange(island, cos(bill_ratio))

In Python (specifically pandas which I’m using here, but most other frameworks require similarly awkward coding patterns), without non-standard evaluation, I have to create a temporary column because pandas cannot apply the cosine function to the bill_ratio column on the fly:
import numpy as np
(penguins
    .assign(
        bill_ratio=lambda df: df['bill_length_mm'] / df['bill_depth_mm'],
        cos_bill_ratio=lambda df: np.cos(df['bill_ratio'])
    )
    .sort_values(['island', 'cos_bill_ratio'])
    .drop(columns=['cos_bill_ratio'])  # drop temporary column
)

The amount of additional wrangling code required to perform such a simple task is quite substantial. Now we need to define two lambda functions and a temporary data column. Also, we no longer need the ascending argument, because while there is built-in support for sorting in ascending or descending order, there is no built-in support for sorting in cosine order.
To be fair, the pandas syntax is maybe particularly cumbersome here, and things can look nicer in other frameworks. But the lack of non-standard evaluation always gets in the way in some form. For example, the same code in Polars is a little more concise and we don’t need a temporary column, but the constant need for pl.col() in Polars code can get old pretty fast.
import polars as pl
penguins = pl.from_pandas(load_penguins())
(penguins
    .with_columns(
        bill_ratio=(pl.col('bill_length_mm') / pl.col('bill_depth_mm'))
    )
    .sort(['island', pl.col('bill_ratio').cos()])
)

Non-standard evaluation has been a feature of the R language since its inception, but it has been supercharged in the tidyverse. I would argue that a full understanding of how to use it correctly, with maximum expressiveness while avoiding convoluted code, is a relatively recent development. Important changes were introduced as recently as June 2019. Considering the first ggplot2 release was in 2007, we can see that it took Hadley Wickham and his team over a decade to figure out how to do non-standard evaluation correctly. It is maybe not surprising that these concepts have not yet percolated far beyond their originating language.
Limitations of the R language
To stave off criticism that I’m just an R apologist and Python hater, let me briefly point out some specific flaws I see in the R language. In my opinion, these flaws get in the way of R as a general-purpose language for application development, but they are less relevant for data science.
Most importantly, it bothers me that R does not have any scalar data types. R has taken vectorization to the point where you can’t even have a variable that is not a vector. This makes for awkward programming when you’re trying to deal with individual data values. R code frequently requires special gymnastics to ensure you’re not accidentally feeding a whole vector of values into an expression that expects only a single value.
It’s also annoying that R doesn’t have a single, canonical object-oriented programming system built into the language. The result is that people build their own, and there are many competing options. Off the top of my head, I can think of S3, S4, R6, S7, and some others that are less commonly used. It can be quite confusing trying to figure out which one to choose, and they don’t necessarily have perfect interoperability.
Finally, R uses lazy evaluation of function arguments. This means function arguments are not evaluated when the function is called, but only when and if the function requests the specific value corresponding to an argument. Lazy evaluation is critical for R’s non-standard evaluation framework, but it can lead to weird bugs, in particular when people try to use R in an imperative rather than functional manner. It’s a common source of spurious bug reports for ggplot2 (see, e.g., here or here), and it’s also frequently asked about on StackOverflow.
I am pointing out these limitations of the R language to highlight that any design decision involves tradeoffs. Non-standard evaluation is great for data science, but it requires lazy evaluation, and that is not a good choice for languages used primarily in an imperative manner and/or for standard programming tasks such as application development. There is never going to be a language that does all possible things equally well. And, to circle around to the title of my article series here, for my taste there are too many design choices in Python that are detrimental to efficient and reliable data science, even if these choices are perfectly reasonable for other application areas.
In the next installment of this series, I will look at Python’s limitations due to the available software packages and due to community conventions and commonly used programming patterns. Stay tuned.
1. I’m not arguing here that closures are superior to objects. They are not. Each has their place. I just want to highlight a language feature that exists in Python but is not that widely used by the community.
2. And the problem gets worse when inside the function body you’re using methods to manipulate objects, because whenever you call a method of an object there’s the risk that the method has subtly modified the object, without you knowing or realizing. This can happen in ways that are not at all obvious, such as a method changing some internal state that only rarely matters. The point is you can never be certain an object hasn’t changed state when you call one of its methods.
3. I’m sure somebody is going to bring up performance issues with copy on write. I’ll just say read my comments on performance in Part 1. If performance is critical in your application, you’re probably better off with Rust anyways. And also, it’s difficult for me to imagine many scenarios where performance matters but correctness of results does not.
4. Note a weird aspect of Polars compared to NumPy or pandas: I cannot use the Polars null type to initialize a series holding a missing value. Instead I have to write None.
5. I know this is what SQL does. It doesn’t mean it’s the right choice. Silently ignoring missing values all but guarantees that some data scientist somewhere is arriving at flawed conclusions because they didn’t realize they had missing values in their data.
6. For example, the plotting library plotnine cannot plot Polars data frames without first converting them into pandas format.
7. Also, as an aside, can we reflect for a moment on Python’s need for enclosing parentheses to format the data-manipulation chain nicely? I’ve long found the Python code formatting requirements to be rather frustrating. This is one more example.



I think these arguments are mostly reasonable and I liked the article, but just a few nitpicks in defense of Python (drawing mostly on my experience being a research engineer in a fairly large decent-quality Pandas codebase):
With regard to call-by-reference, it's definitely a footgun for new Python programmers, and my best arguments are mostly that (1) with good code review it shouldn't really be a problem, and most programmers should iron out these bugs within a ~year or less of using the language, and (2) it's not totally unique to Python – Java has basically the same model. That said, I think you're mostly right, and it would be nice if there were a way to remove mutability.
>It also doesn’t consistently poison computations with missing values. For example, it simply ignores them when computing sums or means, with no option to alter this behavior.
Is fill_null/drop_nulls not an option here? Or are you saying that there's no way to have functions like sum/mean throw on nulls and force the user to either fill or drop them without putting the onus on the user to decide that preemptively?
>And even if you make extensive use of a vectorized library, chances are you are also using built-in Python lists, because there’s always some place somewhere where a function wants a regular list as input or provides one as a return value.
Maybe this is a hot take but imo this is probably indicative of poor library design. Once something becomes a dataframe, it should remain that way; going from dataframes->raw python lists should never happen and this should be enforced across the codebase, but if you know of some scenario where this is unavoidable I'd be curious.
>The pandas package relies on lambda functions:
Just to say it, in this one example, you don't actually need the lambda. You can do `df.assign(bill_ratio=df['bill_length_mm'] / df['bill_depth_mm'])` and it will work fine, but in its full generality you do need lambdas to match R's functionality. Also, you don't need the enclosing parentheses to format that – most formatters like ruff or black will do the reasonable thing of doing something like
```
penguins.assign(
...
).sort_values(...)
```
> I have to create a temporary column because pandas cannot apply the cosine function to the bill_ratio column on the fly
in this two variable sort it is certainly a pain, but I do want to highlight that you can use the `key` arg to get similar functionality to R for the one variable case, e.g. `df.sort_values('a', key=np.cos)` is totally valid. For two variables it gets messy since `key` is applied to all the columns individually so you have to check the series name, and at that point what you have is probably cleaner. I agree that Polars handles this much better.
>For example, the same code in Polars is a little more concise and we don’t need a temporary column, but the constant need for pl.col() in Polars code can get old pretty fast
My other hot take is that pl.col is ubiquitous enough that it's reasonable to give it a one-character alias like `c`