I think these arguments are mostly reasonable and I liked the article, but I have a few nitpicks in defense of Python (drawing mostly on my experience as a research engineer in a fairly large, decent-quality pandas codebase):
With regard to call-by-reference, it's definitely a footgun for new Python programmers, and my best defenses are mostly that 1. with good code review it shouldn't really be a problem, and most programmers iron out these bugs within a year or so of using the language, and 2. it's not unique to Python – Java has basically the same model. That said, I think you're mostly right, and it would be nice if there were a way to remove mutability.
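To make the footgun concrete, here's a minimal sketch (the toy data and function names are mine, not from the article):

```python
import pandas as pd

def add_total_column(df):
    # Looks pure, but this writes into the caller's frame in place.
    df["total"] = df["a"] + df["b"]
    return df

orig = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = add_total_column(orig)
print("total" in orig.columns)  # True: the "input" was modified too

def add_total_column_safe(df):
    # The defensive idiom: build a new frame instead of mutating the argument.
    return df.assign(total=df["a"] + df["b"])
```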
>It also doesn’t consistently poison computations with missing values. For example, it simply ignores them when computing sums or means, with no option to alter this behavior.
Is fill_null/drop_nulls not an option here? Or are you saying that there's no way to have functions like sum/mean throw on nulls and force the user to either fill or drop them without putting the onus on the user to decide that preemptively?
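For what it's worth, here's a sketch of what I mean, assuming you wrap the aggregation yourself (`strict_mean` is a hypothetical helper, not part of the Polars API):

```python
import polars as pl

df = pl.DataFrame({"bill_length_mm": [39.1, None, 40.3]})

# Polars silently skips nulls: this returns the mean of the non-null values.
print(df["bill_length_mm"].mean())

# A manual guard that forces the caller to fill or drop nulls first.
def strict_mean(s: pl.Series) -> float:
    if s.null_count() > 0:
        raise ValueError(f"{s.name!r} contains nulls; fill_null() or drop_nulls() first")
    return s.mean()

print(strict_mean(df["bill_length_mm"].fill_null(0.0)))  # fill, then aggregate
print(strict_mean(df["bill_length_mm"].drop_nulls()))    # or drop, then aggregate
```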
>And even if you make extensive use of a vectorized library, chances are you are also using built-in Python lists, because there’s always some place somewhere where a function wants a regular list as input or provides one as a return value.
Maybe this is a hot take, but imo this is probably indicative of poor library design. Once something becomes a dataframe, it should remain that way; going from dataframes to raw Python lists should never happen, and this should be enforced across the codebase. But if you know of some scenario where this is unavoidable, I'd be curious.
>The pandas package relies on lambda functions:
Just to say it: in this one example you don't actually need the lambda. You can do `df.assign(bill_ratio=df['bill_length_mm'] / df['bill_depth_mm'])` and it will work fine, though in full generality you do need lambdas to match R's functionality. Also, you don't need the enclosing parentheses to format that – most formatters like ruff or black will do the reasonable thing and produce something like
```
penguins.assign(
...
).sort_values(...)
```
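And for completeness, a small sketch of both forms, with toy values standing in for the penguins data from the post (the exact numbers are assumed):

```python
import pandas as pd

penguins = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# No lambda needed when the source frame is already bound to a name:
with_ratio = penguins.assign(
    bill_ratio=penguins["bill_length_mm"] / penguins["bill_depth_mm"]
)

# The lambda earns its keep when the expression must see the intermediate
# frame inside a chain (here, after a filter that changes the rows):
filtered = penguins[penguins["bill_depth_mm"] > 17.5].assign(
    bill_ratio=lambda d: d["bill_length_mm"] / d["bill_depth_mm"]
)
```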
> I have to create a temporary column because pandas cannot apply the cosine function to the bill_ratio column on the fly
In this two-variable sort it is certainly a pain, but I do want to highlight that you can use the `key` argument to get functionality similar to R's for the one-variable case, e.g. `df.sort_values('a', key=np.cos)` is totally valid. For two variables it gets messy, since `key` is applied to each column individually, so you have to check the series name, and at that point what you have is probably cleaner. I agree that Polars handles this much better.
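A small sketch of both cases (toy data; the dispatch on `s.name` is the messiness I mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.1, 2.5, 1.3], "b": [3, 1, 2]})

# One sort column: key= works much like R's arrange(cos(a)).
df.sort_values("a", key=np.cos)

# Two sort columns: key is applied to each column separately, so you end up
# dispatching on the series name, which gets ugly fast.
df.sort_values(["a", "b"], key=lambda s: np.cos(s) if s.name == "a" else s)
```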
>For example, the same code in Polars is a little more concise and we don’t need a temporary column, but the constant need for pl.col() in Polars code can get old pretty fast
My other hot take is that pl.col is ubiquitous enough that it's reasonable to give it a one-character alias like `c`.
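Something like this, for instance (assuming the penguins columns from the post):

```python
import polars as pl
from polars import col as c  # the one-character alias

penguins = pl.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

penguins.with_columns(
    (c("bill_length_mm") / c("bill_depth_mm")).alias("bill_ratio")
)
```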
Regarding missing values, one experience I had just last week: A student was calculating standard deviations across many subsets of data and then filtering based on the value of the standard deviation. I asked him what happened when a subset had only one observation and the standard deviation was not defined. His response was that he hadn’t thought about it, and he wasn’t sure, but the code didn’t throw any errors. The reason the code didn’t throw errors is of course because in pandas a comparison to a missing value is simply False. In R, he would have had missing values in his final results, and it would have prompted him to look more carefully where they came from and what to do about them.
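A minimal sketch of that situation (made-up data and names):

```python
import pandas as pd

scores = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 5.0]})

# Group "b" has a single observation, so its sample standard deviation is NaN.
sd = scores.groupby("group")["value"].std()

# Comparing NaN against a threshold silently yields False, so the undefined
# group just vanishes from the filtered result instead of raising a flag.
print(sd[sd > 0.5])  # only group "a" survives; nothing hints that "b" was dropped
```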
Regarding mixing lists with arrays, I was thinking of a data-analyst end user who is mixing the different libraries required to achieve a given goal. I agree that within a single library it would be poor design to mix data representations.
Thanks for pointing out ways to streamline pandas code. I’m not that familiar with how it works. For this post, I mostly generated examples using various LLMs and kept iterating until the code seemed sufficiently compact and minimal. Most code versions I got were much more convoluted!
Yeah, I think that's totally reasonable, and I see the argument for more hard failures with respect to missing values. That said, there is definitely a trade-off being made, and there are examples where silently dropping NAs or making NA comparisons false simplifies the code. For example, if you had a few students take multiple quizzes and wanted to check the average score on the quizzes, but some students missed some quizzes, you could have a dataset like
```
    Quiz1  Quiz2  Quiz3
S1     80     83     75
S2     90     NA     88
S3     NA     70     40
S4     96    100     NA
```
If you wanted to, say, get the average score on each quiz, and .mean() threw on NA values, then you'd have to be careful when dropping NAs, because you don't want to drop rows that are mostly valid. So you'd have to do `df.fillna(0).sum() / df.count()`, whereas with the existing pandas behavior you can just do `df.mean()` and it'll do what you want – recreating `df.std()` if it threw on NAs would be even more cumbersome!
There is also an argument that because the behavior is unsafe, dumb errors occur more frequently, and junior programmers/researchers are forced to think more carefully about what they want to do with their data by making such errors earlier in their careers rather than later (provided they have correction mechanisms to spot the dumb errors, like you in the case of your story). For example, I could imagine someone uncritically calling dropna or fillna(0) in a world where pandas functions throw on NA values – because that is the correct thing to do in many cases – but such a habit would create much harder-to-spot bugs further down the line (e.g. imagine calling df.fillna(0).mean() in the above example with hundreds of students and very few NAs – wrong, but hard to detect!). I'm not married to this argument, but I think there is some basis for it – it's similar to Bjarne Stroustrup's quote about C vs C++, which he said has a more general analogue: "As you protect people from simple dangers, they get themselves into new and less obvious problems."
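To make both points concrete, a sketch using the quiz table above (NaN standing in for NA):

```python
import numpy as np
import pandas as pd

quizzes = pd.DataFrame(
    {
        "Quiz1": [80, 90, np.nan, 96],
        "Quiz2": [83, np.nan, 70, 100],
        "Quiz3": [75, 88, 40, np.nan],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Current pandas behavior: NAs are skipped, so each quiz is averaged over
# the students who actually took it.
print(quizzes.mean())

# What you'd have to write if mean() refused to touch NAs:
print(quizzes.fillna(0).sum() / quizzes.count())  # same result, more ceremony

# The tempting-but-wrong shortcut in that world: filling with 0 drags down
# the average of every quiz that anyone missed.
print(quizzes.fillna(0).mean())
```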
Yeah, that's why, regardless of what the default behavior is when encountering missing values, there should be an option to change that behavior. See e.g. the na.rm argument in R's `mean()` function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html
These kinds of arguments exist all throughout the R API. I find it weird that Polars doesn't have them (pandas at least has a `skipna` argument on its reductions). I don't think there's any technical reason not to implement them.
LLMs are terrible at producing sensible code. They err on the side of being confusing and verbose.
If you use an LLM to generate Python and compare that code to the native, fluent R code that you (an experienced human) might write, you will draw distorted conclusions.
I have been using Python for years. The call-by-reference issue seems like a non-issue to me, and more a language style that you personally happen to really dislike.
Good point on the lack of a standard NaN mechanic. I did not expect each package to handle it so differently, and honestly did not expect numpy's handling.
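For anyone who hasn't hit this yet, a small illustration of the inconsistency:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, np.nan, 3.0])

print(np.mean(x))           # nan  -- NumPy propagates missing values
print(np.nanmean(x))        # 2.0  -- a separate function is needed to skip them
print(pd.Series(x).mean())  # 2.0  -- pandas skips them by default
```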
Python programmer here. Call by reference is annoying but, as others pointed out, many Python programmers develop the instinct to avoid it (although it remains a danger). I also agree with the points you made about plotting; Python is much clunkier than R in this particular domain. In general it's a general-purpose language first and a data science language by adoption, and it shows.
I have to say, though, to stay on the meat of the article, that I am much more of a fan of pandas verbosity than of R's evaluation style. Python forces the programmer to be explicit about what they do, which in turn means it is easier later on to understand what the intent was. This is not a theoretical musing: in 2019 I undertook a project to convert an R program into a Python one using Dask (a pandas-like library optimised for memory-intensive applications – the one I'm thinking of was breaching one terabyte of RAM, with a T). The original scientists absolutely loved chaining operations with non-standard evaluation using one-letter variables inside; it took me months to figure out what the heck the program was meant to do at every step. Give me programs where the programmer had to stop and think about the logistics and readability...
The call-by-reference issue really hits home. I've spent countless hours debugging code where a function unexpectedly modified my data. R's copy-on-write approach is definitely safer, though I understand the performance tradeoffs from the 90s. The lack of non-standard evaluation really stands out when working with data pipelines; having to constantly wrap everything in lambdas or create temp columns makes the code feel much less elegant.
It's one of the more inexcusable issues with Python. The rationale behind the idea seems simple enough: pass by value or by reference implicitly, based on what you'd probably do explicitly in C++ – Python just uses what would be your default choice. And that's fine if your user base consists of people who cut their teeth on C++ or some other "hard" language. It's not such a good idea for people whose first, and sometimes only, language is Python.
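The actual semantics are closer to "call by object reference": the reference is copied, so rebinding inside a function is invisible to the caller, while mutating the shared object is not. A quick illustration:

```python
def rebind(x):
    x = [0, 0]        # rebinds the local name only; the caller never sees this

def mutate(x):
    x.append(0)       # mutates the object both names point to

a = [1, 2]
rebind(a)
print(a)              # [1, 2]
mutate(a)
print(a)              # [1, 2, 0]
```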
It's a case, and not the only one, where I think Python would be easier if it actually made what it does a bit more explicit.
Mojo, which is based on Python, has syntax for making call by value vs. call by reference explicit. I think it's still something of a toy language, but the concept is viable.
Perhaps it's time for someone to develop a Python++ (preferably with a better name) fork.
I've been enjoying the R/Python series. I'm more in the R camp for data analysis, but I've picked up some Python for employability reasons. Do you have any thoughts on Git for data analysis?
Not sure what exactly you're asking. We do all our work in git repositories for reproducibility. See for example: https://github.com/ziul-bio/SWAT
Overall very good points. One major pain point in R is the lack of a proper module/namespace structure (which Python does extremely well) and the lack of proper exception/traceback reporting without jumping through extra hoops (which, again, Python does very well).
This article comes at the perfect time, offering such an insightful breakdown of Python's language features. Could you elaborate on how exactly the lack of non-standard evaluation impacts the design of data structures in pandas, beyond just the interface?
I was a big R fan for many years, and I still boot it up sometimes when it's unavoidable, but my new job is with Python diehards and they're slowly converting me. I agree with Claus's points about NSE being a core feature of R. Not having it means not just writing a lot of lambdas and temp columns, but also typing a LOT more quotes, brackets, and parentheses, ALL THE TIME. This is still a frustration I have with Python, but at least Polars is much better than pandas here.
I will also say to anyone who likes R and ggplot2, and must work in Python, that the lets-plot package is my new favorite Python plotting library: https://lets-plot.org and https://github.com/JetBrains/lets-plot. I know there have been many efforts to port "grammar of graphics" functionality to Python over the years, but none of them were intuitive enough for me to ever use seriously. But lets-plot seems (a) very faithful to the design of ggplot2, (b) to be incorporating the functionality of many common ggplot2 "extenders" like ggpointdensity and ggrepel into the main package, and (c) to have built-in support for custom mouseover "tooltips". Plus the developers are responsive to bug reports; two of mine have already been fixed. It's so much better than matplotlib.
It's more or less cut my list of objections to Python from two ("bad graphing", "too many brackets/lambdas") to one.
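For the curious, lets-plot code reads very much like ggplot2. A minimal sketch with toy data (I'm going from memory on the API, so check the docs for details):

```python
import pandas as pd
from lets_plot import LetsPlot, ggplot, aes, geom_point

LetsPlot.setup_html()  # needed in notebook environments

df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "species": ["Adelie", "Adelie", "Gentoo"],
})

# Same grammar as ggplot2: data + aesthetic mapping + geoms added with "+".
ggplot(df, aes("bill_length_mm", "bill_depth_mm", color="species")) + geom_point()
```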
Thanks for pointing out lets-plot. I wasn't aware of it. Do you know plotnine? Do you know how it compares to lets-plot?
I'm somewhat wary of these one-to-one ports of ggplot2, because ggplot2 has some architectural weaknesses that unfortunately are now baked in but that we could do better without, and these exact ports copy those weaknesses instead of starting from a blank slate and fixing all the issues that are now known. Also, no matter how hard the authors of these packages work, they will always be missing some obscure ggplot2 feature that you absolutely need for that one plot. My lab now does a lot with plotnine, but in the end we port the code to R for final publication-quality figures, because we want to make certain tweaks that are not supported in the Python version.
I have heard of plotnine but have not used it very much at all. In fact I forgot it was also ggplot2-inspired. Maybe it is just as good or better than lets-plot. I should give it a look!
I'm surprised no one's mentioned the lack of private functions or variables in Python classes. That always struck me as asking for trouble.
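For readers who haven't bumped into this: Python has naming conventions and name mangling, but no enforced access control, so "private" is purely advisory. A quick illustration (toy class, names are mine):

```python
class Account:
    def __init__(self, balance: float) -> None:
        self._balance = balance    # single underscore: "private" by convention only
        self.__token = "secret"    # double underscore: name-mangled, not truly hidden

acct = Account(100.0)
print(acct._balance)               # nothing stops outside code from reading or writing this
print(acct._Account__token)        # name mangling is trivially bypassed
```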