Discussion about this post

Aman Karunakaran:

I think these arguments are mostly reasonable and I liked the article, but here are a few nitpicks in defense of Python (drawing mostly on my experience as a research engineer in a fairly large, decent-quality pandas codebase):

With regard to call-by-reference, it's definitely a footgun for new Python programmers, and my best arguments in its defense are mostly that (1) with good code review it shouldn't really be a problem, and most programmers iron out these bugs within a year or so of using the language, and (2) it's not totally unique to Python – Java has basically the same model. That said, I think you're mostly right, and it would be nice if there were a way to enforce immutability.
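For anyone newer to the language, here's a minimal sketch of the footgun in question (the function and data are made up for illustration):

```
# The callee receives a reference to the caller's list, so in-place
# changes leak back out of the function.
def add_bonus(scores):
    scores.append(100)   # mutates the caller's list
    return scores

original = [90, 85]
updated = add_bonus(original)
print(original)  # [90, 85, 100] -- the "original" changed too

# A defensive variant copies first, leaving the caller's data untouched.
def add_bonus_safe(scores):
    scores = list(scores)  # shallow copy
    scores.append(100)
    return scores
```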

>It also doesn’t consistently poison computations with missing values. For example, it simply ignores them when computing sums or means, with no option to alter this behavior.

Is `fill_null`/`drop_nulls` not an option here? Or are you saying that there's no way to have functions like `sum`/`mean` throw on nulls and force the user to either fill or drop them, rather than putting the onus on the user to decide that preemptively?
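To spell out what I mean, here's a sketch assuming Polars (which is where `fill_null`/`drop_nulls` live):

```
import polars as pl

s = pl.Series("x", [1.0, None, 3.0])

# Aggregations skip nulls silently:
print(s.sum())   # 4.0
print(s.mean())  # 2.0

# Making the handling explicit up front:
print(s.fill_null(0.0).sum())   # 4.0, but the choice is now visible in the code
print(s.drop_nulls().mean())    # 2.0, likewise explicit

# If you want a hard failure instead, you currently have to check yourself:
if s.null_count() > 0:
    raise ValueError("column x contains nulls")
```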

>And even if you make extensive use of a vectorized library, chances are you are also using built-in Python lists, because there’s always some place somewhere where a function wants a regular list as input or provides one as a return value.

Maybe this is a hot take, but imo this is probably indicative of poor library design. Once something becomes a dataframe, it should remain that way; going from dataframes to raw Python lists should never happen, and this should be enforced across the codebase. If you know of some scenario where this is unavoidable, I'd be curious.
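In practice this mostly means writing helpers that take and return Series/DataFrames rather than lists. A rough sketch, with a made-up `normalize` helper:

```
import pandas as pd

df = pd.DataFrame({"bill_length_mm": [39.1, 39.5], "bill_depth_mm": [18.7, 17.4]})

# Instead of dropping out of the vectorized world...
lengths = df["bill_length_mm"].tolist()

# ...keep helpers Series-in, Series-out so the data never becomes a raw list:
def normalize(col: pd.Series) -> pd.Series:
    return (col - col.mean()) / col.std()

df["bill_length_z"] = normalize(df["bill_length_mm"])
```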

>The pandas package relies on lambda functions:

Just to say it, in this one example you don't actually need the lambda. You can do `df.assign(bill_ratio=df['bill_length_mm'] / df['bill_depth_mm'])` and it will work fine, though in full generality you do need lambdas to match R's functionality. Also, you don't need the enclosing parentheses to format that – formatters like ruff or black will do the reasonable thing and produce something like

```
penguins.assign(
    ...
).sort_values(...)
```
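Spelled out with a small made-up stand-in for the penguins frame, the lambda is only needed when the intermediate frame has no name of its own, e.g. when referring to a column created earlier in the same chain:

```
import numpy as np
import pandas as pd

penguins = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# No lambda needed: the source frame already has a name.
out = penguins.assign(
    bill_ratio=penguins["bill_length_mm"] / penguins["bill_depth_mm"]
)

# Lambda needed: bill_ratio does not exist on `penguins` itself.
out = penguins.assign(
    bill_ratio=penguins["bill_length_mm"] / penguins["bill_depth_mm"]
).assign(
    log_ratio=lambda df: np.log(df["bill_ratio"])
)
```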

> I have to create a temporary column because pandas cannot apply the cosine function to the bill_ratio column on the fly

In this two-variable sort it is certainly a pain, but I do want to highlight that you can use the `key` arg to get functionality similar to R's in the one-variable case, e.g. `df.sort_values('a', key=np.cos)` is totally valid. For two variables it gets messy, since `key` is applied to each column individually, so you have to check the series name, and at that point what you have is probably cleaner. I agree that Polars handles this much better.
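A sketch of both cases, with a throwaway frame:

```
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.5, 3.0], "b": [2, 1, 3]})

# One-column case: `key` gives the sort-by-transformed-value behavior.
df.sort_values("a", key=np.cos)

# Two-column case: `key` is called once per column, so you end up
# dispatching on the Series name, which gets ugly fast.
df.sort_values(["a", "b"], key=lambda s: np.cos(s) if s.name == "a" else s)
```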

>For example, the same code in Polars is a little more concise and we don’t need a temporary column, but the constant need for pl.col() in Polars code can get old pretty fast

My other hot take is that `pl.col` is ubiquitous enough that it's reasonable to give it a one-character alias like `c`.
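For example (just an import alias, not a special Polars feature):

```
import polars as pl
from polars import col as c  # the one-character alias

penguins = pl.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# pl.col everywhere...
penguins.with_columns(
    (pl.col("bill_length_mm") / pl.col("bill_depth_mm")).alias("bill_ratio")
)

# ...versus the terser alias:
penguins.with_columns(
    (c("bill_length_mm") / c("bill_depth_mm")).alias("bill_ratio")
)
```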
