40 Comments
Trevor Freeman:

I think R+tidyverse is marginally nicer for basic data cleaning and plotting. Although for me Polars+Plotnine has made the gap pretty insignificant, and I have well over 5 years experience with R+tidyverse over Polars+Plotnine, so it's hard to directly compare.

The problem with R is that doing anything besides chaining together a bunch of tidyverse functions in a notebook absolutely sucks. Every time I go to write a little module that I want to be reusable and portable and tested and that handles errors I want to pull my hair out. Little things like implementing a command line interface for a script or two are so much nicer in Python that I think it's well worth giving up the marginal benefits of the tidyverse. Don't even get me started on basic utilities like logging, implementing custom classes, built-in support for modules and virtual environments, native assertions for zero effort sanity checks in the code, not having the entire developer experience revolve around RStudio/Positron, and so many more.
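On the CLI point, here is a minimal sketch of what this looks like with Python's stdlib `argparse` (the script purpose, flag names, and defaults are made up for illustration):

```python
# Hypothetical CLI for a small data-cleaning script, stdlib only.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Summarize a CSV file.")
    parser.add_argument("infile", help="path to the input CSV")
    parser.add_argument("--group-by", default="species",
                        help="column to group on (default: species)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable debug logging")
    return parser

# Parse an explicit argument list (instead of sys.argv) for demonstration.
args = build_parser().parse_args(["penguins.csv", "--group-by", "island", "-v"])
print(args.infile, args.group_by, args.verbose)
```

Help text, type conversion, and error messages all come for free; the closest R equivalents (optparse, argparser) are add-on packages.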

I don't want to come down too hard on R. I still use it almost daily, and as a language for statistical analysis I think it's completely unparalleled. But as someone who also has to frequently write "real software" to get some data science related parts of my job done my chest gets tight when I realize a core part of what I need to write has to be done in R.

Claus Wilke:

I'm using R as comparison because that's the only alternative we have but ideally we'd have a language/framework that provides the best of both worlds. I don't need R to win. I just want my day-to-day work to be easier. And I want students in my office to be able to do data manipulations on the fly.

Trevor Freeman:

IMO Polars+Plotnine is the best of both worlds. I'm not really sure why we shove all the Python people into Pandas+matplotlib by default. I guess it's because those are common frameworks with a lot of momentum historically. But I totally agree, their APIs are a hindrance for data exploration.

Daniel Morton:

For straight data manipulation nothing beats SQL. At least the sort of data manipulation SQL is good at. Doing something in SQL is either easy or heinous; there's no middle ground.

If we add SQL to the mix with Python and R, I think I see a pattern. There seems to be a tradeoff between being a good analytics language and being a good general-purpose language. SQL is great at the sort of data wrangling it's good at, but it's an old dog that can't learn new tricks. R is specifically designed for stats, so it has better syntax for anything stats/DS related, but is pretty useless for any other CS application. This, and R's infamous memory issues, are enough to hinder general adoption. Python is a general-purpose, dynamically typed language that is good enough, and fast enough, for mathematical calculations that a whole stats/DS ecosystem was written for it. Since Python is "good enough" at data applications and has uses outside the data world, industry naturally gravitates towards it.

Until someone develops a language with R-like syntax and the ability to compete with Python outside the stats world, we're probably stuck with the snake.

Rocks for Jocks:

Agree R is definitely easier! Especially for first-time learners.

Curious how everyone got convinced that Python is easier.

It’s definitely better for certain tasks, such as machine learning or building web applications, but not easier.

Trevor Freeman:

I think this varies from person to person. Ironically I find that students who have a background in other programming languages really tend to struggle at picking up R.

John Doe:

The comparison goes both ways.

Python’s standard library covers logging, OOP, modules, etc., but base Python can’t do any data science at all.

Base R, on the other hand, has dataframes, stats, models, ... built in.

Both languages rely on packages, and R has great packages for everything you listed. R and Python just have different things built in.

So to me your comment seems like a non-issue?

Trevor Freeman:

IMO it’s a lot easier to add DS packages on top of a language with a lot of features than it is to add packages for basic language features. But I guess that’s a matter of opinion.

Daniel Morton:

I switched from R to Python a decade ago and I've never really regretted it. R's memory issues are something of a dealbreaker for me. The fact that it doesn't exist outside the stats/DS world is something of a problem as well.

That said, matplotlib is terrible. So much so that I'm not sure it's even using Python properly. There's really no reason Python can't have a data vis package more like ggplot. Except that matplotlib already exists and no one has the time to reinvent that particular wheel.

Sklearn has its problems (I'm pretty sure the backend design violates several software architecture principles) but they don't affect daily life.

Pandas syntax is eccentric, and not in a good way. Numpy is okay, as is PyTorch, but they're their own ecosystems, so different rules apply.

And really I don't like Python that much. Something about it just feels cheap. It makes it far too easy to write bad, unstable code, and has spawned far too many bad coders. It's fine for mucking about, but for making something that lasts give me a good statically typed language any day.

Marcos H:

My plotting needs are very basic so I use plotly for almost everything. I tried out Altair lately when I needed a shaded error region which plotly doesn't support. I never got into ggplot so I use plotnine only occasionally.

Any of these are so much better than matplotlib - I am puzzled why anyone uses it too!

gmdbioinformatics:

I agree with the general sentiment - I also come from an R background, and struggled to move to python. I have a draft for a similar article somewhere, that I never published.

However, for machine learning, Python is just better supported. I do a lot of Kaggle competitions, and it is very rare to see R notebooks there. I've tried h2o, tidymodels, and similar libraries, but they are so far behind the Python counterparts. And they are also much slower and more memory-consuming, as R really struggles with big datasets.

Claus Wilke:

Not going to disagree, and in fact I believe I expressed a similar thought in my article.

gmdbioinformatics:

For example, in the code you posted:

```
library(tidyverse)
library(palmerpenguins)

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  )
```

```
import pandas as pd
from palmerpenguins import load_penguins

penguins = load_penguins()

(penguins
    .dropna(subset=['body_mass_g'])
    .groupby(['species', 'island'])
    .agg(
        body_weight_mean=('body_mass_g', 'mean'),
        body_weight_sd=('body_mass_g', 'std')
    )
    .reset_index()
)
```

- I like that R has a specific function for filtering (filter()). It seems more elegant than Python, which has a specific method just for removing NAs.

- I really like that in R you don't have to add quotes around column names

- `group_by` and `summarise` are much better function names than `groupby` and `agg`.

- pandas indexes still confuse me all the time

That being said... I tried many times using R notebooks in Kaggle, but it is just too slow.

And there is also the issue of installing packages - since there are no precompiled binaries for R packages on Linux, compiling R packages in a Docker image takes ages.

Claus Wilke:

Posit provides binary packages for Linux, as far as I know. Most people use them for Github Actions on Linux. See for example this log:

https://github.com/tidyverse/ggplot2/actions/runs/19065695259/job/54455657142

It took about 3 min. to set up all dependencies for ggplot2, completely from scratch.

(Compiling all dependencies from source used to be a huge issue for continuous integration for Posit, that's why they started providing binary packages I believe. I still remember when ggplot2 checks could take an hour or so and most of it was compiling Rcpp and similar.)

Jordi:

I use R and python in Kaggle. Sometimes it is better to install pak.

John Doe:

See mlr3, which is arguably more powerful than sklearn.

gmdbioinformatics:

Thanks, will give it a try.

Oleh Omelchenko:

Regarding dataviz options for Python - admittedly, matplotlib’s ubiquity is a great hindrance to the perceived variety of options for data visualization in Python (along with the fact that it is still the default option in many curricula). However, there are strong alternatives, among them plotnine (already mentioned in another comment) and Altair (a wrapper over Vega-Lite, which uses its own grammar of graphics dialect and is my tool of choice).

As for data wrangling, I’ll preface my remarks by admitting that neither R nor Python/Pandas is my language of choice for EDA (SQL is). However, I’m not entirely convinced by the “logistics” argument. Code for basic data wrangling in plain Python will surely seem cumbersome versus R, but AFAIK the tidyverse isn’t part of standard R either.

Alex:

R’s support for non-standard evaluation (NSE) is the biggest thing that makes it lower friction for interactive data analysis. The tidyverse packages use NSE all over the place. It enables you to use column names in your code without quoting them in a string, and eliminates most of the repetitive “df” stuttering that is required in Pandas code.

Python doesn’t have language-level support for this, which is probably a correct decision for a general-purpose language. At the same time, it makes it impossible for Python packages to provide the same affordances for interactive use that you get in R.
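For concreteness, a small pandas sketch of the stuttering (the data here is made up): the frame name repeats inside the bracket expression, while `.query()` offers a limited, string-based approximation of unquoted column names.

```python
import pandas as pd

# Hypothetical data just to illustrate the point; any DataFrame works.
df = pd.DataFrame({"species": ["Adelie", "Gentoo", "Gentoo"],
                   "body_mass_g": [3750, 5000, 5200]})

# Plain pandas: `df` appears twice for a single filter condition.
heavy1 = df[df["body_mass_g"] > 4000]

# .query() resolves bare column names inside the string, which is about
# as close as pandas gets to tidyverse-style unquoted columns.
heavy2 = df.query("body_mass_g > 4000")

print(heavy1.equals(heavy2))
```

The catch is that the query string is opaque to linters and type checkers, which is part of why string-based mini-languages never fully substitute for language-level NSE.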

Eurydice:

Thank you for writing this! The difference between Python and R is starker even than I expected, despite some familiarity with both. I'll have to try plotnine more often (or R, I suppose), because I'm absolutely more familiar with both matplotlib and seaborn. Looking forward to a post on deep learning and the advantages of PyTorch someday!

Claus Wilke:

PyTorch is the only game in town. There you go. Article done. ;-)

Neural Foundry:

Your point about the tediousness of Python for exploratory data analysis resonates with me. The tidyverse in R has been a game changer for interactive work. However, I do think pandas has improved significantly, and tools like Polars are starting to bridge the gap. Looking forward to your next post detailing the specific limitations.

Tony:

I’m mostly python and dabble in R. I think figures in R look great and make a bit more sense coding wise.

That said, my main coding experience is with Python, using pandas/numpy/matplotlib/seaborn. They get the job done because I’m more familiar with Python, but there are times I get frustrated, as making the visualization can take longer. I should definitely try plotnine (it's on my long list of todos).

Depending on your data and goals, I would also recommend SQL to manipulate tables. It can work very well with Python, for example by installing the DuckDB Python package; or, depending on your IDE, SQL can be baked in and quickly integrated into your workflow.
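As a sketch of the SQL-from-Python pattern, here is the penguins summary with the stdlib sqlite3 module (zero installs; the rows are made up, and DuckDB's Python API follows the same connect-and-query shape):

```python
import sqlite3

# In-memory database with a few hypothetical penguin rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, island TEXT, body_mass_g REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?, ?)",
    [("Adelie", "Torgersen", 3750), ("Gentoo", "Biscoe", 5000),
     ("Gentoo", "Biscoe", 5200), ("Adelie", "Torgersen", None)],
)

# The grouping/aggregation is one declarative statement.
rows = conn.execute("""
    SELECT species, island, AVG(body_mass_g) AS body_weight_mean
    FROM penguins
    WHERE body_mass_g IS NOT NULL
    GROUP BY species, island
""").fetchall()
print(rows)
```

With DuckDB the same query can also run directly against a pandas or Polars DataFrame, which is what makes it so convenient for mixed workflows.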

Carl:

Hooray for R! There are so many things I like about R, but since it wasn't mentioned: for data wrangling my money is still on the great `data.table`, and for large data sets with tens to hundreds of millions of rows `polars` (available for R and Python) does wonders, though it is also really verbose and kind of clunky to write.

```
library(data.table)
library(palmerpenguins)

penguins_dt <- as.data.table(copy(penguins))

penguins_dt[
  !is.na(body_mass_g),
  .(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  ),
  by = .(species, island)
]
```

Ian Campbell:

I’ve been a full time software engineer for nearly a decade and I recently finished a masters of data analytics and I have a few thoughts on this.

First, I could be mistaken, but you’ve actually said very little about the languages themselves as much as their ecosystems. Pandas isn’t Python. It may seem like splitting hairs and there are times that languages are defined by their ecosystems as much as their own design (such as the coupling of Ruby and Rails), but I think it’s worth mentioning because the libraries may change far faster than the language does.

My second point, though, is that Python is the default language for anybody who is learning programming for the sake of working with data. This means that many Python programmers only know Python and are more familiar with Python libraries than anything else. They learn to interact with data in the ways which are easiest in their lingua franca and, as they only know how to use a hammer, everything is a nail.

In software engineering this phenomenon is very common with Java as well, since it’s the de facto teaching language (though Python is giving it a run for its money in many programs) and is commonly used in enterprise. It’s not that Java code is particularly bad because of Java (though some will fight to the death over that), often it’s because Java code is written by amateurs with little experience in other programming paradigms and working with other tools.

Claus Wilke:

This is going to be a series of essays. The next one will specifically talk about language features. The one after about libraries and ecosystems. Then I'll probably also write one about history, how did we get here, and one about future, where could we go from here.

Aman Karunakaran:

I don't necessarily disagree with the thrust of this article (I personally only think Python is worth it over R/Excel if your data analysis interacts with lots of other readily-available Python library code you've written), but I'm not sure I understand the point you're trying to make in comparing non-pandas Python to R. R was specifically designed for data analysis while Python is meant to be general purpose, so it makes sense that the data analysis snippet would look more cumbersome in raw Python?

Perhaps this is a set up for your future parts, as alluded to by

>In brief, I believe there are several reasons why Python code often devolves into dealing with data logistics.

in which case, sure, I'm curious to see what your future examples will be.

As an aside: thanks for the heads up on plotnine! I've been coding in Python for years and doing dataviz with seaborn/mpl, and I'm really surprised no one at my previous company used plotnine. From a brief look, it seems really nice.

Claus Wilke:

My primary goal was just to show what I mean by code that focuses on logistics. It was not meant as a comparison to R. (In general, I'm using comparisons to R primarily for specific examples of how things can be better. That doesn't mean R as a whole is better. It has its own issues.)

Darby Saxbe:

ok, now do SPSS

Claus Wilke:

$99 per month, haha. Amazed that there are still people around who pay this kind of money for software that has no unique capabilities not available otherwise for free.

https://www.ibm.com/products/spss-statistics#pricing

Darby Saxbe:

It’s terrible but it’s what I learned on, so it feels really intuitive to me. I’ve been trying to switch fully to R for years. (At least my university pays for it)

eric:

and in DuckDB, it's just SQL:

```
FROM penguins
SELECT
    species,
    island,
    body_weight_mean: mean(body_mass_g),
    body_weight_sd: stddev(body_mass_g)
GROUP BY ALL;
```

mmmmmm:

I used R for many years and have been doing Python mostly for ~3 years. It's somewhat hard to compare because I'm still a stronger R programmer than a Python programmer, but I truly believe that R/tidyverse is so much more elegant. (That's if you are only doing data analysis and plotting; outside of that it's more complicated, and Python is often the right answer.)

But also: Tidyverse > Pandas > Base R

Jonas:

R + tidyverse is great and concise - but all your work is wasted, if you run into memory limits.

Python for non-ML data science stuff sucks because of matplotlib. However, “Seaborn Objects” fixes 95% of these issues (not to be confused with “Seaborn”, which is simply more matplotlib hell). Seaborn Objects is similar to ggplot2 in that it doesn’t force you to conform to some pre-defined structure: you just plot and change your plot as needed (it will handle the underlying data transformations). And therefore it is almost as concise as ggplot2, for example:

```
import seaborn.objects as so

(df
    .pipe(so.Plot, x="datetime", y="sales")
    .facet("county")
    .add(so.Bar())
)
```

Python for non-ML data science stuff is held back by Pandas. Don’t get me wrong, Pandas is a great and useful library. But it is not concise, does not support lazy evaluation, and like R it is limited by memory. You can seldom comprehend the full dataset, so you end up using df.head(). You seldom need to apply a transformation to the full dataset to verify its correctness. But Pandas (and other libraries without lazy evaluation) nonetheless forces you to transform and store the full dataset in memory (wasting your precious memory and your time). PySpark is lazy and only applies the transformation to the first 5 rows if you use df.transform(some_function).limit(5).display(). You can postpone large/slow data transformations until the absolute last minute, until absolutely necessary.

PySpark (SQL) is not quite as concise as tidyverse, but close:

```
from pyspark.sql.functions import col, mean, stddev

df = spark.read.csv("penguins.csv", header=True, inferSchema=True)

(df
    .filter(col("body_mass_g").isNotNull())
    .groupby("species", "island")
    .agg(
        mean("body_mass_g").alias("body_weight_mean"),
        stddev("body_mass_g").alias("body_weight_sd")
    )
)
```

And you can combine PySpark and Seaborn Objects:

```
(df
    .filter(col("body_mass_g").isNotNull())
    .groupby("species", "island")
    .agg(
        mean("body_mass_g").alias("body_weight_mean"),
        stddev("body_mass_g").alias("body_weight_sd")
    )
    .toPandas()
    .pipe(so.Plot, x="species")
    .pair(y=["body_weight_mean", "body_weight_sd"])
    .facet("island")
    .add(so.Dot())
)
```

Matplotlib and Pandas are useful because of their wide support, but in my opinion their API/syntax slows down non-ML work needlessly. I don’t think I would have discovered this, had it not been for R + tidyverse.

Hernan:

Agree with the general sentiment. To be completely fair, we could compare the code needed to do this task in base R, using aggregate(), for instance.

Hernan:

oops! Pardon me!
