40 Comments
Trevor Freeman:

I think R+tidyverse is marginally nicer for basic data cleaning and plotting. Although for me Polars+Plotnine has made the gap pretty insignificant, and I have well over 5 years experience with R+tidyverse over Polars+Plotnine, so it's hard to directly compare.

The problem with R is that doing anything besides chaining together a bunch of tidyverse functions in a notebook absolutely sucks. Every time I go to write a little module that I want to be reusable and portable and tested and that handles errors I want to pull my hair out. Little things like implementing a command line interface for a script or two are so much nicer in Python that I think it's well worth giving up the marginal benefits of the tidyverse. Don't even get me started on basic utilities like logging, implementing custom classes, built-in support for modules and virtual environments, native assertions for zero effort sanity checks in the code, not having the entire developer experience revolve around RStudio/Positron, and so many more.
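On the CLI point, here is a minimal sketch of what this looks like with Python's stdlib `argparse` (the script purpose, flag names, and defaults are made up for illustration):

```python
# Hypothetical CLI for a small data-cleaning script, stdlib only.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Summarize a CSV file.")
    parser.add_argument("infile", help="path to the input CSV")
    parser.add_argument("--group-by", default="species",
                        help="column to group on (default: species)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable debug logging")
    return parser

# Parse an explicit argument list (instead of sys.argv) for demonstration.
args = build_parser().parse_args(["penguins.csv", "--group-by", "island", "-v"])
print(args.infile, args.group_by, args.verbose)
```

Help text, type conversion, and error messages all come for free; the closest R equivalents (optparse, argparser) are add-on packages.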

I don't want to come down too hard on R. I still use it almost daily, and as a language for statistical analysis I think it's completely unparalleled. But as someone who also has to frequently write "real software" to get some data science related parts of my job done my chest gets tight when I realize a core part of what I need to write has to be done in R.

Claus Wilke:

I'm using R as comparison because that's the only alternative we have but ideally we'd have a language/framework that provides the best of both worlds. I don't need R to win. I just want my day-to-day work to be easier. And I want students in my office to be able to do data manipulations on the fly.

Trevor Freeman:

IMO Polars+Plotnine is the best of both worlds. I'm not really sure why we shove all the Python people into Pandas+matplotlib by default. I guess it's because those are common frameworks with a lot of momentum historically. But I totally agree, their APIs are a hindrance for data exploration.

Daniel Morton:

For straight data manipulation nothing beats SQL. At least the sort of data manipulation SQL is good at. Doing something in SQL is either easy or heinous; there's no middle ground.

If we add SQL to the mix with Python and R, I think I see a pattern. There seems to be a tradeoff between being a good analytics language and being a good general-purpose language. SQL is great at the sort of data wrangling it's good at, but it's an old dog that can't learn new tricks. R is specifically designed for stats, so it has better syntax for anything stats/DS related, but is pretty useless for any other CS application. This, and R's infamous memory issues, are enough to hinder general adoption. Python is a general-purpose, dynamically typed language that is good enough, and fast enough, for mathematical calculations that a whole stats/DS ecosystem was written for it. Since Python is "good enough" at data applications and has uses outside the data world, industry naturally gravitates towards it.

Until someone develops a language with R-like syntax and the ability to compete with Python outside the stats world, we're probably stuck with the snake.

Rocks for Jocks:

Agree R is definitely easier! Especially for first-time learners.

Curious how everyone got convinced that Python is easier.

It’s definitely better for certain tasks, such as machine learning or building web applications, but not easier.

Trevor Freeman:

I think this varies from person to person. Ironically I find that students who have a background in other programming languages really tend to struggle at picking up R.

John Doe:

The comparison goes both ways.

Python’s standard library covers logging, OOP, modules, etc., but base Python can’t do any data science at all.

Base R, on the other hand, has dataframes, stats, models, ... built in.

Both languages rely on packages, and R has great packages for everything you listed. R and Python just have different things built in.

So to me your comment seems like a non-issue?

Trevor Freeman:

IMO it’s a lot easier to add DS packages on top of a language with a lot of features than it is to add packages for basic language features. But I guess that’s a matter of opinion.

Daniel Morton:

I switched from R to Python a decade ago and I've never really regretted it. R's memory issues are something of a dealbreaker for me. The fact that it doesn't exist outside the stats/DS world is something of a problem as well.

That said, matplotlib is terrible. So much so that I'm not sure it's even using Python properly. There's really no reason Python can't have a data vis package more like ggplot. Except that matplotlib already exists and no one has the time to reinvent that particular wheel.

Sklearn has its problems (I'm pretty sure the backend design violates several software architecture principles) but they don't affect daily life.

Pandas syntax is eccentric, and not in a good way. Numpy is okay, as is PyTorch, but they're their own ecosystems, so different rules apply.

And really I don't like Python that much. Something about it just feels cheap. It makes it far too easy to write bad, unstable code, and has spawned far too many bad coders. It's fine for mucking about, but for making something that lasts give me a good statically typed language any day.

Marcos H:

My plotting needs are very basic so I use plotly for almost everything. I tried out Altair lately when I needed a shaded error region which plotly doesn't support. I never got into ggplot so I use plotnine only occasionally.

Any of these are so much better than matplotlib - I am puzzled why anyone uses it too!

gmdbioinformatics:

I agree with the general sentiment - I also come from an R background, and struggled to move to python. I have a draft for a similar article somewhere, that I never published.

However, for machine learning, Python is just better supported. I do a lot of Kaggle competitions, and it is very rare to see R notebooks there. I've tried h2o, tidymodels, and similar libraries, but they are so far behind the Python counterparts. And they are also much slower and more memory-consuming, as R really struggles with big datasets.

Claus Wilke:

Not going to disagree, and in fact I believe I expressed a similar thought in my article.

gmdbioinformatics:

For example, in the code you posted:

```
library(tidyverse)
library(palmerpenguins)

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  )
```

```
import pandas as pd
from palmerpenguins import load_penguins

penguins = load_penguins()

(penguins
    .dropna(subset=['body_mass_g'])
    .groupby(['species', 'island'])
    .agg(
        body_weight_mean=('body_mass_g', 'mean'),
        body_weight_sd=('body_mass_g', 'std')
    )
    .reset_index()
)
```

- I like that R has a specific function for filtering (filter()). It seems more elegant than Python, which has a specific method just for removing NAs.

- I really like that in R you don't have to add quotes around column names

- `group_by` and `summarise` are much better function names than `groupby` and `agg`.

- pandas indexes still confuse me all the time

That being said... I tried many times using R notebooks in Kaggle, but it is just too slow.

And there is also the issue of installing packages - since there are no precompiled binaries for R packages on Linux, compiling R packages in a Docker image takes ages.

Claus Wilke:

Posit provides binary packages for Linux, as far as I know. Most people use them for Github Actions on Linux. See for example this log:

https://github.com/tidyverse/ggplot2/actions/runs/19065695259/job/54455657142

It took about 3 min. to set up all dependencies for ggplot2, completely from scratch.

(Compiling all dependencies from source used to be a huge issue for continuous integration for Posit, that's why they started providing binary packages I believe. I still remember when ggplot2 checks could take an hour or so and most of it was compiling Rcpp and similar.)

Jordi:

I use R and python in Kaggle. Sometimes it is better to install pak.

John Doe:

See mlr3, which is arguably more powerful than sklearn.

gmdbioinformatics:

Thanks, will give it a try.

Oleh Omelchenko:

Regarding dataviz options for Python - admittedly, matplotlib’s ubiquity is a great hindrance to the perceived variety of options for data visualization in Python (along with the fact that it is still the default option in many curricula). However, there are strong alternatives, among them plotnine (already mentioned in another comment) and Altair (a wrapper over Vega-Lite, which uses its own grammar of graphics dialect and is my tool of choice).

As for data wrangling, I’ll preface my remarks by admitting that neither R nor Python/Pandas is my language of choice for EDA (SQL is). However, I’m not entirely convinced by the “logistics” argument. Code for basic data wrangling in plain Python will surely seem cumbersome versus R, but AFAIK the tidyverse isn’t part of standard R either.

Alex:

R’s support for non-standard evaluation (NSE) is the biggest thing that makes it lower friction for interactive data analysis. The tidyverse packages use NSE all over the place. It enables you to use column names in your code without quoting them in a string, and eliminates most of the repetitive “df” stuttering that is required in Pandas code.

Python doesn’t have language-level support for this, which is probably a correct decision for a general-purpose language. At the same time, it makes it impossible for Python packages to provide the same affordances for interactive use that you get in R.
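For concreteness, a small pandas sketch of the stuttering (the data here is made up): the frame name repeats inside the bracket expression, while `.query()` offers a limited, string-based approximation of unquoted column names.

```python
import pandas as pd

# Hypothetical data just to illustrate the point; any DataFrame works.
df = pd.DataFrame({"species": ["Adelie", "Gentoo", "Gentoo"],
                   "body_mass_g": [3750, 5000, 5200]})

# Plain pandas: `df` appears twice for a single filter condition.
heavy1 = df[df["body_mass_g"] > 4000]

# .query() resolves bare column names inside the string, which is about
# as close as pandas gets to tidyverse-style unquoted columns.
heavy2 = df.query("body_mass_g > 4000")

print(heavy1.equals(heavy2))
```

The catch is that the query string is opaque to linters and type checkers, which is part of why string-based mini-languages never fully substitute for language-level NSE.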

Eurydice:

Thank you for writing this! The difference between Python and R is starker even than I expected, despite some familiarity with both. I'll have to try plotnine more often (or R, I suppose), because I'm absolutely more familiar with both matplotlib and seaborn. Looking forward to a post on deep learning and the advantages of PyTorch someday!

Claus Wilke:

PyTorch is the only game in town. There you go. Article done. ;-)

Neural Foundry:

Your point about the tediousness of Python for exploratory data analysis resonates with me. The tidyverse in R has been a game changer for interactive work. However, I do think pandas has improved significantly, and tools like Polars are starting to bridge the gap. Looking forward to your next post detailing the specific limitations.

Tony:

I’m mostly python and dabble in R. I think figures in R look great and make a bit more sense coding wise.

That said, my main coding experience is with Python, using pandas/numpy/matplotlib/seaborn. They get the job done because I’m more familiar with Python, but there are times I get frustrated, as making the visualization can take longer. I should definitely try plotnine (it's on my long list of todos).

Depending on your data and goals, I would also recommend SQL to manipulate tables. It can work very well with Python, for example by installing the DuckDB Python package; or, depending on your IDE, SQL can be baked in and quickly integrated into your workflow.
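As a sketch of the SQL-from-Python pattern, here is the penguins summary with the stdlib sqlite3 module (zero installs; the rows are made up, and DuckDB's Python API follows the same connect-and-query shape):

```python
import sqlite3

# In-memory database with a few hypothetical penguin rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, island TEXT, body_mass_g REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?, ?)",
    [("Adelie", "Torgersen", 3750), ("Gentoo", "Biscoe", 5000),
     ("Gentoo", "Biscoe", 5200), ("Adelie", "Torgersen", None)],
)

# The grouping/aggregation is one declarative statement.
rows = conn.execute("""
    SELECT species, island, AVG(body_mass_g) AS body_weight_mean
    FROM penguins
    WHERE body_mass_g IS NOT NULL
    GROUP BY species, island
""").fetchall()
print(rows)
```

With DuckDB the same query can also run directly against a pandas or Polars DataFrame, which is what makes it so convenient for mixed workflows.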

Carl:

Hooray for R! There are so many things I like about R, but since it wasn't mentioned: for data wrangling my money is still on the great `data.table`, and for large data sets with tens to hundreds of millions of rows `polars` (available for R and Python) does wonders, though it is also really verbose and kind of clunky to write.

```
library(data.table)
library(palmerpenguins)

penguins_dt <- as.data.table(copy(penguins))

penguins_dt[
  !is.na(body_mass_g),
  .(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  ),
  by = .(species, island)
]
```

Ian Campbell:

I’ve been a full time software engineer for nearly a decade and I recently finished a masters of data analytics and I have a few thoughts on this.

First, I could be mistaken, but you’ve actually said very little about the languages themselves as much as their ecosystems. Pandas isn’t Python. It may seem like splitting hairs and there are times that languages are defined by their ecosystems as much as their own design (such as the coupling of Ruby and Rails), but I think it’s worth mentioning because the libraries may change far faster than the language does.

My second point, though, is that Python is the default language for anybody who is learning programming for the sake of working with data. This means that many Python programmers only know Python and are more familiar with Python libraries than anything else. They learn to interact with data in the ways which are easiest in their lingua franca and, as they only know how to use a hammer, everything is a nail.

In software engineering this phenomenon is very common with Java as well, since it’s the de facto teaching language (though Python is giving it a run for its money in many programs) and is commonly used in enterprise. It’s not that Java code is particularly bad because of Java (though some will fight to the death over that), often it’s because Java code is written by amateurs with little experience in other programming paradigms and working with other tools.

Claus Wilke:

This is going to be a series of essays. The next one will specifically talk about language features. The one after about libraries and ecosystems. Then I'll probably also write one about history, how did we get here, and one about future, where could we go from here.

Aman Karunakaran:

I don't necessarily disagree with the thrust of this article (I personally only think Python is worth it over R/Excel if your data analysis interacts with lots of other readily-available Python library code you've written), but I'm not sure I understand the point you're trying to make in comparing non-pandas Python to R. R was specifically designed for data analysis while Python is meant to be general purpose, so it makes sense that the data analysis snippet would look more cumbersome in raw Python?

Perhaps this is a set up for your future parts, as alluded to by

>In brief, I believe there are several reasons why Python code often devolves into dealing with data logistics.

in which case, sure, I'm curious to see what your future examples will be.

As an aside: thanks for the heads up on plotnine! I've been coding in Python for years and doing dataviz with seaborn/mpl, and I'm really surprised no one at my previous company used plotnine. From a brief look, it seems really nice.

Claus Wilke:

My primary goal was just to show what I mean by code that focuses on logistics. It was not meant as a comparison to R. (In general, I'm using comparisons to R primarily for specific examples of how things can be better. That doesn't mean R as a whole is better. It has its own issues.)

Darby Saxbe:

ok, now do SPSS

Claus Wilke:

$99 per month, haha. Amazed that there are still people around who pay this kind of money for software that has no unique capabilities not available otherwise for free.

https://www.ibm.com/products/spss-statistics#pricing

Darby Saxbe:

It’s terrible but it’s what I learned on, so it feels really intuitive to me. I’ve been trying to switch fully to R for years. (At least my university pays for it)

eric:

and in DuckDB, it's just SQL:

```
FROM penguins
SELECT
    species,
    island,
    body_weight_mean: mean(body_mass_g),
    body_weight_sd: stddev(body_mass_g)
GROUP BY ALL;
```

mmmmmm:

I used R for many years and have been doing Python mostly for ~3 years. It's somewhat hard to compare because I'm still a stronger R programmer than a Python programmer, but I truly believe that R/tidyverse is so much more elegant. (That's if you are only doing data analysis and plotting; outside of that it's more complicated, and Python is often the right answer.)

But also: Tidyverse > Pandas > Base R

Jonas:

R + tidyverse is great and concise - but all your work is wasted, if you run into memory limits.

Python for non-ML data science stuff sucks because of matplotlib. However, “Seaborn Objects” fixes 95% of these issues (not to be confused with “Seaborn”, which is simply more matplotlib hell). Seaborn Objects is similar to ggplot2 in that it doesn’t force you to conform to some pre-defined structure: you just plot and change your plot as needed (it will handle the underlying data transformations). And therefore it is almost as concise as ggplot2, for example:

```
import seaborn.objects as so

(df
    .pipe(so.Plot, x="datetime", y="sales")
    .facet("county")
    .add(so.Bar())
)
```

Python for non-ML data science stuff is held back by Pandas. Don’t get me wrong, Pandas is a great and useful library. But it is not concise, does not support lazy evaluation, and like R it is limited by memory. You can seldom comprehend the full dataset, so you end up using df.head(). You seldom need to apply a transformation to the full dataset to verify its correctness. But Pandas (and other libraries without lazy evaluation) nonetheless forces you to transform and store the full dataset in memory (wasting your precious memory and your time). PySpark is lazy and only applies the transformation to the first 5 rows if you use df.transform(some_function).limit(5).display(). You can postpone large/slow data transformations until the absolute last minute, until absolutely necessary.

PySpark (SQL) is not quite as concise as tidyverse, but close:

```
from pyspark.sql.functions import col, mean, stddev

df = spark.read.csv("penguins.csv", header=True, inferSchema=True)

(df
    .filter(col("body_mass_g").isNotNull())
    .groupby("species", "island")
    .agg(
        mean("body_mass_g").alias("body_weight_mean"),
        stddev("body_mass_g").alias("body_weight_sd")
    )
)
```

And you can combine PySpark and Seaborn Objects:

```
(df
    .filter(col("body_mass_g").isNotNull())
    .groupby("species", "island")
    .agg(
        mean("body_mass_g").alias("body_weight_mean"),
        stddev("body_mass_g").alias("body_weight_sd")
    )
    .toPandas()
    .pipe(so.Plot, x="species")
    .pair(y=["body_weight_mean", "body_weight_sd"])
    .facet("island")
    .add(so.Dot())
)
```

Matplotlib and Pandas are useful because of their wide support, but in my opinion their API/syntax slows down non-ML work needlessly. I don’t think I would have discovered this, had it not been for R + tidyverse.

Hernan:

Agree with the general sentiment. To be completely fair, we could compare the code needed to do this task in base R, using aggregate(), for instance.

Hernan:

oops! Pardon me!
