5 Comments
Josh

Quick comment about notebook reproducibility: I am a very recent convert to Marimo notebooks, which encode dependencies between cells and avoid this problem. Also very useful for beautiful and interactive results that can be viewed as a slide deck -- a hybrid between a notebook and a dashboard. Highly recommend you check it out!
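
For the curious, a marimo notebook is stored as a plain Python file: each cell is a function whose parameters are the variables it reads and whose return values are the variables it defines, which is how the dependency graph between cells gets built. From memory the on-disk format looks roughly like the sketch below (cell contents and file names are made up):

import marimo

app = marimo.App()

@app.cell
def _():
    import pandas as pd
    df = pd.read_csv("data/raw.csv")  # hypothetical input file
    return (df,)

@app.cell
def _(df):
    # this cell declares df as an input, so marimo reruns it whenever df changes
    summary = df.describe()
    summary
    return (summary,)

if __name__ == "__main__":
    app.run()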

Luca Venturini

I've written analysis pipelines for multiple research centres and a clinical company. Trying to have everything in a single program or Docker image, with no intermediate files, is pure madness. Aside from forcing you to restart from scratch (an afternoon being a very short time for certain analytical tasks, and setting aside the raw compute cost), it forces the developer to keep all dependencies together -- which can very quickly become dependency hell.

Breaking the pipeline down into small, discrete chunks and orchestrating them with modern tools such as Cromwell or Nextflow is a much better option. Upon rerunning, these workflow orchestrators are smart enough to work out which parts of the workflow have changed (data, programs used, or initial parameters) and rerun only the necessary parts.
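
To illustrate the idea only -- this is a toy Python sketch with invented script and file names, not how Nextflow or Cromwell actually work -- a step can be skipped when its output already exists and is newer than all of its inputs; the real tools also track changes to parameters and to the programs themselves, and handle caching and cluster execution:

import subprocess
from pathlib import Path

def run_step(cmd, inputs, output):
    # skip the step if the output exists and is newer than every input
    out = Path(output)
    if out.exists() and all(out.stat().st_mtime >= Path(i).stat().st_mtime for i in inputs):
        print(f"up to date, skipping: {output}")
        return
    subprocess.run(cmd, check=True)

run_step(["python", "01_clean.py"], inputs=["data/raw.csv"], output="data/clean.csv")
run_step(["python", "02_model.py"], inputs=["data/clean.csv"], output="results/model.pkl")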

JV
Mar 28 (edited)

It’s been a while, but I used to handle this using Makefiles. Each step of the pipeline would be a separate script producing an intermediate dataset, graph, or table, and after any change make would rerun only the dependent steps.
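
Something like this sketch, with invented file and script names (recipe lines must be indented with a real tab):

all: figures/summary.png

data/clean.csv: data/raw.csv clean.py
	python clean.py data/raw.csv data/clean.csv

figures/summary.png: data/clean.csv plot.py
	python plot.py data/clean.csv figures/summary.png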

ScienceGrump

X = my_expensive_computation()
X.to_csv("data/X.csv")
X = pd.read_csv("data/X.csv")

Trivial to do after anything expensive. Comment out the first two lines and go on your merry way. I don't even understand what the argument against this would be, but I think it could only make sense to someone with extremely limited experience.
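
If you don't even want to comment lines out, the same idea works as a guard (same hypothetical path and placeholder my_expensive_computation as above): load the cache if it exists, otherwise compute and write it.

import os
import pandas as pd

if os.path.exists("data/X.csv"):
    X = pd.read_csv("data/X.csv")
else:
    X = my_expensive_computation()
    X.to_csv("data/X.csv")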

What novices fail to understand is how critical speed is to analysis -- really to any technical skill. You need to be able to look at the data again and again in many different ways, so your workflow has to support frictionless data manipulation. This is why the people using AI for analysis are fools, IMO. It is intrinsically calcifying to have a machine write your code. It will make many dumb decisions and never review them, and the very fact of having it write your code for you creates a barrier to your reviewing them. And bluntly, if it is faster than you, you are just too slow.

Clayton Nash

These are really two different problems being solved, which is why you're ending up with diametrically opposed views. The "end to end" problem is answering the question "When we did this 9 months ago we got a different answer -- what's changed?", and there, having an end-to-end REPRODUCIBLE process is super important because you need to be able to trivially rerun the old result. This is a business issue, not a data science issue -- but it's very common, and the kind of thing a data scientist needs to be aware of.

For students, when the goal is to get them to understand the tools and processes, this is less key. But they should understand that at some point they will be asked the question above.