Discussion about this post

User's avatar
Josh's avatar

Quick comment about notebook reproducibility: I am a very recent convert to Marimo notebooks, which encode dependencies between cells and avoid this problem. Also very useful for beautiful and interactive results that can be viewed as a slide deck -- a hybrid between a notebook and a dashboard. Highly recommend you check it out!

Luca Venturini's avatar

I've written analysis pipelines for multiple research centres and a clinical company. Trying to have everything in a single program or docker image, with no intermediate files, is pure madness. Aside from forcing to restart from scratch (an afternoon being a very short time for certain analytical tasks, and setting aside the raw compute cost), this forces the developer to keep all dependencies together -- it can very quickly become dependency hell.

Breaking down the pipeline in small, discrete chunks and orchestrate it with modern tools such as Cromwell or nextflow is a much better option. Upon rerunning, these workflow orchestrators are smart enough to understand which parts of the workflow have changed (data or programs used or initial parameters) and rerun only the necessary parts.

3 more comments...

No posts

Ready for more?