Jacquard lab notebook: Version control and provenance for empirical research
98 points
12 days ago
| 8 comments
| inkandswitch.com
bluenose69
9 days ago
[-]
I'm sure this will be useful for some folks, but I'll stick with 'git' to track changes and 'make' to organize my work.

It seems as though this project is a sort of system for creating Makefiles, and that would be great for folks who are unfamiliar with them.

I'm not sure of the audience, though. At least in my research area, there are mainly two groups.

1. People who use LaTeX (and scripting languages), who are comfortable with writing Makefiles.

2. People who work in Excel (etc.) and incorporate results into MS Word.

Neither group seems a likely candidate for this (admittedly intriguing) software.

reply
conformist
9 days ago
[-]
There are many people in group 1 in academia, e.g. in physics and maths, who are comfortable with LaTeX and scripting languages but mostly use email to share files. Anything that helps them organise their collaborative work better without having to deal with git helps (e.g. see the success of Overleaf).
reply
ska
9 days ago
[-]
Part of the problem is that git is a fairly poor fit for these workflows.

I spent time getting some mathematicians working together via version control rather than email; it was a bit of a mixed bag even using something simpler (e.g. svn). Eventually we moved back to email, except the rule was to email me your update as a reply to the version you edited, and I scripted something to put it all into a repo on my end to manage merges etc. Worked ok. Better than the version where we locked access for edit but people forgot to unlock and went off to a conference...

If I was doing the same now, I'd probably set up on github, give each person a branch off main, and give them scripts for "send my changes" and "update other changes" - then manage all the merges behind the scenes for anyone who didn't want to bother.
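
Roughly this kind of thing (a minimal sketch in Python, assuming each person already has their own branch checked out locally; the commit message and remote name are just placeholders):

    import subprocess, datetime

    def git(*args):
        subprocess.run(["git", *args], check=True)

    def send_my_changes():
        # Commit whatever they edited and push it to their own branch.
        git("add", "-A")
        git("commit", "-m", f"update {datetime.date.today()}")  # placeholder message
        git("push", "origin", "HEAD")

    def update_other_changes():
        # Bring in whatever has already been merged into main.
        git("fetch", "origin")
        git("merge", "origin/main")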

I think expecting everyone in a working group to acquire the skills to deal with merge issues properly etc. is asking too much if they don't do any significant software work already. If they do, teach them.

reply
__MatrixMan__
9 days ago
[-]
It's easy to collect and verify metadata involving the hashes of intermediate artifacts, so that readers can inspect it and trust that the charts correspond to the data because they trust whoever published the metadata. This could be automatic, just built into the reader.
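
For instance, the reader-side check could be as simple as this (a minimal sketch, assuming the authors published a manifest of SHA-256 hashes alongside the artifacts; the file names here are hypothetical):

    import hashlib, json, pathlib

    def sha256(path):
        return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

    # "provenance.json" is a hypothetical manifest mapping each published
    # artifact to the hash the authors claim for it.
    manifest = json.loads(pathlib.Path("provenance.json").read_text())
    for artifact, expected in manifest["outputs"].items():
        status = "ok" if sha256(artifact) == expected else "MISMATCH"
        print(f"{artifact}: {status}")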

The trouble with make is that unless you're very disciplined or very lucky, if you build the images and documents on your machine and I do the same on mine, we're going to get artifacts that look similar but hash differently, if for no other reason than that there's a timestamp showing up somewhere and throwing it off (though often for more concerning reasons involving the versions of whatever your Makefile is calling).

That prevents any kind of automated consensus finding about the functional dependency between the artifacts. Now reviewers must rebuild the artifacts themselves and compare the outputs visually before they can be assured that the data and visualizations are indeed connected.

So if we want to get to a place where papers can be more readily trusted--a place where the parts of research that can be replicated automatically, are replicated automatically, then we're going to need something that provides a bit more structure than make (something like nix, with a front end like Jacquard lab notebook).

The idea that we could take some verifiable computational step and represent it in a UI such that the status of that verification is accessible, rather than treating the makefile as an authoritative black box... I think it's rather exciting. Even if I don't really care about the UI so much, having the computational structure be accessible is important.

reply
XorNot
9 days ago
[-]
Here's the thing though: you're trying to solve a problem here which doesn't exist.

In physical science, no one commits academic fraud by introducing a mismatch between the graphs they publish and the data they collected... they just enter bad data to start with. Or apply extremely invalid statistical methods or the like.

You can't fix this by trying to attest the data pipeline.

reply
__MatrixMan__
9 days ago
[-]
I'm not really trying to address fraud. Most of the time when I try to recreate a computational result from a paper, things go poorly. I want to fix that.

Recently I found one where the authors must've mislabeled something because the data for mutant A actually corresponded with the plot for mutant B.

Other times it'll take days of tinkering just to figure out which versions of the dependencies are necessary to make it work at all.

None of that sort of sleuthing should've required a human in the loop at all. I should be able to hold one thing constant (be it the data or the pipeline), change the other, and rebuild the paper to determine whether the conclusions are invariant to the change I made.

Human patience for applying skepticism to complex things is scarce. I want to automate as much of replication as possible so that what skepticism is available is applied more effectively. It would just be a nicer world to live in.

reply
jpeloquin
9 days ago
[-]
Even in group 1, when I go back to a project that I haven't worked on in years, it would be helpful to be able to query the build system to list the dependencies of a particular artifact, including data dependencies. I.e., reverse dependency lookup. Also list which files could change as a consequence of changing another artifact. And return results based on what the build actually did, not just the rules as specified. I think make can't do this because it has no ability to hash & cache results. Newer build systems like Bazel, Please, and Pants should be able to do this but I haven't used them much yet.
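
Something like this is what I mean (a sketch over a hypothetical recorded build graph that maps each artifact to the inputs the build actually read; the paths are made up):

    # Hypothetical record of what the build actually read per artifact.
    deps = {
        "figures/fig2.png": ["data/trials.csv", "scripts/plot_fig2.py"],
        "tables/table1.tex": ["data/trials.csv", "scripts/summarize.py"],
        "paper.pdf": ["figures/fig2.png", "tables/table1.tex", "paper.tex"],
    }

    def affected_by(changed, deps):
        # Reverse lookup: everything that transitively depends on `changed`.
        hit = {changed}
        grew = True
        while grew:
            before = len(hit)
            for artifact, inputs in deps.items():
                if hit & set(inputs):
                    hit.add(artifact)
            grew = len(hit) > before
        return hit - {changed}

    # affected_by("data/trials.csv", deps)
    # -> {"figures/fig2.png", "tables/table1.tex", "paper.pdf"}
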
reply
cashsterling
9 days ago
[-]
I follow the work of the ink & switch folks... they have a lot of interesting ideas around academic research management and academic publishing.

I have a day job, but spend a lot of time thinking about ways to improve academic/technical publishing in the modern era. There are a lot of problems with our current academic publishing model: a lot of pay-walled articles / limited public access to research, many articles with no or limited access to the raw data or analytical code, and articles that don't make use of modern technology to enhance communication (interactive plots, animations, CAD files, video, etc.).

Top-level academic journals are trying to raise the bar on research publication standards (partially to avoid the embarrassment of publishing fraudulent research), but they are all stuck not wanting to kill the golden goose. Academic publishing is a multi-billion dollar affair and making research open, etc. would damage their revenue model.

We need a GitHub for Science... not in the sense of Microsoft owning a publishing platform, but in the sense of what GitHub provides for computer science: a platform for public collaboration on code and computer science ideas. We need a federated, open platform for managing experiments and data (i.e. an electronic lab notebook) and communicating research to the public (via code, animations, plots, written text in Typst/LaTeX/Markdown, video, audio, presentations, etc.). Ideally this platform would also have an associated forum for discussion of and feedback on research.

reply
LowkeyPlatypus
9 days ago
[-]
The idea sounds great! However, I see some potential issues. First, IIUC using this tool means that researchers will have to edit their code within it, which may be fine for small edits, but for larger changes, most people would prefer to rely on their favourite IDE. Moreover, if the scripts take a long time to run, this could be problematic and slow down workflows. So, I think this “notebook” could be excellent for projects with a small amount of code, but it may be less suitable for larger projects.

Anyway, it’s a really cool project, and I’m looking forward to seeing how it grows.

reply
Tachyooon
8 days ago
[-]
I had the same thought - researchers who are used to having their workflows in VS Code, for example, could be missing out on a lot of tools they rely on. In their description they talk about how they want to meet researchers where they're at, "building bridges" to existing workflows and software. So I'm hopeful that they will consider integrating with popular programming and data analysis set-ups. The project seems to be just getting started so it'll be interesting to see where this goes :)
reply
karencarits
9 days ago
[-]
Coming from R, I would recommend that researchers have a look at Quarto [1] and packages such as workflowr [2], which also aim to ensure a tight and reproducible pipeline from raw data to the finished paper.

[1] https://quarto.org/docs/manuscripts/authoring/rstudio.html

[2] https://workflowr.io/

reply
data_maan
9 days ago
[-]
Behind all the technical lingo, what problem does this solve that cannot be solved by sticking to a git repo that tracks your research and using some simple actions on top of GitHub for visualization etc.?
reply
throwpoaster
9 days ago
[-]
Remember the famous HN comment:

“This ‘Dropbox’ project of yours looks neat, but why wouldn’t people just use ftp and rsync?”

reply
scherlock
9 days ago
[-]
The fact that software engineers are the only folks with the skills to do what you just said.

When I was working on my PhD thesis 20 years ago, I had a giant makefile that generated my graphs and tables and then generated the thesis from LaTeX.

All of it was in version control, which made it so much easier, but there's no way anyone other than someone who uses those tools would be able to figure it out.

reply
exe34
9 days ago
[-]
> The fact that software engineers are the only folks with the skills to do what you just said.

I've always been impressed by the amount of effort that people are willing to put in to avoid using version control. I used Mercurial about 18 years ago, and then moved to git when that took off, and I never write much text for work or leisure without putting it in git. I don't even use branches at all outside of work - it's just so that the previous versions are always available. This applies to my reading notes, travel plans, budgeting, etc.

reply
Tachyooon
8 days ago
[-]
Version control is fantastic, and you can get quite creative with it too. Git scraping, for example (https://simonwillison.net/2021/Dec/7/git-history/). But as nice as Git is, people who are not trained as software developers or computer scientists often don't have much exposure to it, and when they do, it's a relatively big step to learn to use it. In my mechanical engineering studies we had to do quite a bit of programming, but none of my group mates ever wanted to use version control, not even on bigger projects. The Jacquard notebook and other Ink & Switch projects are aimed at people with non-software backgrounds, which is quite nice to see :)
reply
ska
9 days ago
[-]
Oh, they all use version control.

It just looks like "conf_paper1.tex" "conf_paper3.tex" "conf_paper_friday.tex" "conf_paper_20240907.tex" "conf_paper_last_version.tex" "conf_paper_final.tex"

...

"conf_paper_final2.tex"

Oh, and the figures reference files on local dir structure.

And the actual, eventually published version, only exists in email back and forth with publisher for style files etc.

reply
RhysabOweyn
8 days ago
[-]
I once worked with a professor and some graduate students who insisted on using Box as a code repository since it kept a log of changes to files under a folder. I tried to convince them to switch to git by making a set of tutorial videos explaining the basics, but it still wasn't enough to get them to switch.
reply
svnt
8 days ago
[-]
When GitHub started, for most people its only purpose was that you didn't have to manage a server holding your repository. Avoiding it at that point for private projects required nothing more than ssh and a $5/mo virtual machine somewhere, and all of their customers could have followed the steps to set that up. It still succeeded.
reply
sega_sai
9 days ago
[-]
That is actually a very interesting idea. While I am not necessarily interested in some sort of build system for a paper, being able to figure out which plots need to be regenerated when some data file or some equation is changed is useful. For this, being able to encode the version of the script and all the data files used in creating the plot would be valuable.
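
For example, a sidecar file per plot recording the hashes of the script and data that produced it would already get you most of the way there (a minimal sketch; the file names are hypothetical):

    import hashlib, json, pathlib

    def sha256(path):
        return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

    def record_inputs(plot, inputs):
        # Written when the plot is made; "<plot>.inputs.json" is a made-up name.
        meta = {p: sha256(p) for p in inputs}
        pathlib.Path(plot + ".inputs.json").write_text(json.dumps(meta, indent=2))

    def needs_regenerating(plot):
        # Stale if any recorded input hashes no longer match the files on disk.
        meta = json.loads(pathlib.Path(plot + ".inputs.json").read_text())
        return any(sha256(p) != h for p, h in meta.items())
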
reply
kleton
9 days ago
[-]
Given the replication crisis in the sciences, objectively this is probably a good thing, but the incumbents in the field would strongly push back against it becoming a norm.

https://en.wikipedia.org/wiki/Replication_crisis

reply
ska
9 days ago
[-]
This addresses a nearly orthogonal issue.
reply
idiotlogical
9 days ago
[-]
Am I not smart, or is there something about the "Subscribe" page that won't let me get past the "Name" field? I tried a few combos and even an email address, and it doesn't validate:

https://buttondown.com/inkandswitch

reply
pvh
7 days ago
[-]
Hey, thanks! I don't know what's regressed here but I've emailed the support people for buttondown.
reply