FilterHN

Representing Python notebooks as dataflow graphs

104 points

by akshayka

5 days ago

| 10 comments

| marimo.io

▲

data-ottawa

1 day ago

[-]

I've been using marimo since January pretty heavily, I absolutely love it and would recommend it to anyone.

I run it with uv and --sandbox, which makes it much easier to share notebooks with teammates and not have to worry about limiting dependencies. Any issues I've had were with Python libraries themselves (specifically graphviz).

I really like how much easier it is to reason about interactive components vs Jupyter. The mo.ui.altair_chart method has got me to migrate off of matplotlib because charts can be fully integrated – as you can see in the demo being able to lasso data points or scrub a chart and analyze specific time periods is awesome.

One thing I don't like about reactive notebooks is that you have to be much more mindful of expensive and long-running calculations. There are features to help, like adding a run button, but often I end up just disabling auto-run, which does reduce the value of the reactive flow. For those use cases I don't find myself using marimo over Jupyter.

I think the entire marimo team deserves a shoutout, the quality of the software is excellent, they've moved very quickly, and they have been very receptive to issues and feature suggestions.

▲

akshayka

1 day ago

[-]

Thanks for the shoutout!

We're committed to having an excellent experience for working with expensive notebooks [1]. At least for my own personal work, I find that there are many reasons to use marimo even when autorun is disabled — you still get guarantees on state, rich dataframe views, reusable functions [2], the Python file format, and more. If you have feedback on how we might improve the experience, we'd love to hear it.

[1] https://docs.marimo.io/guides/expensive_notebooks/

[2] https://docs.marimo.io/guides/reusing_functions/

▲

alyxya

9 hours ago

[-]

I haven’t tried marimo, so I’m not sure how it currently works, but I think instead of disabling autorun for slow and expensive computations, it sounds like it would be nicer if there were heuristics or benchmarks to automatically determine what might be slow and then execute things in varying orders lazily.

▲

getnormality

1 day ago

[-]

> You have to be very disciplined to make a Jupyter notebook that is actually reproducible

This seems not necessarily very hard to me? All you have to do is keep yourself honest by actually trying to reproduce the results of the notebook when you're done:

1. Copy the notebook

2. Run from first cell in the copy

3. Check that the results are the same

4. If not the same, debug and repeat

What makes it hard is when the feedback loop is slow because the data is big. But not all data is big!

Another thing that might make it hard is if your execution is so chaotic that debugging is impossible because what you did and what you think you did bear no resemblance. But personally I wouldn't define rising above that state as incredible discipline. For people who suffer from that issue, I think the best help would be a command history similar to that provided by RStudio.

All that said, Marimo seems great and I agree notebooks are dangerous if their results are trusted as much as those of fully explicit processing pipelines.

▲

tastyminerals2

1 day ago

[-]

Not very hard for you, perhaps, but the reproducibility numbers tell a different story. Back in the day, when we were searching public repos for ML model implementations and found ipynb files in one, we skipped the repo without delving into details. Within the company, data engineers' research notebooks were never allowed inside a repo. Experiment, yes, but rewrite it in plain Python and push.

▲

getnormality

1 day ago

[-]

A lot of people don't put away shopping carts, but the conclusion from that isn't that putting shopping carts away requires very high discipline. (Maybe if what is meant by "very high" is "not so low that everyone will do it", which is perhaps the point)

▲

esafak

1 day ago

[-]

Notebooks ought to have embedded metadata, like a pyproject.toml, to list the dependencies.

▲

kylebarron

1 day ago

[-]

https://github.com/manzt/juv

▲

esafak

1 day ago

[-]

Good. This is the part I'm talking about: https://peps.python.org/pep-0723/

▲

philsnow

1 day ago

[-]

> This seems not necessarily very hard to me? All you have to do is keep yourself honest by actually trying to reproduce the results of the notebook when you're done

It's one thing when I'm relying on my own attention to detail to make sure all the intermediate results have been correctly recalculated, but it's entirely another when I have to rely on even trusted co-workers' attention to detail, much less randos on github. As a sibling comment points out, the "reproducibility crisis" numbers are very much not in favor of this approach being the right idea.

... Or you could work in a format that makes incorrect / out-of-date intermediate state impossible (or at least hard) to represent, which is (I believe) what marimo is an attempt at.
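To make the "format that makes stale state hard to represent" idea concrete, here is a minimal sketch of a notebook as a dataflow graph. It is an illustration only, not marimo's actual implementation: the function names and the toy cells are hypothetical, and the static analysis is deliberately naive (it only looks at plain variable names, ignoring imports, function scopes, and comprehensions). Each variable may be defined by exactly one cell, and editing a cell marks every downstream cell as needing a re-run.

```python
import ast
from graphlib import TopologicalSorter  # standard library since Python 3.9


def defs_and_refs(source):
    """Return (names assigned, names read) for one cell's source code."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            (defs if isinstance(node.ctx, ast.Store) else refs).add(node.id)
    return defs, refs - defs


def build_graph(cells):
    """Map each cell name to the set of cells whose variables it reads."""
    owner, reads = {}, {}
    for name, source in cells.items():
        defs, refs = defs_and_refs(source)
        reads[name] = refs
        for var in defs:
            if var in owner:
                # Two cells defining one variable would make the graph ambiguous.
                raise ValueError(f"{var!r} defined in both {owner[var]!r} and {name!r}")
            owner[var] = name
    return {name: {owner[r] for r in refs if r in owner} for name, refs in reads.items()}


def stale_after_edit(graph, edited):
    """Cells that must re-run after `edited` changes, in dependency order."""
    order = list(TopologicalSorter(graph).static_order())
    stale = {edited}
    for cell in order:
        if graph[cell] & stale:
            stale.add(cell)
    return [cell for cell in order if cell in stale]


# Toy notebook: cell order in the file does not matter, only the data flow.
cells = {
    "report": "total = sum(scaled)",
    "scale": "scaled = data * factor",
    "load": "data = [1, 2, 3]",
    "param": "factor = 10",
}
graph = build_graph(cells)
print(stale_after_edit(graph, "param"))
```

In this sketch, editing the `param` cell marks `scale` and `report` as stale while `load` is left alone, and a second cell assigning `data` is rejected outright.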

▲

exe34

1 day ago

[-]

The few times I've made notebooks, I've tried to migrate code out of the notebook as soon as possible and then only import foo and run foo.bar() in the notebook. It helps to only have the top level config/layout in the notebook.

▲

cantdutchthis

1 day ago

[-]

Fun detail, you can actually define functions in a marimo notebook and load them in another Python file if you want.

Needs a bit of extra config but tis a really nice feature.

https://docs.marimo.io/guides/reusing_functions/

▲

Galanwe

1 day ago

[-]

There are a lot of these tools to somehow "fix the reproducibility crisis of notebooks".

Yet from my experience, you quickly learn to "restart kernel and run all" before sharing to make sure everything is good.

Only the most novice users get caught by the "out of order cells" trap, and those will

1) not use anything that adds complexity, because by definition they are novices

2) fall into other traps along the way anyway, because that's how you learn

Thus, IMHO, these flow tools are only seen as useful by _real devs with savior syndrome_, pushing dev solutions onto exploratory research users, and that will never catch on.

▲

ayhanfuat

1 day ago

[-]

I like marimo and I would probably use it if it weren't for years of muscle memory. That said, I don't like this reproducibility crisis story either. Notebooks are for exploration. It is okay if they are messy. If the tool I am using doesn't get in the way of my process but instead makes it fast enough, then it is already doing its job. Once you are done it is up to you to let it die, make sure it is something you can go back to and iterate on, or package it and make it usable elsewhere.

▲

akshayka

1 day ago

[-]

What kind of muscle memory is holding you back? We recently added support for Jupyter-style command mode in keyboard shortcuts [1]. We're currently rewriting our VS Code extension to feel native, similar to how Jupyter feels in VS Code.

Anything else we can help with?

▲

ayhanfuat

1 day ago

[-]

Ah that's great to hear. For me it is mostly the command mode. Is it only create / copy / paste for now? Can I also do the same for move / split / delete / undelete?

▲

akshayka

7 hours ago

[-]

Sorry I forgot the link. We have shortcuts for those as well. If any are missing please file an issue and we can consider adding them.

Here it is: https://docs.marimo.io/guides/editor_features/overview/#conf...

▲

analog31

1 day ago

[-]

I compare it to day trading, where you always close out your position at the end of the day, and never leave anything running overnight. "Restart kernel and run all" when I finish a lab session. Assume that I might not come back to it for a long time, and that it won't be fresh in my mind.

It's harder when some of your cells took hours to execute, which mine don't. Then I think you should be using data files.

The tools that fix this problem seem to evoke "the cure is worse than the disease." At this point, basic Jupyter notebooks are by far the most reproducible way I've ever worked.
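The "use data files" suggestion for hours-long cells can be sketched as a small on-disk cache, so that "restart kernel and run all" reloads the expensive result instead of recomputing it. This is a rough illustration under stated assumptions, not a recommendation of a specific tool: the `cached` helper, the key string, and `slow_step` are all hypothetical, and the key must be changed by hand whenever the inputs or the code change.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(cache_dir, key, compute):
    """Return compute()'s result, loading it from disk if `key` was seen before."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(key.encode()).hexdigest()[:16] + ".pkl")
    if path.exists():
        # Only unpickle files you wrote yourself; pickle is unsafe on untrusted data.
        return pickle.loads(path.read_bytes())
    result = compute()
    path.write_bytes(pickle.dumps(result))
    return result


calls = []


def slow_step():
    calls.append(1)  # stands in for hours of computation
    return sum(range(1_000_000))


with tempfile.TemporaryDirectory() as tmp:
    first = cached(tmp, "slow_step-v1", slow_step)   # computed and written to disk
    second = cached(tmp, "slow_step-v1", slow_step)  # loaded from disk, not recomputed
```

The trade-off is that a stale cache file is itself a form of hidden state, so the key has to capture everything the result depends on.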

▲

akshayka

1 day ago

[-]

Thanks for the comments. I'm the original creator of marimo.

Habitually running restart and run all works okay for very lightweight notebooks, but it's a habit you need to develop, and I believe our tools should work by default. It doesn't work at all for entire categories of work, where computation is heavy and the cost of a bug is high.

From the blog, you will see that reactive execution not only minimizes hidden state, it also enables rapid data exploration (far more rapid than a traditional notebook), reuse as data apps, reuse as scripts, a far more intelligent module autoreloader, and much more.

marimo is not just another Jupyter extension, it's a new kind of notebook. While it may not be for you, marimo has been open source for over a year and has strong traction at many companies and universities, including among many people you may not consider "real devs". The question of whether marimo will catch on has already been resolved :)

https://github.com/marimo-team/marimo

▲

suuuuuuuu

1 day ago

[-]

I would consider replacing my jupyterlab usage with marimo were it less opinionated about workflow - it offers a lot of benefits that aren't tied to its execution model. I like the editor/interface and the representation as python files for portability, version control, and the ability to import from other notebooks, but I have no interest in changing my workflow (in particular insofar as marimo is restricted compared to python itself). E.g., I want to be able to redefine variables and use star imports in my personal, exploratory notebooks, and I'm happy to retain responsibility for top-to-bottom executability (as in regular python scripts). I would definitely consider marimo if these restrictions could be opted out of if one has reactive execution disabled.

▲

akshayka

7 hours ago

[-]

Thanks for the feedback. We decided early on against having a “non-reactive” mode. It would negate many of our core benefits (including importing from other notebooks), and it would also lead to a fragmented ecosystem — if someone shared a notebook with you, your experience with it would depend on whether it was executed in “reactive” or “non-reactive” mode. Still I appreciate the kind words about our editor and file format, and am sorry we can’t accommodate your use case.

We describe why we opted against “disabling” the graph at the end of this blog: https://marimo.io/blog/lessons-learned

▲

nylonstrung

1 day ago

[-]

Marimo seems really solid if you like tools like Streamlit or Observable

▲

cantdutchthis

1 day ago

[-]

It certainly has some of the “widget feelings” from streamlit but the real killer feature is that you’re still always in a notebook. You can still explore with these widgets, which is a stellar experience.

▲

tastyminerals2

1 day ago

[-]

Personally, I've had a good experience with marimo so far. Reactive execution, variable deduplication, and the clear separation of business logic from UI element logic that is forced on you are all good. It retrains ppl to write slightly better structured Python code, which is a win in my eyes.

▲

riedel

1 day ago

[-]

Even with dataflow extensions (such as ipyflow [0]) I am still struggling with the execution model of notebooks in general. I often still see people defining functions and classes in notebooks to somehow handle prototyping loops.

I would love to see DAGs like the SSA form used in compilers, which also support loop operators. However, IMHO the notebook interface also needs to adjust for that (cell indentation?). That said, the strength of notebooks shows more in document authoring, as with quarto, which IMHO mostly conflicts with more complex control flow.

[0] https://github.com/ipyflow/ipyflow

▲

philsnow

1 day ago

[-]

As an outsider to the whole notebook ecosystem, I am absolutely gobsmacked that the representation of the notebook makes it possible to have out-of-date intermediate results. Haven't they been around for like 10+ years?

This is one of those things that is blindingly obvious to people in adjacent sectors of the industry, but maybe there just hasn't been enough cross-pollination of ideas in that direction (or in either direction).

▲

createaccount99

18 hours ago

[-]

Based on the title, I was expecting a flow editor GUI, actually.

▲

PeterStuer

1 day ago

[-]

Would you not need "volatile" markup for anything touching a system external to Python?

▲

probablypower