Build durable workflows with Postgres
156 points | 2 days ago | 15 comments | dbos.dev | HN
cmdtab
2 days ago
[-]
Recently moved some of our background jobs from Graphile Worker to DBOS. Really recommend it for the simplicity. Took me half an hour.

I evaluated Temporal, Trigger, Cloudflare Workflows (highly not recommended), etc., and this was the easiest to adopt incrementally. Didn't need to change our infrastructure at all. Just plugged the worker in where I had Graphile Worker.

The hosted service UX and frontend could use a lot of work, though they aren't required to use DBOS. OTEL support was there.

reply
johtso
2 days ago
[-]
Why would you not recommend Cloudflare workflows? Was thinking of using them in my current project..
reply
cmdtab
1 day ago
[-]
They inherit all the limitations of DOs. For example, if you want to do anything that requires more than 6 TCP connections, every fetch request will start failing silently because there are no more TCP connections to go through. This was a deal breaker for us. Their solution was to split our code into more workflows or DOs.

You are limited to 128 MB of RAM, which means everything has to be streamed. You will rewrite your code around this, because many Node libraries don't have streaming alternatives for things that spike memory usage.

The observability tab was buggy. The lifecycle chart is hard to read for figuring out when things will be evicted. Lots of small hidden limitations. Rate limits are very low for any mid-scale application. Full Node compatibility isn't there yet (work in progress), so we needed to change some modules.

Overall, a gigantic waste of time unless you are doing something small-scale. Just go with Restate/Upstash + Lambdas/Cloud Run if you want a simpler experience that scales in a serverless manner.

reply
Shorn
23 hours ago
[-]
That is a valuable, info-dense comment. Thanks.
reply
barapa
2 days ago
[-]
Agree on the UI - I wish it were improved
reply
qianli_cs
2 days ago
[-]
We heard you! Working on improvements based on user feedback. Stay tuned :)
reply
LudwigNagasena
2 days ago
[-]
What was the reason for the transition?
reply
cmdtab
2 days ago
[-]
Needed checkpoints in some of our jobs that wrap an AI agent, so we can reduce cost and increase reliability (the workflow resumes from a mid step as opposed to a complete restart).

We had already checkpointed the agent, but then figured it's better to have a generic abstraction for other stuff we do.

reply
diarrhea
2 days ago
[-]
Interesting!

What made you opt for DBOS over Temporal?

reply
cmdtab
2 days ago
[-]
Temporal required re-architecting some stuff, their TypeScript SDK and sandbox are a bit unintuitive to use (so they would have been an additional thing for the team to grok), and it's additional infrastructure to maintain. There was a latency trade-off too, which in our case mattered.

Didn't face any issues though. Temporal's observability and UI were better than DBOS's. It's just harder to do an incremental migration in an existing codebase.

reply
lacoolj
2 days ago
[-]
So we do this exact thing in our software, and I implemented it (along with other devs), and I was still entranced enough to read through to the end. No differences between ours and theirs (this is a fairly common practice anyway), but the article is written in succinct, informative chunks with "images" (of code) in between.

This is how you write a technical article. Thanks to the author for the nice read :)

reply
alpb
2 days ago
[-]
I've been following DBOS for a while and I think the model isn't too different than Azure Durable Functions (which uses Azure Queues/Tables under the covers to maintain state). https://learn.microsoft.com/en-us/azure/azure-functions/dura...

Perhaps the only difference is that Azure Durable Functions has more syntactic sugar in C# (versus DBOS's choice of Python) for preserving call results in persistent storage? Where else do they differ? In the end, all of them seem to be doing what Temporal is doing (which has its own shortcomings, and it's also possible to get it wrong if you call a function directly instead of invoking it via an Activity, etc.)?

reply
rubenvanwyk
1 day ago
[-]
This actually looks super amazing for C#, but it doesn't use Postgres?? All the backends seem to be purely Azure-related / Microsoft products, so although the framework is Apache 2.0, your infrastructure needs to rely on MS?
reply
KraftyOne
2 days ago
[-]
Both do durable workflows with similar guarantees. The big difference is that DBOS is an open-source library you can add to your existing code and run anywhere, whereas Durable Functions is a cloud offering for orchestrating serverless functions on Azure.
reply
alpb
2 days ago
[-]
As far as I know, Azure Durable Functions doesn't have a proprietary server-side component; it's actually a fully open-source framework and clients as well. So it's not a cloud offering per se. You can see the full implementations at:

* https://github.com/Azure/durabletask

* https://github.com/microsoft/durabletask-go

reply
KraftyOne
2 days ago
[-]
That's interesting, I'll take a look! I had always thought of it as an Azure-only thing.
reply
cpursley
2 days ago
[-]
I've been using https://www.pgflow.dev for workflows, which is built on pgmq, and I'm really impressed so far. Most of the logic is in the database, so I'm considering building an Elixir adapter DSL.
reply
mmcclure
2 days ago
[-]
Just curious, if you’re already in Elixir and using Postgres, why not use Oban[1]? It’s my absolute favorite background job library, and the thing I often miss most when working in other ecosystems.

[1] https://github.com/oban-bg/oban

reply
cpursley
1 day ago
[-]
Oban is awesome, but I really like the ideas around pgmq and most of the logic living in the database, and the "flows" design from pgflow for multi-step processes (which Elixir is naturally a great match for).
reply
sbrother
1 day ago
[-]
Oban is so good! My startup has an extensive graph of background jobs all managed by Oban, and it's just rock solid, simple to use and gets out of the way.
reply
ishita_julep
2 days ago
[-]
what are you using the DSL for?
reply
cpursley
2 days ago
[-]
It’s used to generate the database migration that defines the flows. More syntax sugar than anything.
reply
atombender
2 days ago
[-]
While DBOS looks like a nice system, I was really disappointed to learn that Conductor, which is the DBOS equivalent of the Temporal server, is not open source.

Without it, you get no centralized coordination of workflow recovery. On Kubernetes, for example, my understanding is that you will need to use a StatefulSet to assign stable executor IDs, which Conductor doesn't need.

I suppose that's their business model: provide a simplistic foundation where you have to pay money to get the grown-up stuff.

reply
jedberg
1 day ago
[-]
> Conductor, which is the DBOS equivalent of the Temporal server,

Just to clarify, Conductor is nothing like the Temporal server. In Temporal, the server is a critical component that stores all the running state and is required for Temporal to work (and it blocks your app from working if it's down).

Conductor is an out-of-band connector that gives Transact users access to the same observability and workflow management that DBOS Cloud users have, but it isn't required, and your app will keep working even if it breaks.

You can run a durable, and scalable, application with just Transact; it's just a lot harder without Conductor to help you.

You are correct that the business model is to provide add-ons for Transact applications, but I'd say it's unfair to call Transact a "simplistic foundation" and not "grown up".

Transact is absolutely enterprise-grade software that can run at massive scale.

reply
jumploops
2 days ago
[-]
I've been looking at migrating to Temporal, but this looks interesting.

For context, we have a simple (read: home-built) "durable" worker setup that uses BullMQ for scheduling/queueing, but all of the actual jobs are Postgres-based.

Due to the cron-nature of the many disparate jobs (bespoke AI-native workflows), we have workers that scale up/down basically on the hour, every hour.

Temporal is the obvious solution, but it will take some rearchitecting to get our jobs to fit their structure. We're also concerned with some of their limits (payload size, language restrictions, etc.).

Looking at DBOS, it's unclear from the docs how to scale the workers:

> DBOS is just a library for your program to import, so it can run with any Python/Node program.

In our ideal case, we can add DBOS to our main application for scheduling jobs, and then have a simple worker app that scales independently.

How "easy" would it be to migrate our current system to DBOS?

reply
mnahkies
1 day ago
[-]
As another commenter said, Temporal is quite tricky to self-host/scale in a cost-effective manner. This is also reflected in their cloud pricing (which should've been the warning sign for us, tbh).

Overall it's a pretty heavy/expensive solution, and I've come to the conclusion that its use is best limited to lower-frequency and/or higher-"value" (e.g. revenue or risk) tasks.

Orchestrating a food delivery that's paying you $3 of service fees - good use case. Orchestrating some high frequency task that pays you $3 / month - not so good.

reply
jscheel
1 day ago
[-]
This was my problem with Dagster, too. All the documentation and all the examples encourage you to split items into small discrete tasks. Then you realize that their cloud pricing is absolutely bonkers if you go over the paltry 30k credits… unless you sign up for a meaty annual enterprise contract.

Got a $500 bill for something like 13k executions over the limit. That's less than 45k executions in a month. Just for comparison, our main product's Sidekiq queue processes tens of millions of jobs every single day. Just a silly imbalance.

I ended up having to combine a bunch of tasks, to the point that I started asking myself why I was even bothering with using it at all.
reply
KraftyOne
2 days ago
[-]
I'd love to learn more about what you're building--just reach out at peter.kraft@dbos.dev.

One option is that you have DBOS workflows that schedule and submit jobs to an external worker app. Another option is that your workers use DBOS queues (https://docs.dbos.dev/python/tutorials/queue-tutorial). I'd have to better understand your use case to figure out what would be the best fit.

reply
blumomo
2 days ago
[-]
I'm also interested in what you think could become best practices for having (auto-scaling) worker instances that pick up DBOS workflows and execute them.

Do you think an app's backend (e.g. FastAPI) should be the DBOS client, submitting workflows to the DBOS instance? And then we could have multiple DBOS instances, each picking up jobs from a queue?

reply
KraftyOne
2 days ago
[-]
Yeah, I think in that case you should have auto-scaling DBOS workers all pulling from a queue and a FastAPI backend using the DBOS client to submit jobs to the queue.

Queue docs: https://docs.dbos.dev/python/tutorials/queue-tutorial

Client docs: https://docs.dbos.dev/python/reference/client
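The shape of that split is easy to sketch with plain stdlib pieces. To be clear, this is just an illustration of the pattern, not DBOS's actual API: in the real thing the queue lives in Postgres and `enqueue` would go through the DBOS client.

```python
import queue
import threading

jobs = queue.Queue()          # stand-in for a Postgres-backed DBOS queue
results = []
results_lock = threading.Lock()

def enqueue(task_id):
    """What the web backend (e.g. FastAPI) does via the client: submit and return."""
    jobs.put(task_id)

def worker():
    """Each auto-scaled worker instance pulls and executes jobs independently."""
    while True:
        task_id = jobs.get()
        if task_id is None:   # shutdown sentinel
            return
        with results_lock:
            results.append(f"done:{task_id}")

# Three "worker instances"; scaling up is just starting more of them.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for i in range(5):
    enqueue(i)
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
# All five jobs are processed exactly once, by whichever worker was free.
```

The point is that the submitter and the workers share nothing but the queue, so they can scale independently.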

reply
cyberpunk
1 day ago
[-]
Unless you're planning on using their (Temporal's) SaaS, you're in for building a very large database cluster if you need some scale.

(source: I run way more Cassandra than I ever thought reasonable)

reply
bogantech
1 day ago
[-]
Just got roped into setting up an on prem temporal cluster myself :(

What causes the need for massive database clusters? Now I'm worried this is going to fall apart on us in a very big way

reply
cyberpunk
19 hours ago
[-]
Take a look at the official "basic scaling" guide, especially the metric about state transitions per second.

To get an idea of what you'll need that metric to be, try running 1/10th of your workload as a benchmark against it.

In order for our particular setup to handle barely 5,000 of these, we have almost 100 CPUs just for Cassandra. To double that, it's 200 CPUs just for the database.

Oh, and make sure you get your history shard count right, as you can't change it without rebuilding the cluster.

Maybe it makes sense for low-volume, high-value jobs, e.g. Uber trips; for high-volume, low-value work this doesn't work economically.

We are likely to drop it.

reply
at0mic22
2 days ago
[-]
Every few years someone discovers FOR UPDATE SKIP LOCKED and re-presents it. I remember this going on for at least 15 years.
reply
atombender
2 days ago
[-]
The "someone" in this case happens to be Michael Stonebraker, the creator of Postgres and CTO of DBOS.
reply
digdugdirk
1 day ago
[-]
So glad someone else chuckled reading this. Two thumbs up for knowing better than the creator of the thing they're talking about!
reply
qianli_cs
2 days ago
[-]
Yup, some features are timeless and deserve a re-intro every now and then. SKIP LOCKED is definitely one of them.
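For anyone who hasn't seen it, the core job-claiming pattern is a single query. This is an illustrative sketch with a made-up `jobs` schema, not DBOS's actual implementation:

```python
# Claim one pending job atomically. Concurrent workers skip rows that
# another transaction has already locked, so no job is handed out twice
# and no worker blocks waiting for a lock.
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
"""

def claim_next_job(conn):
    """Run inside its own transaction with any DB-API driver (e.g. psycopg)."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_JOB_SQL)
        return cur.fetchone()  # None when the queue is empty
```

Each worker just loops on `claim_next_job`, committing after it finishes the job (or rolling back to release the row if it crashes mid-flight).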
reply
skrtskrt
2 days ago
[-]
with a nice NOWAIT when appropriate
reply
darkteflon
2 days ago
[-]
Often wondered whether it would be possible / advisable to combine DBOS with, e.g., Dagster if you have complex data orchestration requirements. They seem to deal with orthogonal concerns but complement nicely. Is integration with orchestration frameworks something the DBOS team has any thoughts on?
reply
KraftyOne
2 days ago
[-]
Would love to learn more about what you're building--what problems or parts of your system would you solve with Dagster vs DBOS?
reply
yencabulator
1 day ago
[-]
It's so weird that they went from "operating system with all OS state in a database" on VoltDB to "yet another TypeScript framework using Postgres" with the same name.

https://dbos-project.github.io/

reply
secondrow
2 hours ago
[-]
lol - don't overlook DBOS Cloud, a serverless compute platform, which also originated from the DBOS R&D project(s) at MIT/Stanford.
reply
agambrahma
2 days ago
[-]
Curious how this compares to Cloudflare, which is the other provider that's really going for simplified workflows
reply
tonyhb
2 days ago
[-]
Anything that guarantees exactly once is selling snake oil. Side effects happen inside any transaction, and only when it commits (checkpoints) are the side effects safe.

Want to send an email, but the app crashes before committing? Now you're at-least-once.

You can shrink the window that causes at-least-once semantics, but it's always there. For this reason, the blog post oversells the capabilities of these types of systems as a whole. DBOS (and Inngest, see the disclaimer below) try to get as close to exactly-once as possible, but the risk always exists, which is why you should always use idempotency keys in external API requests if they're supported. Defense in layers.

Disclaimer: I built the original `step.run` APIs at https://www.inngest.com, which offers similar things on any platform... without being tied to DB transactions.

reply
KraftyOne
2 days ago
[-]
As the post says, the exactly-once guarantee is ONLY for steps performing database operations. For those, you actually can get an exactly-once guarantee by running the database operations in the same Postgres transaction as your durable checkpoint. That's a pretty cool benefit of building workflows on Postgres! Of course, if there are side effects outside the database, those happen at-least-once.
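The mechanics are easy to demonstrate. Here's a toy sketch using sqlite3 as a stand-in for Postgres (the tables and checkpoint scheme are made up for illustration, not DBOS's actual schema): the business write and the step checkpoint commit in one transaction, so a recovered workflow sees the checkpoint and skips the step instead of repeating it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE step_checkpoints ("
             "workflow_id TEXT, step INTEGER, PRIMARY KEY (workflow_id, step))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

def run_step(workflow_id, step, amount):
    # Check the checkpoint first: a recovered workflow skips completed steps.
    done = conn.execute(
        "SELECT 1 FROM step_checkpoints WHERE workflow_id = ? AND step = ?",
        (workflow_id, step)).fetchone()
    if done:
        return
    # Business write and checkpoint land in the SAME transaction, so either
    # both are durable or neither is -- there's no window for a double debit.
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'alice'",
                 (amount,))
    conn.execute("INSERT INTO step_checkpoints VALUES (?, ?)", (workflow_id, step))
    conn.commit()

run_step("wf-1", 0, 30)
run_step("wf-1", 0, 30)  # simulated crash-and-recover re-execution
balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 'alice'").fetchone()[0]
# balance is 70, not 40: the step's DB effect happened exactly once
```

If the checkpoint lived in a separate store (as with an external orchestrator), the gap between the write and the checkpoint would reopen the at-least-once window.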
reply
tonyhb
2 days ago
[-]
You can totally leverage postgres transactions to give someone... postgres transactions!

I just figured the exactly-once semantics were prominent enough to be worth pointing out that any external side effects (which is what orchestration is for) aren't included in them, which is a big caveat.

reply
jedberg
2 days ago
[-]
> Anything that guarantees exactly once is selling snake oil.

That's a pretty spicy take. I'll agree that exactly-once is hard, but it's not impossible. Obviously there are caveats, but the beauty of DBOS using Postgres as the method of coordination, instead of an external server (like Temporal or Inngest), is that the exactly-once guarantees of Postgres can carry over to the application. Especially so if you're using that same Postgres to store your application data.

reply
rubenvanwyk
1 day ago
[-]
Been looking at DBOS for a while. Are there plans to port it to other languages such as Java or C#? Are you open to community ports?
reply
qianli_cs
1 day ago
[-]
Yeah, we plan to add more languages. DBOS currently supports Python and TypeScript, and Go and Java will be released soon. We're giving a preview of DBOS Java at our user group meeting on August 28: https://lu.ma/8rqv5o5z. You're welcome to join us! We'd love to hear your feedback.

We welcome community contributions to the open source repos.

reply
krashidov
2 days ago
[-]
How does this compare with Inngest or Restate? We currently use Inngest and it works great, but the TypeScript API is a bit clunky
reply
KraftyOne
2 days ago
[-]
Like Inngest and Restate, DBOS provides durable workflows. The difference is that DBOS is implemented as a Postgres-backed library you can "npm install" into your project (no external dependencies except Postgres), while Inngest and Restate require an external workflow orchestrator.

Here's a blog post explaining the DBOS architecture in more detail: https://www.dbos.dev/blog/what-is-lightweight-durable-execut...

Here's a comparison with Temporal, which is architecturally similar to Restate and Inngest: https://www.dbos.dev/blog/durable-execution-coding-compariso...

reply
hanikesn
1 day ago
[-]
First: It's great to have some serious competition in the durable execution market! You've built some impressive tech already! It took me a decent amount of time to investigate before pulling the trigger on Temporal at my current PSP fintech, as you end up with a pretty hard vendor lock-in, which has to be worth it.

I have to say the architecture blog post is very lightweight on details, which makes it hard to judge, so you have to dive into the docs for the spicy details. And I see a big showstopper for our use case:

1. Workflow versioning: it seems currently impossible to properly upgrade in-flight workflows without manually forking the workflow.

Also, the marketing material and docs are too handwavy about the idempotency and determinism constraints, which makes DBOS seem too good to be true. I'd also love to see higher-level abstractions around message sending. Temporal's signal, query, and update are essential building blocks for our use cases and enable some quite interesting patterns, as you can really view workflows as actors.

I hope you can keep innovating as temporal really is too heavy for a lot of use-cases.

See: https://docs.dbos.dev/production/self-hosting/workflow-recov... and https://www.dbos.dev/blog/handling-failures-workflow-forks

reply
jedberg
1 day ago
[-]
Hey there, DBOS CEO here. I really appreciate your feedback, super helpful! If you are willing, I'd love to get some more feedback from you, specifically around your showstoppers. I have contact info in my HN profile, or you can join our discord: https://discord.com/invite/jsmC6pXGgX

Thanks!

reply
abtinf
2 days ago
[-]
Why not just use Temporal?
reply
KraftyOne
2 days ago
[-]
We wanted to make workflows more lightweight -- we're building a Postgres-backed library you can add to your existing application instead of an external orchestrator that requires you to re-architect your system around it. This post goes into more detail: https://www.dbos.dev/blog/durable-execution-coding-compariso...
reply