I evaluated Temporal, Trigger, Cloudflare Workflows (strongly not recommended), etc., and this was the easiest to implement incrementally. Didn't need to change our infrastructure at all; just plugged the worker in where I had Graphile Worker.
The hosted service UX and frontend could use a lot of work, though it isn't necessary to use them. OTEL support was there.
You are limited to 128 MB of RAM, which means everything has to be streamed. You will rewrite your code around this because many Node libraries don't have streaming alternatives for things that spike memory usage.
The observability tab was buggy. The lifecycle chart is hard to understand if you want to know when things will be evicted. There are a lot of small hidden limitations. Rate limits are very low for any mid-scale application. Full Node compatibility is not there yet (work in progress), so we needed to change some modules.
Overall, a gigantic waste of time unless you are doing something small scale. Just go with Restate/Upstash + Lambdas/Cloud Run if you want a simpler experience that scales in a serverless manner.
We had already checkpointed the agent, but then figured it's better to have a generic abstraction for the other stuff we do.
What made you opt for DBOS over Temporal?
Didn't face any issues though. Temporal's observability and UI were better than DBOS's; it was just harder to do an incremental migration in an existing codebase.
This is how you write a technical article. Thanks to the author for the nice read :)
Perhaps the only difference is that Azure Durable Functions has more syntactic sugar in C# (versus DBOS's choice of Python) to preserve call results in persistent storage? Where else do they differ? In the end, all of them seem to be doing what Temporal is doing (which has its own shortcomings, and it's also possible to get it wrong if you call a function directly instead of invoking it via an Activity, etc.)?
Without it, you get no centralized coordination of workflow recovery. On Kubernetes, for example, my understanding is that you will need to use a StatefulSet to assign stable executor IDs, which the Conductor doesn't need.
I suppose that's their business model, to provide a simplistic foundation where you have to pay money to get the grown up stuff.
Just to clarify, Conductor is not anything like the Temporal server. In Temporal, the server is a critical component that stores all the running state and is required for Temporal to work (and blocks your app from working if it's down).
Conductor is an out of band connector to give Transact users access to the same observability and workflow management as DBOS Cloud users have, but it isn't required and your app will keep working even if it breaks.
You can run a durable and scalable application with just Transact; it's just a lot harder without Conductor to help you.
You are correct that the business model is to provide add-ons for Transact applications, but I'd say it's unfair to call Transact a "simplistic foundation" and not "grown up".
Transact is absolutely Enterprise grade software that can run at massive scale.
For context, we have a simple (read: home-built) "durable" worker setup that uses BullMQ for scheduling/queueing, but all of the actual jobs are Postgres-based.
Due to the cron-nature of the many disparate jobs (bespoke AI-native workflows), we have workers that scale up/down basically on the hour, every hour.
Temporal is the obvious solution, but it will take some rearchitecting to get our jobs to fit their structure. We're also concerned with some of their limits (payload size, language restrictions, etc.).
Looking at DBOS, it's unclear from the docs how to scale the workers:
> DBOS is just a library for your program to import, so it can run with any Python/Node program.
In our ideal case, we can add DBOS to our main application for scheduling jobs, and then have a simple worker app that scales independently.
How "easy" would it be to migrate our current system to DBOS?
Overall it's a pretty heavy/expensive solution, and I've come to the conclusion its use is best limited to lower-frequency and/or higher-"value" (e.g. revenue or risk) tasks.
Orchestrating a food delivery that's paying you $3 of service fees - good use case. Orchestrating some high frequency task that pays you $3 / month - not so good.
One option is that you have DBOS workflows that schedule and submit jobs to an external worker app. Another option is that your workers use DBOS queues (https://docs.dbos.dev/python/tutorials/queue-tutorial). I'd have to better understand your use case to figure out what would be the best fit.
Do you think an app's (e.g. FastAPI) backend should be the DBOS Client, submitting workflows to the DBOS instance? And then we can have multiple DBOS instances, each picking up jobs from a queue?
Queue docs: https://docs.dbos.dev/python/tutorials/queue-tutorial Client docs: https://docs.dbos.dev/python/reference/client
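For the queue option, here's a minimal sketch of how I'd expect it to look, loosely following the queue tutorial (treat names and exact signatures as approximate): the main app enqueues tasks on a DBOS queue backed by Postgres, and any process running the library can pick them up.

    from dbos import DBOS, Queue

    DBOS()  # reads dbos-config.yaml / environment for the Postgres connection

    queue = Queue("example_queue")

    @DBOS.step()
    def process_job(job_id: str) -> str:
        # ...the actual work for one job...
        return f"done:{job_id}"

    @DBOS.workflow()
    def run_jobs(job_ids: list[str]) -> list[str]:
        # Enqueue each job; any DBOS process sharing this Postgres can execute it.
        handles = [queue.enqueue(process_job, job_id) for job_id in job_ids]
        # Wait for results so the workflow completes only when all jobs are done.
        return [h.get_result() for h in handles]

    if __name__ == "__main__":
        DBOS.launch()
        print(run_jobs(["a", "b", "c"]))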
(source: i run way more cassandra than i ever thought reasonable)
What causes the need for massive database clusters? Now I'm worried this is going to fall apart on us in a very big way
To get an idea of what you'll need that metric to be, try running 1/10th of your workload as a benchmark against it.
For our particular setup to handle barely 5,000 of these, we have almost 100 CPUs just for Cassandra. To double this, it's 200 CPUs just for the database.
Oh, and make sure you get your history shard count right, as you can't change it without rebuilding it.
Maybe it makes sense for low-volume, high-value jobs, e.g. Uber trips; for high-volume, low-value work this doesn't work economically.
We are likely to drop it.
Want to send an email, but the app crashes before committing? Now you're at-least-once.
You can compress the window that causes at-least-once semantics, but it's always there. For this reason, this blog post oversells the capabilities of these types of systems as a whole. DBOS (and Inngest, see the disclaimer below) try to get as close to exactly once as possible, but the risk always exists, which is why you should always try to use idempotency in external API requests if they support it. Defense in layers.
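To make that concrete, here's a rough sketch of idempotency in an external API request (the payment endpoint and key scheme are made up): derive a stable key from data you already have, so a retried step can't perform the side effect twice.

    import requests

    def charge_customer(order_id: str, amount_cents: int) -> dict:
        # Derive the idempotency key from existing data so a retry after a
        # crash reuses the same key instead of minting a new one.
        idempotency_key = f"charge-{order_id}"
        resp = requests.post(
            "https://payments.example.com/v1/charges",  # hypothetical provider
            headers={"Idempotency-Key": idempotency_key},
            json={"amount": amount_cents, "order_id": order_id},
            timeout=10,
        )
        resp.raise_for_status()
        # If the first attempt succeeded before the crash, the provider returns
        # the original charge instead of creating a second one.
        return resp.json()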
Disclaimer: I built the original `step.run` APIs at https://www.inngest.com, which offers similar things on any platform... without being tied to DB transactions.
I just figured the exactly-once semantics were worth discussing because external side effects (which is what orchestration is for) aren't included in them, which is a big caveat.
That's a pretty spicy take. I'll agree that exactly-once is hard, but it's not impossible. Obviously there are caveats, but the beauty of DBOS using Postgres as the method of coordination instead of an external server (like Temporal or Inngest) is that the exactly-once guarantees of Postgres can carry over to the application. Especially so if you're using that same Postgres to store your application data.
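Roughly, and treating the exact Python API as approximate: a transaction step writes your application data and DBOS's checkpoint for that step in the same Postgres transaction, so a recovered workflow never repeats the write.

    import sqlalchemy
    from dbos import DBOS

    @DBOS.transaction()
    def record_order(order_id: str, amount_cents: int) -> None:
        # The INSERT and the step's checkpoint commit together in one Postgres
        # transaction: either both happen or neither does.
        DBOS.sql_session.execute(
            sqlalchemy.text(
                "INSERT INTO orders (id, amount_cents) VALUES (:id, :amt)"
            ),
            {"id": order_id, "amt": amount_cents},
        )

    @DBOS.workflow()
    def checkout(order_id: str, amount_cents: int) -> None:
        record_order(order_id, amount_cents)
        # Later steps with external side effects (emails, charges) still
        # benefit from idempotency keys, as discussed upthread.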
We welcome community contributions to the open source repos.
Here's a blog post explaining the DBOS architecture in more detail: https://www.dbos.dev/blog/what-is-lightweight-durable-execut...
Here's a comparison with Temporal, which is architecturally similar to Restate and Inngest: https://www.dbos.dev/blog/durable-execution-coding-compariso...
I have to say the architecture blog post is very lightweight on details, which makes it hard to judge, so you have to dive into the docs for the spicy details, and I see a big showstopper for our use case:
1. Workflow versioning: It seems currently impossible to properly upgrade an in-flight workflow without manually forking it.
Also, the marketing material and docs are too handwavy on the idempotency and determinism constraints, which makes DBOS seem too good to be true. I'd also love to see higher-level abstractions around message sending. Temporal's signal, query, and update are essential building blocks for our use cases and allow some quite interesting patterns, as you can really view workflows as actors.
I hope you can keep innovating as temporal really is too heavy for a lot of use-cases.
See: https://docs.dbos.dev/production/self-hosting/workflow-recov... https://www.dbos.dev/blog/handling-failures-workflow-forks
Thanks!