Show HN: Lightless Labs Refinery – multi-model consensus and synthesis
2 points | 2 hours ago | 1 comment | github.com
Hi!

Over the past few weeks I (mostly Claude) cobbled together a Rust library + CLI that runs the same prompt across multiple models, through multiple rounds of iterative consensus.

Each model is fed the same initial prompt and produces an answer; then every model independently reviews and scores each of the other models' answers. The original prompt, the previous answer, and the reviews are then fed back to the models for the next round, until either one model "wins" two rounds in a row or a round limit is reached.

It did quite well on the car wash test (https://github.com/Lightless-Labs/refinery?tab=readme-ov-fil...). Most models answer badly at first, but it only takes one good answer for all of them to converge quickly on better ones. Although, to my initial surprise, adding more models quickly breaks the current voting+threshold selection strategy.

I also recently added a synthesis mode, which does the same thing but adds a synthesis round at the end: each model produces a synthesis of all the answers that scored above the threshold in the final round, followed by one last review round.

The total number of calls quickly blows up with rounds and model count, but it's been fun!

Currently, I'm racking my brain trying to figure out a way to select for both diversity and quality for a "brainstorm" process. If you have any ideas on that, or on other features, let me know!

ad-tech
1 hour ago
The voting thing breaks because you're treating all models equally when they shouldn't be. We ran consensus logic like this on a smaller scale and quickly realized throwing 5 mediocre models at a problem just makes them argue in circles. One good model always beats three bad ones. The synthesis round will get expensive fast too - we started with 2 models doing 3 rounds and it was already costing 40x a single pass. For brainstorm mode maybe weight models by past accuracy instead of pure voting? We do this with our team internally - the person who got it right last time gets listened to more next time, not equal voice to everyone. Could be interesting to test.