First prompt validates the input. Second prompt starts the actual content generation.
Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK". In the SSE, encode the chunks of each stream with a stream ID. You can also handle it on the server side by cancelling execution once the first stream ends.
Generally, the experience is good because the validation prompt is shorter and faster to last (and only) token.
The SSE stream ends up like this:
data: ing|tomatoes
data: ing|basil
data: ste|3. Chop the
I have a writeup (and repo) of the general technique of multi-streaming: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond... (animated gif at the bottom).This is hard to fix because if you don't wait until you have enough context, you've given your censor a hair trigger.
> Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK".
Just a note that this particular implementation has the additional problem of not actually applying your validation stream at the API level, which means your service can and will be abused worse than it would be if you combined the streams server-side. You should never rely on client-side validation for security or legal compliance.
For most consumer use cases, it probably doesn't matter if a few tokens leak before the about, especially if they're not rendered.
Tune it to your needs :)
As far as I know, there's no way of validating a streamed response until those tokens have already been streamed unfortunately. You could try buffering the stream in larger chunks before displaying them on screen in the hopes that you might be able to catch it earlier, but that's not going to be a great user experience either.
You could of course use us and get that out of the box if you have access to Databricks.
The core concept is to pass information into the model using a cipher. One that is not too hard that it can't figure it out, but not too easy as to be detected.
And yes, o1 was jailbroken shortly after release: https://x.com/elder_plinius/status/1834381507978280989
1 - the first option is to break this in to three prompts. The first prompt is either write a brief version, an outline of the full response, or even the full response. The second prompt is a validator, so you pass the output of the first to a prompt that says "does this follow the instructions. Return True | False." If True, send it to a third that says "Now rewrite this to answer the user's question." If False, send it back to the first with instructions to improve the response. This whole process can mean it takes 30 seconds or longer before the streaming of the final answer starts.
There are plenty of variations on the above process, so obviously feel free to experiment.
2 - The second option is to have instructions in your main prompt that says "Start each response with an internal dialogue wrapped in <thinking> </thinking> tags. Inside those tags first describe all of the rules you need to follow, then plan out exactly how you will respond to the user while following those rules."
Then on your frontend have the UI watch for those tags and hide everything between them from the user. This method isn't perfect, but it works extremely well in my experience. And if you're using a model like gpt-4o or claude 3.5 sonnet, it makes it really hard to make a mistake. This is the approach we're currently going with.
add some latency to the first token and then "stream" at the rate you received tokens even though the entire thing (or some sizable chunk) has been generated. that'll give you the buffer you need to seem fast while also staying safe.
Give examples of how the LLM should respond. Always give it a default response as well (e.g. "If the user response does not fall into any of these categories, say x").
> I can manually add validation on the response but then it breaks streaming and hence is visibly slower in response.
I've had this exact issue (streaming + JSON). Here's how I approached it: 1. Instruct the LLM to return the key "test" in its response. 2. Make the streaming call. 3. Build your JSON response as a string as you get chunks from the stream. 4. Once you detect "key" in that string, start sending all subsequent chunks wherever you need. 5. Once you get the end quotation, end the stream.
Perfect is the enemy of good enough.
I'd fed in a raw transcript and I was asking it to do some basic editing, remove ums and ahs, that kind of thing.
It had streamed about 80% of the episode when it got to a bit where the podcast guest started talking about "bombing a data center"... and right in front of my eyes the entire transcript vanished. Claude effectively retracted the entire thing!
I tried again in a fresh window and hit Ctrl+A plus Ctrl+C while it was running to save as much as I could.
I don't think the latest version of Claude does that any more - if so, I've not seen it.
You're right - prompt eng. alone doesn't work. It's brittle and fails on most evals.
Ping me at shaunayrton@galini.ai