Yes: structure beats vibes. Primacy/recency bias is real. Treating prompts as engineering artifacts is empirically helpful.
But:
- “Reads all tokens at once.” Not quite. Decoder LLMs do a parallel prefill over the prompt, then a sequential token-by-token decode with causal masks (toy sketch after this list). That nuance is why primacy/recency and KV-cache behavior matter, and why instruction position can swing results.
- Embeddings & “labels.” Embeddings are learned via self-supervised next-token prediction, not from a labeled thesaurus. “Feline ≈ cat” emerges statistically, not by annotation.
- "Structure >> content". Content is what actually matters. “Well-scaffolded wrong spec” will give you confidently wrong output.
- Personas are mostly style. Yes, users tend to prefer output written in that register, but a persona can also hide information that a “senior software engineer” supposedly might not know.
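
A toy way to picture the prefill/decode split from the first bullet (my own sketch, not from the article): the causal mask only lets position i attend to positions 0..i, prefill fills the whole triangle in one batched pass, and decode then appends one row at a time against the KV cache.

```python
import torch

# Toy illustration: a causal mask over a 5-token prompt.
# Row i marks which positions token i is allowed to attend to.
T = 5
causal_mask = torch.tril(torch.ones(T, T)).bool()
print(causal_mask)
# Prefill computes all T rows in one parallel pass over the prompt;
# decode then adds one row per generated token, attending to the
# cached keys/values (the KV cache) of everything before it.
```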
I don't really get the Big-O analogy either. Models are constantly evolving and shifting how they direct attention, which is exactly the opposite of the durably true nature of algorithmic complexity. Memorizing how current models like their system prompts written is hardly the same thing.
The headers go from plain numbered sections to a single section with sub-numbers: Section 3 has 3.1-3.4, and then the next section drops the scheme.
I noticed this when doing large-scale LaTeX documentation builds: if you are not explicit about the formatting, the well-scaffolded build falls apart because the token count gets too high, there's no proper batch process in place, or the prompt is vague. "Use clear headings and bullet points" is not precise. Depending on your document type, you need to state every requirement explicitly to design with attention in mind.
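
To make that concrete, here's a made-up before/after (the section rules and limits are placeholders of mine, not a recommendation from the article):

```python
# Hypothetical example of replacing a vague formatting instruction
# with explicit, checkable requirements for a LaTeX documentation build.
VAGUE = "Use clear headings and bullet points."

EXPLICIT = """\
Formatting requirements for the generated LaTeX section:
- Use \\section for top-level headings and \\subsection below that; no deeper nesting.
- Every list must be an itemize environment with at most 7 items.
- Keep each section under 300 words; split longer material into subsections.
- Output body content only: no preamble, no \\begin{document}.
"""
```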
> 1. the branch of science and technology concerned with the design, building, and use of engines, machines, and structures.
> 2. the action of working artfully to bring something about.
So you're trying to learn what / how to prompt, in order to "bring something about" (the results you're after).
-A Prompt Architect
It's funny because of the irony: "prompt engineering" is about as close to cargo culting as things get. No one knows what the model is or how it's structured at a higher (non-implementation) level; people just try different things until something works, and copy what they've seen other people do.
This article is at least interesting in that it takes a stab at explaining prompt efficacy with some sort of concrete basis, even if it lacks rigor.
It's actually a really important question about LLMs: how are they to be used to get the best results? All the work seems to be on the back end, but the front end is exceedingly important. Right now it's some version of Deep Thought spitting out the answer '42'.
Seems like an LLM should be able to judge a prompt, and collaboratively work with the user to improve it if necessary.
https://www.dbreunig.com/2025/06/10/let-the-model-write-the-... is an example.
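
A minimal sketch of that judge-then-improve loop (the critique wording and the ask_model placeholder are mine, not from the linked post):

```python
# Ask the model to review a draft prompt before running it.
# `ask_model` is a stand-in for whatever client/API you already use.
def build_critique_request(draft_prompt: str) -> str:
    return (
        "Review the prompt below before I run it. List ambiguities, missing "
        "context, and conflicting instructions, then propose a revised version. "
        "Do not answer the prompt itself.\n\n"
        "--- PROMPT UNDER REVIEW ---\n"
        f"{draft_prompt}"
    )

draft = "Summarize the attached report and make it good."
critique_request = build_critique_request(draft)
# feedback = ask_model(critique_request)  # placeholder: call your model here
```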
You can see the hands-on results in this Hugging Face branch I was messing around in.
Here is where I tell the LLM to generate prompts for me, based on the research so far:
https://github.com/AlexChesser/transformers/blob/personal/vi...
Here are the prompts that produced:
https://github.com/AlexChesser/transformers/tree/personal/vi...
and here is the result of those prompts:
https://github.com/AlexChesser/transformers/tree/personal/vi.... (also look at the diagram folders, etc.)
Write your prompt in some shape and ask Grok:
> Please rewrite this prompt for higher accuracy
> -- Your prompt
How do you know it won't introduce misinformation about white genocide into your prompt?
We have to read the result, of course.
A lot of what I received as input was more like the first type of instruction; what I sent to the actual development team was closer to the second.
For example, pass an LLM a JSON structure with keys but no values and it tends to do a much better job populating the values than trying to fully generate complex data from a prompt alone. Then you can take that populated JSON to do something even more complex in a second prompt.
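
Roughly like this (the keys and notes below are placeholders I made up to illustrate the pattern):

```python
import json

# Give the model the exact shape you want filled in, then parse the reply.
skeleton = {
    "title": "",
    "summary": "",
    "risks": [],
    "next_steps": [],
}

notes = "...your source text here..."
prompt = (
    "Fill in every field of this JSON object based on the notes below. "
    "Return only valid JSON with exactly these keys.\n\n"
    f"{json.dumps(skeleton, indent=2)}\n\n"
    f"Notes:\n{notes}"
)
# populated = json.loads(ask_model(prompt))  # then feed `populated` into the second, more complex prompt
```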
I found the “Big-O” analogy a bit strained & distracting, but still a good read.
... but you know how editors are, writing the headline for clicks against the wishes of the journalist who wrote the article. You'll always see journos saying stuff like "don't blame me, that's my editor, I don't write the headlines."
I did toy with the idea of going with something like: `Prompt Engineering is a wrapper around Attention.`
But my editor overruled me *FOR THE CLICKS!!!*
Full disclosure: I'm also the editor
“This is why prompt structure matters so much. The model isn’t reading your words like you do — it’s calculating relationships between all of them simultaneously. Where you place information, how you group concepts, and what comes first or last all influence which relationships get stronger weights. This is why prompt structure matters so much. The model isn’t reading your words like you do — it’s calculating relationships between all of them simultaneously”
Reprimand the editor. ;)
I look forward to using the ideas in this, but would be much more excited if you could benchmark these concepts somehow, and provide regular updates about how to optimize.
Because Medium is such a squirrely interface, I find myself writing in Markdown in VS Code and then copying and pasting sections across. If I make an edit after I've started inserting images and embedding the gists, it gets a bit manual.
Your comment, along with another one about finding a way to compare the outputs of the good/bad prompts side by side: 100% agree. This could be more robust.
While I'm running a process transformation with production teams in small, isolated experimental groups, I can say I'm getting really great feedback so far, both on the proprietary work happening at the job and from the engineers I've shared this with in the wider industry.
Feedback from colleagues who have started taking "selected pieces" from the "vibe engineering" flow (https://alexchesser.medium.com/vibe-engineering-a-field-manu...) has been really positive.
> @Alex Chesser i've started using some of your approach, in particular having the agent write out a plan of stacked diffs, and then having separate processes to actually write the diffs, and it's a marked improvement to my workflow. Usually the agent gets wonky after the context window fills up, and having the written plan of self contained diffs helps a lot with 'checkpoints' so I can just restart at any time! Thanks!
from someone else:
> I just went through your first two prompts and I'm blown away. I haven't done much vibe coding yet as I've gotten initial poor results and don't trust the agent to do what I want. But the output for the architecture and the prompts are mind blowing. This tutorial is giving me the confidence to experiment more.
Benchmarking feedback vs. qualitative devex feedback is definitely a thing, though.
editor's note: title also chosen for the clicks.
Can you support that assertion in a more rigorous way than "when I do that I seem to get better results?"
I'm glad I didn't, though, because I had no idea how the LLM is actually interpreting my commands, and have been frustrated as a result.
Maybe a title like "How to write LLM prompts with the greatest impact", or "Why your LLM is misinterpreting what you say", or something along those lines.
Do you have a preference, in a more continual system like Claude Code, for one big prompt, or do you just try to do one task and then start something new?
One of the things I think is pretty great about being able to share these particular prompts is that you can run them on one of your own repos to see how they turn out.
ACTUALLY!! Hold on. A couple weekends ago I spent some time doing some underlying research with huggingface/transformers and I have it on a branch.
https://github.com/AlexChesser/transformers/tree/personal/vi...
You can look at the results of an architectural research prompt.
Unfortunately I don't have a "good mode" side by side with a "bad mode" at the moment. I can work on that in the future.
The underlying research linked has the experimental design version of this with each piece evaluated in isolation.
LLMs are random and unpredictable, the opposite of what real engineering is. We'd better start using terms like "invocations", "incantations", "spells", or "rain dances"/"rituals" to describe how to effectively "talk" to LLMs, because a science it most definitely isn't.
And yeah, taking the extra five seconds to do the bare minimum in structuring your communication will yield better results in literally any effort. I don't see why this concept deserves an article.
PS: I am also extremely triggered by the idea of comparing Big-O, a scientific term and exact concept with well-understood and predictable outcomes, to "prompt engineering," which is basically "my random thoughts and anecdotal biases about how to communicate better with one of the many similar-but-different fancy autocompletes with randomness built in."
In addition, for the technical aspect to make sense, a more effective article would show the points alongside evals. For example, if you're trying to make a point about where to put important context in the prompt, show a classic needle-in-the-haystack eval, or a Jacobian matrix, alongside the results. Otherwise it's largely more prompt fluff.
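
For what it's worth, a bare-bones needle-in-the-haystack check is easy to sketch; the needle, filler, and ask_model placeholder here are all mine:

```python
# Plant one fact ("needle") at different depths in filler text and check
# whether the model's answer still recovers it.
NEEDLE = "The access code for the staging server is 7491."
FILLER = "The quarterly report covered routine operational updates. " * 400

def build_prompt(depth: float) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    return haystack + "\n\nQuestion: What is the access code for the staging server?"

def score(answer: str) -> bool:
    return "7491" in answer

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(depth)
    # answer = ask_model(prompt)   # placeholder: call your model here
    # print(depth, score(answer))
```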
Not to doubt you, but could you explain why that is?