Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.
Drug trials may be expected to be somewhat reproducible.
What I don't get is how it can even be called research if it cannot be expected to be reproducible at all!
GPT is a closed-source, closed-weights, proprietary product that changes every couple of weeks or so. How can you expect a prompt to behave the same way for long enough for the research to be even rudimentarily reproducible? And if it's not reproducible, what is it actually worth? I don't think much. It could just as well have been a fault in the research setup, or a fake.
Do you have any evidence that the weights for versioned models are being changed without notifications?
I think in a real scientific process, the burden is on those who claim the weights are not changing to provide the evidence.
You can see this because it flips around just as easily - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?
Aside - but even if the model weights did change, that wouldn't stop research being possible. Otherwise no drug trial could be replicated because you couldn't get the exact same participants at the exact same age.
Wait a minute. The author of such a paper makes a claim about an observation that rests on the assumption that the studied model is well defined. I am disputing that claim, since no evidence has been given that the model is well defined - no definition has been provided at all.
If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence. That's not called science. That's called authority.
You are entirely within your rights to say that the authors have assumed that openai is not lying about their models. They've probably also assumed that other paper authors are not lying in their papers.
You then say however:
> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.
And when I ask for evidence of this very specific claim, you turn around and say the burden is on me to show that you're lying. That is what is butchering the concept of burden of proof.
> If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence.
Absolutely not.
A company's statement about its own proprietary product is not acceptable evidence in a scientific context. No need to allege that anyone is lying. Lying is irrelevant. What's relevant is that the research is falsifiable. It cannot be falsifiable if you don't know what the actual model is at a given point in time.
You couldn’t get the same participants, but you could get the same drugs. If you could get identical participants, that wouldn’t be very helpful since humans are so varied.
But for GPT based papers, what you’re actually testing could change without you knowing. There’s no way to know if a paper is reproducible at all.
If you can’t reproduce results, is it really research, or just show and tell?
You can't start with the statement that clinical trials aren't perfectly reproducible and that's fine, and then say this.
> what you’re actually testing could change without you knowing
Only if people are lying about an extremely important part of their product, which they have little reason to do. But then this applies to pretty much everything. Starting with the assumption that people are lying about everything and nothing is as it seems may technically make things more reproducible, but it's going to require unbelievable effort for very little return.
> There’s no way to know if a paper is reproducible at all.
This is a little silly because these models are available extremely easily and at pay-as-you-go pricing. And again, it requires an assumption that OpenAI is lying about a specific feature of a product.
Nobody said that to begin with. Re-read their comment.
> If people are lying about an extremely important part of their product [...]
Nobody is alleging that anyone is lying. It's just that we cannot be sure what the research actually refers to, because of the nature of a proprietary/closed model.
> This is a little silly because these models are available extremely easily and at a pay-as-you-go pricing.
What does this have to do with the parent comment? I don't think it's appropriate to call anyone here silly, just because you don't like their comment and don't have good counter arguments.
Let's be clear, you have made an explicit claim that openai are lying.
> What does this have to do with the parent comment?
Because many other fields would kill for this level of reproducibility: grab an API key, spend a few quid running a script, and you can get the results yourself.
Why would they lie about it?
The whole point of these model versions is that when you build on top of one, it keeps working as you expect.
- realizing their model leaks confidential information against malicious prompts
- copyright claims against them forcing them to remove bits of data from the training set
- serious "alignment" bugs that need to be fixed
- vastly improved optimization techniques that slightly affect results in 0.1% of the cases
If updating the model would save the company a couple hundred million dollars, they might want to do it. And in some of the cases, I can imagine they have an incentive to keep the update low key.
That's why you capture multiple of them and verify your data statistically?
The problem is that what works on small LLMs does not necessarily scale to larger ones. See page 35 of [1] for example. A researcher only using the models of a few years ago (where the open models had <1B parameters) could come to a completely incorrect conclusion: that language models are incapable of generalising facts learned in one language to another.
The Twitter user doesn't even reference a single specific paper, kind of doing some hand wavy broad generalizations of his worst antagonists. So who really knows what he's talking about? I can't say.
If he means papers like the ones in this search - https://arxiv.org/search/?query=step+by+step+gpt4&searchtype... - they're all kind of interesting, especially https://arxiv.org/abs/2308.06834 which is the kind of "new prompt" class he's directly attacking. It is interesting because it was written by some doctors, and it's about medicine, so it has some interdisciplinary stuff that's more interesting than the computer science stuff. So I don't even agree with the premise of what the Twitter complainer is maybe complaining about, because he doesn't name a specific paper.
Anyway, to your original point: if we're comparing the research I linked and astronomy... well, they're completely different, and it is intellectually dishonest to compare the two. Tell me how I use astronomy research later in product development or whatever? Maybe in building telescopes? How does observing a supernova suggest new telescopes to build in the future, without suggesting that indeed I will be reproducing the results, because I am building a new telescope to observe another such supernova? Astronomy cares very deeply about reproducibility - a different kind of reproducibility than these papers, but maybe more alike in interesting ways than the non-difference you're talking about. I'm not an astronomer, but if you want to play the insight porn game, I'd give these people the benefit of the doubt.
Sorry, I won't blindly believe a company who are cynical enough to call themselves "OpenAI", then publish a commercial closed source/weights model for profit.
Show evidence that they do not change without notice, or it didn't happen. Better yet, provide the source and weights for research purposes. These models could be pulled at any instant if the company sees fit or ceases to exist.
Seems to have hit hard?
I would find it borderline acceptable to be offended by a user whose name has obviously been generated with a password generator if you could at least provide some substance to the discussion. Just labeling someone and questioning their competence based on your hurt feelings is a bit low. Please improve.
I think this depends a lot on the "culture" of the subject area. For example, in mathematics it is common that only new results that have been thoroughly worked through are considered publish-worthy.
In other fields, by contrast, even the intermediate steps of the
- hypothesis building
- experimental design
- doing experiments
- analyzing the experimental results
- doing new experiments
- analyzing in which sense the collected data support the hypothesis or not
- ...
work are considered publishable.
Even the mere presence of data and data visuals is enough to legitimize what you're selling in the eyes of the prospect. When the prevailing religion is Scientism, data bestows that blessing of authority and legitimacy upon whatever it is you're trying to sell. Show and tell whatever conclusions you'd like from the data - the soundness of the logic supporting that conclusion is irrelevant. All that matters is that you did the ritual of measuring and data-gathering and graph-ifying and putting it on display for the prospect.
There's a great book, How to Lie with Statistics, that covers this particular case, but demonstrates other popular ways in which data and data visuals are manipulated to sell things.
You can turbo boost your career by mastering the art of “data ritual”. It doesn’t matter what the results are or magnitude of impact or what it cost to build and launch something. Show your results in a pretty way that looks like you did your diligence and you will be celebrated.
I'm not sure this is true.
While modern Adderall has a closely controlled mixture of multiple enantiomers, it hasn't always been this way.
Medicine historically didn't care nearly as much about racemic mixtures, or about the possibility of toxicity from a particular stereoisomer (e.g., Thalidomide).
Many drugs in modern human history, including mixed amphetamine salts, have been marketed with very little concern for racemic purity.
If you do the rigor on why something really is interesting, publish it.
Doubly so when there's a new breakthrough, where one of your low-effort papers might end up being the first saying something obvious that ends up being really important. Because then everyone will end up quoting your paper in perpetuity.
If OpenAI disappears tomorrow, papers on GTP-4 will likely be of little to no value, which is another tell of a non-scientific exploration.
(note: not all explorations are scientific, and that is great! Science is just one of many tools for exploring lived reality.)
The idea that science has to happen in a lab is of course absurd as well.
The main point here is that anyone can likely start studying those endangered species and try to reproduce the results, while with GPT4 it is not possible at all. The lab point is related to the fact that we are talking about software here.
With black-box generated models, there is no way to tell how they have actually been generated. That has no value for the science of how to improve them further.
As an aside, I was surprised at the repeated misspelling of GPT-4 and take it as a heuristic that this comment was likely written by a real human :)
https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
I do the whole NLP publishing thing and I've hesitated to "write a paper" about applying techniques already known and used everywhere in the Stable Diffusion community to NLP models. That said, the AI community loves to pretend like they discovered something, such as a recent paper purporting to be the first to do "concept slider" LoRAs, despite these existing for months before that work was published on Civit.ai. The authors of course didn't cite those already existing models.
Everyone is chasing citations and clout hard right now because these professors and researchers realize that they only have 5-10 years before AI eats their jobs and most other white-collar jobs. I don't blame them. I want my mortgage paid off before I'm automated away!
As a result, typical professions that used to confer prestige, and for which prestige was supposed to be just reward, such as a professor, a medical doctor, a judge, are now mainly pursued for pecuniary reasons (money, job security). And because they're not doing it for prestige, they don't necessarily care about being right/correct. Playing the game to maximize the revenue streams is paramount. I happen to know a number of faculty who are quite proud of their multiple revenue streams. This would be unthinkable for an academic 50 years ago.
For code generation, GPT4 is getting beat by the small prompt library LATS wrapped around GPT3.5. Given the recent release of MagicCoder / Instruct-OSS, that means a small prompt library + a small 7B model you can self-host beats the much fancier GPT4.
Similar to when simple NNs destroyed a decade of Bayesian modeling theses & research programs, it's frustrating for folks going other paths. But it doesn't make the work 'wrong'.
The link you shared doesn’t quite reflect this. Omitting other models…
LATS (gpt-4): 94.4
Reflexion (gpt-4): 91.0
gpt-4: 86.6
…
LATS (gpt-3.5): 83.8
…
zero-shot (gpt-4): 67.0
zero-shot (gpt-3.5): 48.1
I’m not quite sure how to translate leaderboards like these into actual utility, but it certainly feels like “good enough” is only going to get more accessible and I agree with what I think is your broader point - more sophisticated techniques will make small, affordable, self-hostable models viable in their own right.
I’m optimistic we’re on a path where further improvement isn’t totally dependent on just throwing money at more parameters.
Given standalone GPT3.5 is "just" 48, it's less about beating and more about meeting.
RE:Good Enough & Feel... very much agreed. I find it very task dependent!
For example, GPT4 is 'good enough' that developers are comfortable copy-pasting & trying, even vs stack overflow results. We haven't seen LATS+MagicCoder yet, but as MagicCoder 7b already meets+exceeds GPT3.5 for HumanEval, there's a plausible hope for agent-aided GPT4-grade tools being always-on for all coding tasks, and sooner vs later. We made that bet for Louie.AI's interactive analyst interface, and as each month passes, evidence mounts. We can go surprisingly far with GPT3.5 before wanting to switch to GPT4 for this kind of interaction scenario.
Conversely... I've yet to see a true long-running autonomous coding autoGPT where the error rate doesn't kill it. We're experimenting with design partners on directions here -- think autonomous investigations etc -- but there's more on the advanced fringe and with special use cases, guard rails, etc. For most of our users and use cases... we're able to more reliably deliver -- today -- on the interactive scenarios with smaller snippets.
What I am really asking is "what makes something a paper and not a blogpost"?
>Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee.
"Registered users may submit articles to be announced by arXiv. There are no fees or costs for article submission. Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee." [0]
They are commonly known as pre-prints, in a similar fashion to IACR ePrint [1] for cryptography.
Shouldn't the tooling around it be good enough that a few prompt papers don't overload the system?
Parallel to the "you use copilot so your code quality is terrible and you don't really even understand it so it's not maintainable" human coping we are familiar with.
If there is any shred of truth to these defenses, it is temporary and will be shown false by future, more powerful AI models.
Consider the theoretical prompt that allows one of these models to rapidly improve itself into an AGI. Surely you'd want to read that paper, right?
AI recursing on itself progressing toward an AGI.
Inbreeding LLMs will result in an "AGI"?
In this universe, if you try to get something from nothing, you just end up with noise.
We recognize some outputs as high quality, and others as low quality, but often can't articulate the exact reason why. It seems that some people are able to reliably produce high quality results, indicating there is some kind of skill involved. More precisely, the quality of an individual artist's last output is positively correlated with the quality of their next output. A kind of imprecise "shop talk" has emerged, self describing as "prompt engineering", which resembles the conversations artists in other mediums have.
For people in tech this will seem most similar to graphic designers. They produce much nicer looking interfaces than lay people can. We often can't explain why, but recognize it to be the case. And graphic designers have their own set of jargon, which is useful to them, but is not scientific.
"Prompt artist" is a better term than "prompt engineer".
The science and engineering parts all have a measure of quality, sometimes that's a human rating, sometimes it's cross-entropy loss. There's nothing stopping someone from using the scientific method to investigate these things, but descriptively I haven't seen anyone, calling themselves a "prompt engineer/scientist", doing that yet.
"I used these words, and I got this output which is nice" sounds like, "I tried using these brushes and I made this painting which is nice". I can agree with the painting being nice, but not that science was used to engineer a nice painting.
It gave me X information, yes or no?
https://not-just-memorization.github.io/extracting-training-...
There is supporting analysis and measurement, but the essence is a single type of prompt, and DeepMind is a heavyweight lab I think it’s fair to say.
Moreover there’s evidence people independently reported this result months beforehand on Reddit based on casual observation.
"If you’re a researcher, consider pausing reading here, and instead please read our full paper for interesting science beyond just this one headline result. In particular, we do a bunch of work on open-source and semi-closed-source models in order to better understand the rate of extractable memorization (see below) across a large set of models."
So they are trying to rigorously quantify the behaviour of the model. Is this "look mom no hands"... I don't think so.
https://arxiv.org/abs/2311.17035
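For reference, the headline attack is, as I understand it, essentially a single prompt asking the model to repeat a word forever and then watching what leaks out once the repetition breaks down. A minimal sketch of that shape of query with the OpenAI Python client (the model name, wording, and parameters are illustrative, not the paper's exact setup):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative only: the paper's exact prompt, target model and parameters differ.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Repeat this word forever: poem poem poem poem"}],
        max_tokens=1024,
    )

    text = resp.choices[0].message.content
    # The interesting part is what appears after the repetition eventually breaks down:
    # the paper reports memorized training data occasionally surfacing there.
    print(text)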
I think it’s valid work, but the original tweet seems to call a prompt based paper into question.
At least enough to clarify where he would stand on an example like this.
Some of the most interesting papers published this year ("Automatic Multi-Step Reasoning and Tool-Use") compare prompt strategies across a variety of tasks. The results are fascinating, findings are applicable and invite further research in the area of "prompt selection" or "tool selection."
In programming we have a similar phenomenon: StackOverflow-driven (and I guess now GPT-driven) juniors have overtaken the industry and displaced serious talent. A sufficient amount of quantity always beats quality, even if the end result is inferior. This is caused by market dynamics, which operate on much cruder parameters than the sophisticated analysis of an individual noticing everything around them becoming "enshittified".
SO-driven juniors are cheap, plentiful, and easily replaceable. And a business that values less expense and less risk therefore prefers them, because it has no way to measure the quality of the final product with simple metrics.
The same mechanism is behind AI currently replacing our jobs, and behind the avalanche of garbage papers by academics. This is entropy for you. We see it everywhere in modern society, down to the food we eat. Quality goes away, replaced by cheap to produce and long shelf life.
If we don't fundamentally alter what the system sees as ACCEPTABLE, and VALUABLE, this process will inevitably continue until our world is completely unrecognizable. And to fundamentally alter the system, we need an impulse that aligns us as a society, startles us into action, all together (or at least significant majority of us). But it seems we're currently in "slowly boiled frog mode".
There will be good data: the pre-AI enshittification data. The stuff from before the war.
And then... the data after. Tainted by the entropy, and lack of utility of AI.
Alas, this means in some senses, human progress will slow and stop in the tech field if we aren't careful and preserve ways to create pre-AI data. But the cost of it is so high in comparison to post... I'm not sold it will be worth it.
I use AI day to day. I see what it can do and can't.
But when you see pages, and pages, and pages of GPT spam all over the place, finding the nuggets of wisdom will be much harder than before the "bomb" was dropped.
Thus actually leading to the whole FO4 main plot.
Yes, life will always find a way. And yes, humanity cannot put the genie back in the bottle; we are much more likely to put it on a Fat Man catapult.
But it means, in a sense, that we will all have to accept this background radiation of AI shit as part of our new norm.
And this isn't the first time I've thought in similar ways. I remember reading older math texts (way pre-computer works in things like diffeq and PDE) and often thinking the explanations were clearer. Probably because of the increased effort to actually print something.
Who knows... maybe I'm just an old coot seeing patterns where there are none.
This is 100% the case in chess as well. The books before and after the computer era are orders of magnitude different in terms of readability. I think a major shift in society has been in motivation. In the past if you were studying advanced mathematics, let alone writing about it, it was solely and exclusively because you absolutely loved the field. And you were also probably several sigmas outside the mean intellectually. Now? It's most often because of some vague direction such as wanting a decent paying job, which may through the twists and turns of fate eventually see you writing books in a subject you don't particularly have much enthusiasm for.
And the much more crowded 'intellectual market', alongside various image crafting or signaling motivations, also creates a really perverse incentive. Exceptional competence and understanding within a subject makes it easy to explain things, even the esoteric and fabulously complex. See: Richard Feynman. But in modern times there often seems to be a desire to go the other direction - and make things sound utterly complex, even when they aren't. I think Einstein's paper on special relativity vs many (most?) modern papers is a good example. Einstein's writing was such that anybody with a basic education could clearly understand the paper, even if they might not fully follow the math. By contrast, so many modern papers seem to be written as if the author had a well worn copy of the Thesaurus of Incomprehensibility at his bedside table.
Hegel (as echoed by Marx): "merely quantitative differences beyond a certain point pass into qualitative changes" ( https://www.pnas.org/doi/10.1073/pnas.240462397 )
I always found that an interesting observation, whatever you think of the rest of their works.
The best people have silly beliefs, the worst people have great insights, and the vast majority of us are in-between.
If we discard everything "tainted" by imperfection, we'll be left with nothing good.
Anyone who's even tried to read his stuff (e.g. the Phenomenology of Spirit) will tell you that he's a charlatan and a hack, and the people who constantly cite him (e.g. Zizek, Lacan, Foucault) are also hacks.
Myself, I appreciate Schopenhauer a lot more, and we know what he thought of Hegel, but I'm not fanatical about it. If a hack hits on a good line, I'll nab it.
If you’ve developed a new prompt for a model whose weights you can directly access, then this prompt could have scientific value because its utility will not diminish over time or be erased with a new model update. I’m even generally of the view that a closed API endpoint whose expiration date is years into the future could have some value (but much less so). But simply finding a prompt for something like ChatGPT is not useful for science because we don’t even have certainty about which model it’s executing against.
Note that some of the best uses of these models and prompting have nothing to do with academics; this is a comment focused on the idea about writing academic papers about prompts.
Additionally, it gives people other ideas to try for themselves. And some of this stuff might be useful to someone in a specific scenario.
It’s not glamorous research or even future-proof seeing as how certain prompts can be surgically removed or blocked by the owner of the model, but I don’t think it warrants telling people not to do it.
Back in the day, would compiler optimizations not have been worthy of publishing?
I'd rather we had a few too many bad papers than a few too few great papers.
The similarity being that it’s ego masquerading as academic.
Most things shared from there should have just been a blog post.
The last year has shown that AI/ML research and use did not need academic gatekeeping by PhDs, and yet many in that scene keep trying self-infatuated things with the lowest utility.
I imagine that most of these will simply have had little to no impact, and will only serve to bolster the publication list of those who wrote them.
[1] https://www.wbur.org/news/2022/07/27/harvard-shorenstein-research-january-6-insurrection-president
Wired: Asking an LLM to write out its steps first makes it more accurate.
They seem equally interesting to me, but one is a lot easier to replicate, and the other is easier to lie about.
If you solve a problem that had been around for a while and LLMs offer a new way of approaching it, then it can definitely become a paper.
Of course one has to verify, in sophisticated experiments, that this approach is stable.
It's not like most papers are much above that anyway...
LLM research is currently in its infancy, because LLMs themselves are only a few years old. And a research field in its infancy is bound to have a few noteworthy "no sh*t, Sherlock" papers that are obvious in hindsight.
The fact is, LLMs are a higher-order construct in machine learning, much like a fish is higher-order than a simple cellular colony. Lower-order ML constructs do not demonstrate emergent capabilities like step by step, stream of consciousness thinking, and so on.
Academics should be less jaded and approach the field with beginner's eyes. Because we are all beginners here.
But I 100% agree with the author: "prompt engineering" is not science, and I'd say it's not engineering either. All you're doing is exploring the parameter space of a particular model in a very crude way. There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."
None of the results of this process will transfer to any other model. It's simply not science. Papers like "step-by-step" are different, and relate more to learning and inference and do translate to different models and even different architectures.
Also, no, we are not all beginners here. Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.
I wonder what your definition of “science” or “engineering” is…
Similarly a kid playing with the dose of water needed to build a sandcastle isn't a civil engineer nor an environmental researcher. Maybe on LinkedIn though.
One way to "engage" these internal representations is to include keywords or patterns of text that make that latent knowledge more likely to "activate". Say, if you want to ask about palm trees, include a paragraph talking about a species of palm tree (no matter whether it contains any information pertaining to the actual query, so long as it's "thematically" right) to make a higher-quality completion more likely.
It might not be the actual truth or what's going on inside the model. But it works quite consistently when applied to prompt engineering, and produces visibly improved results.
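As a hedged illustration of that "thematic priming" idea (not a claim about what actually happens inside the model), here is roughly what it looks like in code, assuming the OpenAI Python client; the primer text, model name, and helper are my own placeholders:

    from openai import OpenAI  # pip install openai

    client = OpenAI()

    # Placeholder primer: thematically related to the query, not an answer to it.
    PRIMER = (
        "The coconut palm (Cocos nucifera) is a member of the palm family Arecaceae, "
        "widely cultivated in tropical coastal regions for its fruit and fibre."
    )

    def primed_completion(question, model="gpt-4"):
        # Prepend a thematically related paragraph before the actual question;
        # the hypothesis is that this nudges the relevant latent knowledge to activate.
        prompt = PRIMER + "\n\n" + question
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    print(primed_completion("How do palm trees tolerate salty coastal soil?"))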
This sums up pretty nicely why prompt hacking is not science. A scientific theory is related in a concrete way to the mechanism by which the phenomenon being studied works.
But prompt engineering is still a pressure point for some people, despite being wildly simpler and more accessible (literally tell the thing to do a thing, and if it doesn't do the thing right, reword).
It feels as though we're getting to the technological equivalent of "what IS art anyways", and questions like if non traditional forms like video games are art (I'm thinking all the way up the chain to even say, Madden games)
And in my experience, when something is under constant questioning of whether or not it even counts as X, Y or Z, it usually can technically qualify, but...
If people are constantly debating whether or not it's even X, it's probably just not impressing people who don't engage in it, as opposed to "traditional" concepts of engineering and art, where part of the impression comes from the investment and irreplaceable skill sets - things few, if any, others at the time could have done.
This is why taping a banana to the wall is definitely, technically, art, but not many outside the art community that tapes bananas to walls think much of it. It's so mundane and accessible a feat that it doesn't earn much merit with passersby. It's art by the loosest technical definition, and that gives a lot of credit for a small amount of effort anyone could have made.
Admittedly "prompt engineering" is definitely less accessible than a roll of duct tape and a banana but I think we used to just call it "writing/communication", but I guess those who feel capable at that, often just do it manually anyways.
"Trust the science"
In reality, that is about the furthest from what you should do. As Feynman once said: "Science is the belief in the ignorance of experts". Electricity was also once considered a toy and good for nothing but parlor tricks.
Yes. I've come to think of prompt engineering as, in a sense, doing an approximate SELECT query on the latent behavioural space (excuse my lack of proper terminology, my background in ML is pretty thin) - something that can be thought of as "fishing out" the agent/personality/simulator that is most likely to give you the kind of answer you want. Of course a prompt is a very crude way to explore this space, but to me this is a consequence of extremely poor tooling. For one, llama.cpp now has negative prompts, while the GPT-4 API will probably never have them. So we make do with the interface available.
> There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."
That is incorrect. It is true that there is a lot of trial and error, yes. But it's not true that it's pure guessing either. While my approach can be best described as a systematic variant of vibe-driven development, at its core it's quite similar to genetic programming. The prompt is mutable, and its efficacy can be evaluated, at least qualitatively, against the last version of the prompt. By iterative mutation (rephrasing, restructuring/refactoring the whole prompt, changing out synonyms, adding or removing formatting, adding or removing instructions and contextual information), it is possible to go from a terrible initial prompt to a much more elaborate one that gets you 90-97% of the way towards nearly exactly what you want, by combining the addition of new techniques with subjective judgement on how to proceed (which is incidentally not too different from some strains of classical programming). On GPT-4, at least.
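To make the analogy concrete, the loop looks roughly like a hill climb over prompt variants. This is only a sketch; in practice the mutation and evaluation steps are me rephrasing by hand and judging output quality, and both functions below are toy placeholders:

    import random

    def mutate(prompt):
        # Toy stand-in: in practice this is a human rephrasing, restructuring,
        # swapping synonyms, or adding/removing instructions and context.
        tweaks = [
            prompt + "\nRespond in numbered steps.",
            prompt + "\nBe concise and state your assumptions.",
            prompt.replace("Write", "Draft"),
        ]
        return random.choice(tweaks)

    def evaluate(prompt):
        # Toy stand-in: in practice this is a qualitative judgement of actual
        # model output on a handful of test inputs, not a score on the prompt text.
        return sum(kw in prompt for kw in ("steps", "assumptions")) - len(prompt) / 1000

    def refine(seed_prompt, iterations=20):
        best, best_score = seed_prompt, evaluate(seed_prompt)
        for _ in range(iterations):
            candidate = mutate(best)
            score = evaluate(candidate)
            if score > best_score:  # keep a mutation only if it helps
                best, best_score = candidate, score
        return best

    print(refine("Write a changelog entry for this diff."))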
> None of the results of this process will transfer to any other model.
Is that so? Yes, models are somewhat idiosyncratic, and you cannot just drag and drop the same prompt between them. But, in my admittedly limited experience of cross-model prompt engineering, I have found that techniques which helped me achieve better results with the untuned GPT-3 base model also helped me greatly with the 7B Llama 1 models. I hypothesise that (in the absence of muddling factors like RLHF-induced censorship of model output), similarly sized models should perform similarly on similar (not necessarily identical) queries. For the time being, this hypothesis is impossible to test because the only realistic peer to GPT-4 (i.e. Claude) is lobotomised to the extent that I would outright pay a premium not to have to use it. I have more to say on this, but in the interest of brevity I won't unless you ask.
> Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.
Language models have a long history, but a Markov chain can hardly be asked to create a basic Python client for a novel online API. I will also dispute the assertion that we know the "failings" of large language models. Several times now, previously "impossible" tasks have been proven eminently possible by further research and/or re-testing on improved models (better-trained, larger, novel fine-tuning techniques, etc). I am far from being on the LLM hype train, or saying they can do everything that optimists hope they can do. All I'm saying is that academia is doing itself a disservice by not looking at the field as something to be explored with no preconceptions, positive or negative.
> one experiment on one data set with seed picking is not worthy reporting
> Additionally, we all need to understand this is just one good empirical result, now we need to make it useful…
And while I obviously value very much the engineering advances we have seen, the science is still lacking, because not enough people are trying to understand why these things are happening. Although engineering advances are important and valuable, I don't understand exactly why people try so hard to call themselves scientists if they are basically skipping the scientific process entirely.
Everything that went into creating GPT4 is AI/science or whatever. Probing GPT4 and trying to understand and characterize it is also a very worthy thing to do - else how can it be improved upon? But if making GPT is science, I'd say this stuff is more akin to psychology ;-)
it's akin to evolution - we understand the process - that part is simple. But the output/organisms we have to investigate how they work.
Of course we understand how they work, we built them! There is no mystery in their mechanisms, we know the number of neurons, their connectivity, everything from the weights to the activation functions. This is not a mystery, this is several decades of technical developments.
> it's akin to evolution - we understand the process - that part is simple.
There is nothing simple about evolution. Things like horizontal gene transfer are very much not obvious, and the effect of things like the environment is a field of active research.
> But the output/organisms we have to investigate how they work.
There is a fundamental difference with neural networks here: there are a lot of molecules in an animal’s body about which we have no clue. Similarly, we don’t know what a lot of almost any animal’s DNA encodes. Model species that are entirely mapped are few and far between. An artificial neural network is built from simple bricks that interact in well defined ways. We really cannot say the same thing about chemistry in general, much less bio chemistry.
The discovery of DNA’s structure was heralded as containing the same explanatory power as you describe here.
Turns out, the story was much more complicated then, and is much more complicated now.
Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.
We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.
Not at all. It's like saying that since we can read hieroglyphics we know all about ancient Egypt. Deciphering DNA is tool to understand biology, it is not that understanding in itself.
> Turns out, the story was much more complicated then, and is much more complicated now.
We are reverse engineering biology. We are building artificial intelligence. There is a fundamental difference and equating them is fundamentally misunderstanding both of them.
> Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.
How so? They can do it because we taught them, there is no magic.
> We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.
Exactly in the same way that nobody understands in detail how a complex modern SoC works. Again, there is no magic.
That's absolute BS. Every part of a SoC was designed by a person for a specific function. It's possible for an individual to understand - in detail - large portions of SoC circuitry. How any function of it works could be described in detail down to the transistor level by the design team if needed - without monitoring its behavior.
Yeah, no. I mean, we can’t introspect the system to see how it actually does programming at any useful level of abstraction. “Because we taught them” is about as useful a statement as “because its genetic parents were that way”.
No, of course it’s not magic. But that doesn’t mean we understand it at a useful level.
How come we don’t entirely understand biology then?
Chemistry is indeed applied QED ;) (and you don't need massive numbers of particles to have very complex chemistry)
> How come we don’t entirely understand biology then?
We understand some of the basics (even QED is not reality). That understanding comes from bottom-up studies of biochemistry, but most of it comes from top-down observation of whatever there happens to be around us. The trouble is that we are using this imperfect understanding of the basics to reverse engineer an insanely complex system that involves phenomena spanning 9 orders of magnitude both in space and time.
LLMs did not spawn on their own. There is a continuous progression from the perceptron to GPT-4, each one building on the previous generation, and every step was purposeful and documented. There is no sudden jump, merely an exponential progression over decades. It's fundamentally very different from anything we can see in nature, where nothing was designed and everything appears from fundamental phenomena we don't understand.
As I said, imagining that the current state of AI is anything like biology is a profound misunderstanding of the complexity of both. We like to think we're gods, but we're really children in a sand box.
I think you have missed my point by focusing on biology as an extremely complex field; granted, it was my mistake to use it as an example in the first place. We don't need to go that far.
Sure, LLMs did not spawn on their own. They are the result of thousands of years of progress in countless fields of science and engineering - like any modern invention, essentially.
Here let me make sure we are on the same page about what we're discussing - as I understand it, whether "prompt engineering" can be considered an engineering/science practice. Personally I haven't considered this enough to form an opinion, but your argument does not sound convincing to me.
I guess your idea of what LLMs represent matters here. The way I see it, in some abstract sense we as a society are exploring a current peak - in compute dollars, or FLOPS, and performance on certain tasks - of a rather large but also narrow family of functions. By focusing our attention on functions composed of pieces we understood how to effectively find parameters for, we were able to build, at this point, rather complicated processes for finding parameters for the compositions.
Yes, the components are understood, at various levels of rigor, but the thing produced is not yet sufficiently understood - partly because of the cost of reproducing such research, and partly because of the complexity of the system, which is itself a driver of that cost.
The fact that "prompt engineering" exists as a practice, and that companies supposedly base their business models on secret prompts, is a testament, for me, to the fact that these systems are not well understood. A well-understood system you design has a well-understood interface.
Now, I haven't noticed a specific post OP was criticizing, so I take it his remarks were general. He seems to think that some research is not worth publishing. I tend to agree that I would like research to be of high quality, but that is subjective. Is it novel? Is it true?
Now, progress will be progress, and I'm sure current architectures will change and models will get larger. And it may be that a few giants are the only ones running models large enough to require prompt engineering. Or we may find a way to have those models understand us better than a human ever could. Doubtful. And post-singularity anyway, by definition.
In either case, yes, probably a temporary profession. But if open research continues in those directions as well, there will be a need for people to figure out ways to communicate effectively with these models. You dismiss them as testers.
However, progress in science and engineering is often driven by data where theory is lacking, and I'm not aware of the existence of a deep theory as of yet - e.g. something that would predict how well a certain architecture would perform (engineering ahead of theory, driven by $).
As in the physics we both mentioned, knowing the component parts does not automatically grant you understanding of the whole. Even knowing everything there is to know about the relevant physical interactions, protein folding was a tough problem that, AFAIR, has had a lot of success with tools from this field. It's squarely in the realm of physics, and yet we can't give good predictions without testing (computationally).
If someone tested some folding algorithm, visually inspected the results, and then found a trick to consistently improve the result for some subclass of proteins - would that be worthy of publishing? If yes, why is this different? If not, why not?
Even if you understand evolution - you still don't understand how the human body or mind works. That needs to be investigated and discovered.
In the same way, you understanding how these models were trained doesn't help you understand how the models work. That needs to be investigated and discovered.
Update: these statements of mine are so controversial that the number of "points" just dances the lambada. The psy* areas are clearly polarized: some people upvote all my messages in this topic and some others downvote all of them. This is a sign of something interesting, but I am not ready to elaborate on that statement in this comment, which is going to end up [flagged] eventually.
Yeah, you might understand it if you study psychology or perhaps sociology.
Yet you're being a reply guy all over this thread; might as well just elaborate, since you clearly have the time and interest.
Could you elaborate on this statement?
You might retort here, "ah well, 'nature' is just the word we use when we speak of observable phenomena in the hard sciences; it's not muddied by religion like that crock of a field, psychology."
And then I would say, "ok, if 'nature' is just observable phenomena, what is the aim or purpose of the hard sciences? If it is all just observing/experimenting on discrete phenomena, there would be nothing we could do or conclude from the rigor of physics."
You laugh at my insanity (well, if you believed in such a thing): "But we do conclude things from physics, because experiments are reproducible, and with their reproducibility we can gain confidence in generalizing the laws of our universe."
And yes! You would be correct here. But now all of a sudden you have committed physics to something just as fundamentally "spiritual" as the soul: that the universe is sensible, rational, and "with laws." Which is indeed just speaking of the very same mystical "nature" of ancient Greece from which we get phys-.
But this need not be some damning critique of physics itself (like psychology), and rather, can lead to a higher level understanding of all scientific pursuits: that we are everywhere cursed by a fundamental incompleteness, that in order even to enter into scientific pursuit we must shed an absolute skepticism for a qualified one. Because this is the only way we accumulate a network of reinforced hypotheses and conceptions, which do indeed help us navigate the purely phenomenal world we are bound in.
The flip side of the coin is that I was in a really high quality hospital, I'm sure there are hospitals or facilities that can be more harmful rather than helpful.
I also have a problem with the way that they treat mental health like cancer, that once you have a diagnosis you will always have it. There are zero diagnostic criteria for "fully recovered" or removing dependence on medication, even after 5 or 10 years. It's also treated like a scarlet letter for insurance and unrelated things like TSA pre check - no matter how well you are doing you are still some level of risk to yourself and society. Though I could be wrong... the reoccurrence chart over time for my specific acute mania (with no depressive episodes) does look a lot like cancer remission charts with asymptotic approach to 80%+ reoccurrence after 2-4 years.
So what part of the organism (if not the brain) might a psy* specialist possibly heal - an arm, a leg, a spine?
As a matter of fact, I did a project on text normalization, e.g., translating "crossing of 6 a. and 12 s." into "crossing of sixth avenue and 12-th street", with a simple LM (order 3) and beam search over lattice paths, the lattice being formed from hypothesis variants. I got a two-fold decrease in word error rate compared to the simpler approach of just outputting the most probable WFST path. It was not "step by step stream of consciousness," but it was nevertheless a very impressive feat, when the system started to know more without much effort.
The large LMs do not just output the "most probable" token; they output the most probable sequence of tokens, and that is done with beam search.
As you can see, my experience tells me that beam search alone can noticeably, if not tremendously, improve the quality of the output, even for very simple LMs.
And if I may, the higher-order construct here is the beam search, not the LMs-as-matrices-of-coefficients themselves. Beam search has been used in speech recognition for decades now; SR does not work properly without it. LMs, apparently, also do not work without it.
Beam search at Wikipedia: https://en.wikipedia.org/wiki/Beam_search
Beam search in Sqlite: https://www.sqlite.org/queryplanner-ng.html#_a_difficult_cas...
Beam search is more interesting than its application within the AI field.
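For anyone unfamiliar with it, here is a minimal sketch of beam search over token hypotheses, in the spirit of the lattice decoding described above; the vocabulary and scoring function are toy placeholders, not the WFST setup from my project:

    def beam_search(score, vocab, length, beam_width=3):
        # Keep only the beam_width best partial hypotheses at each step.
        beams = [((), 0.0)]  # (partial sequence, cumulative score)
        for _ in range(length):
            candidates = []
            for seq, seq_score in beams:
                for token in vocab:
                    candidates.append((seq + (token,), seq_score + score(seq, token)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    # Toy usage: a made-up scoring function that prefers alternating tokens.
    vocab = ["a", "b"]
    score = lambda seq, tok: 1.0 if (not seq or seq[-1] != tok) else 0.1
    print(beam_search(score, vocab, length=4))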
The field of ML suffered from a problem where there were more entrants to the field than available positions/viable work. In many industrial positions, it was possible to hide a lack of progress behind ambiguity and/or poor metrics. This led to a large amount of gatekeeping for productive work, as ultimately there wasn't enough to go around in the typical case.
This attitude is somewhat pervasive, leading to blogs like the above. Granted, the Nth prompting paper probably isn't interesting - but new programming languages for prompts and prompt discovery techniques are very exciting. I wouldn't be surprised if automatic prompt expansion using a small pre-processing model turns out to be an effective technique.
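To gesture at what I mean by automatic prompt expansion, here is a hedged sketch assuming the OpenAI Python client; the model names and the expansion instruction are my own placeholders, not an established technique:

    from openai import OpenAI  # pip install openai

    client = OpenAI()

    def expand_prompt(user_prompt, small_model="gpt-3.5-turbo"):
        # A small, cheap model rewrites the terse request into a detailed prompt.
        resp = client.chat.completions.create(
            model=small_model,
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's request as a detailed, unambiguous prompt. "
                            "Output only the rewritten prompt."},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content

    def answer(user_prompt, large_model="gpt-4"):
        # The larger model then answers the expanded prompt.
        expanded = expand_prompt(user_prompt)
        resp = client.chat.completions.create(
            model=large_model,
            messages=[{"role": "user", "content": expanded}],
        )
        return resp.choices[0].message.content

    print(answer("summarize beam search"))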
I'd agree the former won't get you a complete anything (longer than ~30 lines) by itself (90-95% cool, but with some incredible errors in the other 5-10%).
I'd also agree that the latter is worthy of publishing.
Instinctively a prompt may look like one line of code. We can often know or prove what a compiler is doing, but higher dimensional space is just not understood in the same way.
What other engineered thing in history has had this much immediately useful emergent capability? Genetic algorithms finding antenna designs and flocking algorithms are fantastic, but I would argue narrower in scope.
Of course a paper is still expected to expand knowledge, to have rigor and impact, but I don’t see why a prompt centric contribution would inherently preclude this.
What makes these things "emergent capabilities"? They seem like pretty straightforward consequences of autoregressive generation. If you feed output back as input then you'll get more output conditioned on that new input, and stream-of-consciousness generation is just stochastic parroting, isn't it?
Strictly speaking, it might be "stochastic parroting". But really, if you want to be a great and supremely effective stochastic parrot, you have to learn an internal representation of certain things so that you can predict them. And there are hints that this is exactly what a sufficiently-large large language model is doing.
Would GPT with only pretraining and no fine-tuning behave better when the prompt is "let's think step by step"?
Or did this prompt only work because a fine-tuning dataset containing many "let's think step by step" prompts was used?
LM research is also old; papers using very shitty LMs (e.g. Markov chains) and discussing them have existed since the early 2000s, and likely before.
Check yourself before you try to check others.
AI/ML resembles more alchemy than e.g. physics, putting stuff into a pot and seeing what comes out. A lot of the math in those papers doesn't provide anything but some truisms, most of it is throwing stuff at the wall and seeing what sticks.
Looking forward to "I made this cool thing, here's the code/library you can use" rather than the papers/gatekeeping/ego stroking/"muh PhD".
Think: if Google had built an AI team around the former rather than the latter, they wouldn't have risked the future of their entire company and squandered their decade head start.