Sorry, but a new prompt for GPT-4 is not a paper
279 points | 1 year ago | 38 comments | twitter.com | HN
H8crilA
1 year ago
[-]
If you do enough measurements on that new prompt then I don't see why this shouldn't be a paper. People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.

Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.
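
To make "enough measurements" concrete, here is a minimal, purely illustrative sketch of the kind of harness I mean. run_prompt is a placeholder for a real model call, and the two-item dataset is a toy:

  import random
  import statistics

  def run_prompt(prompt: str, question: str) -> str:
      # Placeholder for an actual model call (e.g. an API request).
      # It guesses randomly so the sketch runs end to end.
      return random.choice(["yes", "no"])

  def evaluate(prompt: str, dataset, runs: int = 5):
      """Return mean accuracy and standard deviation over repeated runs."""
      scores = []
      for _ in range(runs):
          correct = sum(run_prompt(prompt, q) == a for q, a in dataset)
          scores.append(correct / len(dataset))
      return statistics.mean(scores), statistics.stdev(scores)

  dataset = [("Is 7 prime?", "yes"), ("Is 9 prime?", "no")]
  mean_acc, spread = evaluate("Answer yes or no.", dataset)
  print(f"accuracy {mean_acc:.2f} +/- {spread:.2f}")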

reply
starbugs
1 year ago
[-]
> Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.

Drug trials may be expected to be somewhat reproducible.

What I don't get is how it can even be called research if it cannot be expected to be reproducible at all!

GPT is a closed source/weights, proprietary product that changes every couple of weeks or so. How can you expect a prompt to behave the same for a reasonable length of time, long enough for the research to be even rudimentarily reproducible? And if it's not reproducible, what is it actually worth? I don't think much. It could just as well have been a fault in the research setup, or a fake.

reply
IanCal
1 year ago
[-]
> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.

Do you have any evidence that the weights for versioned models are being changed without notifications?

reply
starbugs
1 year ago
[-]
> Do you have any evidence that the weights for versioned models are being changed without notifications?

I think that in a real scientific process, the burden is on those who claim the weights are not being changed to provide the evidence.

reply
IanCal
1 year ago
[-]
I'm sorry, but that's entirely ridiculous. You're mangling the concept of burden of proof here.

You can easily see this because it can be flipped around easily - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?

As an aside: even if the model weights did change, that wouldn't stop research from being possible. Otherwise no drug trial could be replicated, because you couldn't get the exact same participants at the exact same age.

reply
starbugs
1 year ago
[-]
> You can easily see this because it can be flipped around easily - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?

Wait a minute. The author of such a paper makes a claim about an observation that rests on the assumption that the studied model is defined in a particular way. I am disputing that claim, since no evidence has been shown that it is so defined - indeed, no definition has been given at all.

If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence. That's not called science. That's called authority.

reply
IanCal
1 year ago
[-]
> I am disputing that claim

You are entirely within your rights to say that the authors have assumed that openai is not lying about their models. They've probably also assumed that other paper authors are not lying in their papers.

You then say however:

> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.

And when I ask for evidence of this very specific claim, you turn around and say the burden is on me to show that you're lying. That is what is butchering the concept of burden of proof.

> If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence.

Absolutely not.

reply
starbugs
1 year ago
[-]
Look, the burden of proof in a scientific paper is on the authors. Not on me.

A company with a proprietary product saying something is not acceptable evidence in a scientific context. No need to allege that anyone is lying. Lying is irrelevant. What's relevant is whether the research is falsifiable. It cannot be falsifiable if you don't know what the actual model is at a given point in time.

reply
dartos
1 year ago
[-]
Apples and oranges comparison.

You couldn’t get the same participants, but you could get the same drugs. If you could get identical participants, that wouldn’t be very helpful since humans are so varied.

But for GPT based papers, what you’re actually testing could change without you knowing. There’s no way to know if a paper is reproducible at all.

If you can’t reproduce results, is it really research, or just show and tell?

reply
IanCal
1 year ago
[-]
> If you can’t reproduce results, is it really research, or just show and tell?

You can't start with the premise that clinical trials aren't perfectly reproducible and that that's fine, and then say this.

> what you’re actually testing could change without you knowing

If people are lying about an extremely important part of their product, which they have little reason to do - sure. But then this applies to pretty much everything. Starting with the assumption that people are lying about everything and nothing is as it seems may technically make things more reproducible, but it's going to require unbelievable effort for very little return.

> There’s no way to know if a paper is reproducible at all.

This is a little silly because these models are available extremely easily and at pay-as-you-go pricing. And again, it requires an assumption that openai is lying about a specific feature of a product.

reply
starbugs
1 year ago
[-]
> You can't start with the premise that clinical trials aren't perfectly reproducible and that that's fine, and then say this.

Nobody said that to begin with. Re-read their comment.

> If people are lying about an extremely important part of their product [...]

Nobody is alleging that anyone is lying. It's just that we cannot be sure what the research actually refers to, because of the nature of a proprietary/closed model.

> This is a little silly because these models are available extremely easily and at pay-as-you-go pricing.

What does this have to do with the parent comment? I don't think it's appropriate to call anyone here silly, just because you don't like their comment and don't have good counter arguments.

reply
IanCal
1 year ago
[-]
> Nobody is alleging that anyone is lying.

Let's be clear, you have made an explicit claim that openai are lying.

> What does this have to do with the parent comment?

Because many other fields would kill for this level of reproducibility: grab an API key, spend a few quid running a script, and you can get the results yourself.

reply
mewpmewp2
1 year ago
[-]
With the API you can choose versions fixed to a date. Are you suggesting that OpenAI is lying about these being fixed to a date?

Why would they lie about it?

The whole point of these versions is so that when you build on top of that it would keep working as you expect.
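
For what it's worth, pinning a snapshot is a one-line change. A rough sketch, assuming the openai Python package and an OPENAI_API_KEY in the environment ("gpt-4-0613" is one example of a dated snapshot name):

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  response = client.chat.completions.create(
      model="gpt-4-0613",  # dated snapshot, not the floating "gpt-4" alias
      temperature=0,       # reduce sampling variance between runs
      seed=0,              # best-effort determinism, not a hard guarantee
      messages=[{"role": "user", "content": "Let's think step by step: is 91 prime?"}],
  )
  print(response.choices[0].message.content)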

reply
hnfong
1 year ago
[-]
I'm not saying they lie about it, but as a hypothetical there could be many reasons to lie.

- realizing their model leaks confidential information against malicious prompts

- copyright claims against them forcing them to remove bits of data from the training set

- serious "alignment" bugs that need to be fixed

- vastly improved optimization techniques that slightly affect results in 0.1% of the cases

If updating the model would save the company a couple hundred million dollars, they might want to do it. And in some of the cases, I can imagine they have an incentive to keep the update low key.

reply
sebzim4500
1 year ago
[-]
If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.
reply
starbugs
1 year ago
[-]
> If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.

That's why you capture multiple of them and verify your data statistically?

reply
sebzim4500
1 year ago
[-]
And ideally, if someone is proposing new prompting techniques, they should test them across both the most capable models (which are unfortunately proprietary) and the best open models.

The problem is that what works on small LLMs does not necessarily scale to larger ones. See page 35 of [1] for example. A researcher only using the models of a few years ago (where the open models had <1B parameters) could come to a completely incorrect conclusion: that language models are incapable of generalising facts learned in one language to another.

[1] https://arxiv.org/pdf/2308.03296.pdf

reply
doctorpangloss
1 year ago
[-]
While this is very interesting, there are enough differences between astronomy and whatever papers this Twitter user is talking about that it's not the insight porn you think it is.

The Twitter user doesn't even reference a single specific paper, kind of doing some hand wavy broad generalizations of his worst antagonists. So who really knows what he's talking about? I can't say.

If he means papers like the ones in this search - https://arxiv.org/search/?query=step+by+step+gpt4&searchtype... - they're all kind of interesting, especially https://arxiv.org/abs/2308.06834 which is the kind of "new prompt" class he's directly attacking. It is interesting because it was written by some doctors, and it's about medicine, so it has some interdisciplinary stuff that's more interesting than the computer science stuff. So I don't even agree with the premise of what the Twitter complainer is maybe complaining about, because he doesn't name a specific paper.

Anyway, to your original point, if we're comparing the research I linked and astronomy... well, they're completely different; it is totally intellectually dishonest to compare the two. Like, tell me how I use astronomy research later in product development or whatever? Maybe in building telescopes? How does observing the supernova suggest new telescopes to build in the future, without suggesting that indeed, I will be reproducing the results, because I am building a new telescope to observe another such supernova? Astronomy cares very deeply about reproducibility, a different kind of reproducibility than these papers, but maybe more the same in interesting ways than the non-difference you're talking about. I'm not an astronomer, but if you want to play the insight porn game, I'd give these people the benefit of the doubt.

reply
kjkjadksj
1 year ago
[-]
But you know the parameters of your telescope, at least. If openai wants to update all the time, fine, but then they should work like every other piece of research software, where you can list the exact version of the software you used and pull that version yourself if need be.
reply
IanCal
1 year ago
[-]
Stability is the purpose of the versioned models.
reply
arter
1 year ago
[-]
Not always, but we can reproduce your findings in the future - credit to gravitational lensing causing some light paths to take years longer to reach us.
reply
awestroke
1 year ago
[-]
You can select a static snapshot that presumably does not change, if you use the API
reply
starbugs
1 year ago
[-]
> You can select a static snapshot that presumably does not change, if you use the API

Sorry, I won't blindly believe a company that is cynical enough to call itself "OpenAI" and then publish a commercial closed source/weights model for profit.

Evidence that they do not change without notice, or it didn't happen. Better yet, provide the source and weights for research purposes. These models could be pulled at any instant if the company sees fit or ceases to exist.

reply
cqqxo4zV46cp
1 year ago
[-]
Yeah, here it comes. In these conversations you don't need to ask very many "why"s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.
reply
starbugs
1 year ago
[-]
> Yeah, here it comes. In these conversations you don't need to ask very many "why"s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.

Seems to have hit hard?

I would find it borderline acceptable to be offended by a user whose name has obviously been generated using a password generator if you could at least provide some substance to the discussion. Just labeling someone and questioning their competence based on your hurt feelings is a bit low. Please improve.

reply
nradov
1 year ago
[-]
Is that a contractual guarantee, or more of a "trust us" kind of thing?
reply
aleph_minus_one
1 year ago
[-]
> People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.

I think this depends a lot on the "culture" of the subject area. For example, in mathematics it is common that only new results that have been thoroughly worked through are considered publish-worthy.

reply
brookst
1 year ago
[-]
Wouldn’t the “thoroughly worked through” part be analogous to extensive measurements of a prompt?
reply
aleph_minus_one
1 year ago
[-]
Let me put it this way: you can expect that a typical good math paper means working on the problem for, I would say, half a year (often much longer). I have a feeling that most papers that involve extensive measurements of prompts do not involve 1/2 to 1 year of careful

- hypothesis building

- experimental design

- doing experiments

- analyzing the experimental results

- doing new experiments

- analyzing in which sense the collected data support the hypothesis or not

- ...

work.

reply
VoodooJuJu
1 year ago
[-]
There's a great lesson here for marketers: the prospect can be convinced with the simple presence of graphs and data and measurements.

Even just the mere presence of data and data visuals is enough to legitimize what you're selling in the eyes of the prospect. When the prevailing religion is Scientism, data bestows that blessing of authority and legitimacy upon whatever it is you're trying to sell. Show and tell whatever conclusions you'd like from the data - the soundness of the logic supporting that conclusion is irrelevant. All that matters is you did the ritual of measuring and data-gathering and graph-ifying and putting it on display for the prospect.

There's a great book, How to Lie with Statistics, that covers this particular case, but demonstrates other popular ways in which data and data visuals are manipulated to sell things.

reply
kridsdale1
1 year ago
[-]
Having worked at the famously data-driven Meta and Google, I can say this is 100% accurate.

You can turbo-boost your career by mastering the art of the "data ritual". It doesn't matter what the results are, the magnitude of the impact, or what it cost to build and launch something. Show your results in a pretty way that looks like you did your diligence and you will be celebrated.

reply
oasisbob
1 year ago
[-]
> and the entire value of the drug is in the measurements.

I'm not sure this is true.

While modern Adderall has a closely controlled mixture of multiple enantiomers, it hasn't always been this way.

Medicine historically didn't care nearly as much about racemic mixtures, and the possibility of stereo toxicity (eg Thalidomide).

Many drugs in modern human history, including mixed amphetamine salts, have been marketed with very little concern for racemic purity.

reply
nathanfig
1 year ago
[-]
Agreed. People publish papers on algorithms all the time; imagine saying "Sorry, but new C++ code is not a paper". There is a ton of space to be explored wrt prompts.

If you do the rigor on why something really is interesting, publish it.

reply
wongarsu
1 year ago
[-]
I feel this has nothing at all to do with LLMs and more to do with academic incentives in general. Focusing on quality over quantity won't advance your career. Publishing lots of new papers will, as long as they meet the minimum threshold to be accepted into whatever journal or conference you are aiming for. Having one good paper won't increase your h-index; three mediocre papers might.

Doubly so when there's a new breakthrough, where one of your low-effort papers might end up being the first to say something obvious that turns out to be really important. Because then everyone will end up citing your paper in perpetuity.

reply
zitterbewegung
1 year ago
[-]
Being dismissive about this tweet or agreeing with the author is one thing. But everyone should be aware that the absolute minimum bar for a scientific paper can be much lower than a new prompt for GPT-4.
reply
mensetmanusman
1 year ago
[-]
It is a _paper_, but it's not science, since GTP-4 is closed source and thus not reproducible in a lab.

If OpenAI disappears tomorrow, papers on GTP-4 will likely be of little to no value, which is another tell of a non-scientific exploration.

(note: not all explorations are scientific, and that is great! Science is just one of many tools for exploring lived reality.)

reply
MattRix
1 year ago
[-]
That’s like saying a biologist studying an endangered species isn’t doing science because the animal could disappear tomorrow. The permanence of a subject has no bearing on whether it is science or not.

The idea that science has to happen in a lab is of course absurd as well.

reply
nicce
1 year ago
[-]
> That’s like saying a biologist studying an endangered species isn’t doing science because the animal could disappear tomorrow. The permanence of a subject has no bearing on whether it is science or not. The idea that science has to happen in a lab is of course absurd as well.

The main point here is that anyone can likely start studying those endangered species and try to reproduce the results, while with GPT-4 it is not possible at all. The lab point is related to the fact that we are talking about software here.

reply
roguas
1 year ago
[-]
What's not possible? Do they ban people from exploring the SOTA model that they offer?
reply
nicce
1 year ago
[-]
That is not reproducible research. To reproduce the research, you need the training data, the source code and all the parameters.

With black-box generated models, there is no way to tell how they have actually been generated. That has no value for science in terms of how to improve them further.

reply
dncornholio
1 year ago
[-]
I compare a paper on a GPT-4 prompt to a tutorial on how to use Photoshop. It's not science IMO.
reply
mmcwilliams
1 year ago
[-]
In the case of an endangered species, a biologist would still have access to take samples from it and inspect it. Science doesn't have to happen in a lab, but it's questionable to call something science when it involves hitting a black-box endpoint whose underlying models and behaviors can change at a whim.
reply
EForEndeavour
1 year ago
[-]
What is science? Can you not apply it to artifacts whose inner workings are hidden?

As an aside, I was surprised at the repeated misspelling of GPT-4 and take it as a heuristic that this comment was likely written by a real human :)

reply
mensetmanusman
1 year ago
[-]
Aha, didn't know Guanosine-5'-triphosphate (GTP) (a purine nucleoside triphosphate) was part of my ios dictionary, good catch!
reply
Der_Einzige
1 year ago
[-]
While I think the twitter post author is being a bit of an ass, they’re sort of right about the overvaluing we’ve put on simply better prompts. I wrote an opinionated GitHub gist about this exact issue:

https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

I do the whole NLP publishing thing and I've hesitated to "write a paper" about applying techniques already known and used everywhere in the stable diffusion community to NLP models. That said, the AI community loves to pretend like they discovered something, such as a recent paper purporting to be the first to do "concept slider" LoRAs, despite these having existed on Civit.ai for months before that work was published. The authors of course didn't cite those already existing models.

Everyone chasing citations and clout hard right now because these professors and researchers realize that they only have 5-10 years before AI eats their jobs and most other white collar jobs. I don’t blame them. I want my mortgage paid off before I’m automated away!

reply
glitchc
1 year ago
[-]
The current scientific research apparatus is more about being first than about being correct or thorough. A paper that gets out early means more citations, and many of the faculty sit on the editorial boards, and are able to suggest/enforce specific citations during the review process. Academics aren't fully to blame for this, it's just how the incentives are set up in the system. Tenure and promotions are increasingly based on h-index; a measure of impact based largely on the number of citations.
reply
rollcat
1 year ago
[-]
It's hard to estimate the impact of an idea in the same way that you can estimate the impact of an investment (stock is a number that goes up or down). You're right that the current incentive system might be to blame, but a simplistic metric will be gamed just as easily - what would you propose?
reply
glitchc
1 year ago
[-]
Honestly, I don't think any metric can fix it, and don't have any easy solutions to this. The problem is larger than academia, more endemic to society. The root cause is society's values have changed. Previously, prestige mattered for something. Now, people would rather listen to pop stars than learned individuals, and wealth is the only metric that matters.

As a result, typical professions that used to confer prestige, and for which prestige was supposed to be just reward, such as a professor, a medical doctor, a judge, are now mainly pursued for pecuniary reasons (money, job security). And because they're not doing it for prestige, they don't necessarily care about being right/correct. Playing the game to maximize the revenue streams is paramount. I happen to know a number of faculty who are quite proud of their multiple revenue streams. This would be unthinkable for an academic 50 years ago.

reply
lmeyerov
1 year ago
[-]
To bring some data to a sour grapes fight: https://paperswithcode.com/sota/code-generation-on-humaneval

For code generation, GPT4 is getting beat by the small prompt library LATS wrapped around GPT3.5. Given the recent release of MagicCoder / Instruct-OSS, that means a small prompt library + a small 7B model you can self-host beats the much fancier GPT4.

Similar to when simple NNs destroyed a decade of Bayesian modeling theses & research programs, it's frustrating for folks going other paths. But it doesn't make the work 'wrong'.

reply
harlanlewis
1 year ago
[-]
> GPT4 is getting beat by the small prompt library LATS wrapped around GPT3.5

The link you shared doesn’t quite reflect this. Omitting other models…

  LATS (gpt-4): 94.4
  Reflexion (gpt-4): 91.0
  gpt-4: 86.6
  …
  LATS (gpt-3.5): 83.8
  …
  zero-shot (gpt-4): 67.0
  zero-shot (gpt-3.5): 48.1

I’m not quite sure how to translate leaderboards like these into actual utility, but it certainly feels like “good enough” is only going to get more accessible and I agree with what I think is your broader point - more sophisticated techniques will make small, affordable, self-hostable models viable in their own right.

I’m optimistic we’re on a path where further improvement isn’t totally dependent on just throwing money at more parameters.

reply
lmeyerov
1 year ago
[-]
Ah you're right, LATS GPT3.5 is 84 while standalone GPT4 is 87

Given standalone GPT3.5 is "just" 48.. it's less about beating and more about meeting

RE:Good Enough & Feel... very much agreed. I find it very task dependent!

For example, GPT4 is 'good enough' that developers are comfortable copy-pasting & trying, even vs stack overflow results. We haven't seen LATS+MagicCoder yet, but as MagicCoder 7b already meets+exceeds GPT3.5 for HumanEval, there's a plausible hope for agent-aided GPT4-grade tools being always-on for all coding tasks, and sooner vs later. We made that bet for Louie.AI's interactive analyst interface, and as each month passes, evidence mounts. We can go surprisingly far with GPT3.5 before wanting to switch to GPT4 for this kind of interaction scenario.

Conversely... I've yet to see a true long-running autonomous coding autoGPT where the error rate doesn't kill it. We're experimenting with design partners on directions here -- think autonomous investigations etc -- but there's more on the advanced fringe and with special use cases, guard rails, etc. For most of our users and use cases... we're able to more reliably deliver -- today -- on the interactive scenarios with smaller snippets.

reply
spaceywilly
1 year ago
[-]
This right here. I feel like the focus on just throwing more GPU at the problem is a mistake many of these companies are making at the moment. The real breakthroughs will come when we figure out how to use the current models and compute power more efficiently. If it’s prompt engineering that leads to this breakthrough, so be it.
reply
siva7
1 year ago
[-]
reply
jatins
1 year ago
[-]
Can a person just go upload anything on arxiv or is there a review process around these things?

What I am really asking is "what makes something a paper and not a blogpost"?

reply
forgotpwd16
1 year ago
[-]
ArXiv is kinda reputation-based. That is, to submit something you need to be endorsed, which happens either automatically (based on institution) or by asking established authors. After being endorsed for a subject area, you can submit freely to it, keeping in mind:

>Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee.

reply
MzxgckZtNqX5i
1 year ago
[-]
"Articles" on arXiv are not peer-reviewed, they just check whether it looks like it belongs to one of the categories they hosts:

"Registered users may submit articles to be announced by arXiv. There are no fees or costs for article submission. Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee." [0]

They are commonly known as pre-prints, in a similar fashion to IACR ePrint [1] for cryptography.

[0]: https://info.arxiv.org/about/index.html

[1]: https://eprint.iacr.org/

reply
Kelkonosemmel
1 year ago
[-]
How to add prompt knowledge into research? By having papers about it.

Shouldn't the tooling around it be good enough that a few prompt papers don't overload the system?

reply
elif
1 year ago
[-]
Nah, this is just an early example of many "this is too easy it doesn't count" defensive human arguments against AI.

Parallel to the "you use copilot so your code quality is terrible and you don't really even understand it so it's not maintainable" human coping we are familiar with.

If there is any shred of truth to these defenses, it is temporary and will be shown false by future, more powerful AI models.

Consider the theoretical prompt that allows one of these models to rapidly improve itself into an AGI. Surely you'd want to read that paper right?

reply
daveguy
1 year ago
[-]
No prompt will cause an LLM to rapidly improve itself, much less into an AGI. Prompts don't cause permanent change in the LLM, only differences in output.
reply
elif
1 year ago
[-]
You're talking about how GPT functions in 2023. I am discussing the point at which LLM outputs become valuable LLM modifications.

AI recursing on itself progressing toward an AGI.

reply
daveguy
1 year ago
[-]
No matter how many times you feed the output of an LLM back to itself, the underlying model does not change. Online training (of the actual model weights, not just fine-tuning) would be hugely resource-intensive and is not guaranteed to do any better than the initial training. Interference will happen, whether catastrophic forgetting or simple drift. We can fantasize about future architectures all day long, but that doesn't make them capable of AGI or even give us a path forward.
reply
Intralexical
1 year ago
[-]
> AI recursing on itself progressing toward an AGI.

Inbreeding LLMs will result in an "AGI"?

In this universe, if you try to get something from nothing, you just end up with noise.

reply
jdefr89
1 year ago
[-]
I'd much rather see a PoC…
reply
alphazard
1 year ago
[-]
Developing prompts for these models isn't a science yet. It does seem to meet most of the criteria for an art though.

We recognize some outputs as high quality, and others as low quality, but often can't articulate the exact reason why. It seems that some people are able to reliably produce high quality results, indicating there is some kind of skill involved. More precisely, the quality of an individual artist's last output is positively correlated with the quality of their next output. A kind of imprecise "shop talk" has emerged, self describing as "prompt engineering", which resembles the conversations artists in other mediums have.

For people in tech this will seem most similar to graphic designers. They produce much nicer looking interfaces than lay people can. We often can't explain why, but recognize it to be the case. And graphic designers have their own set of jargon, which is useful to them, but is not scientific.

"Prompt artist" is a better term than "prompt engineer".

reply
DonHopkins
1 year ago
[-]
In the same sense that "Bullshit artist" is a better term than "Bullshit engineer".
reply
rileymat2
1 year ago
[-]
Why isn’t it science? Surely people can use the scientific method in investigating?
reply
alphazard
1 year ago
[-]
For starters we don't have a way to measure quality objectively, and this is the case for art in general. If you were to develop an objective measure of beauty for example, visual art as a discipline would quickly turn into a science. At some level we know that's possible, we're all just brains in jars. But AFAIK we aren't doing science there yet.

The science and engineering parts all have a measure of quality, sometimes that's a human rating, sometimes it's cross-entropy loss. There's nothing stopping someone from using the scientific method to investigate these things, but descriptively I haven't seen anyone, calling themselves a "prompt engineer/scientist", doing that yet.

"I used these words, and I got this output which is nice" sounds like, "I tried using these brushes and I made this painting which is nice". I can agree with the painting being nice, but not that science was used to engineer a nice painting.

reply
rileymat2
1 year ago
[-]
But with prompt escapes and violations of the service expectations, don't you have an objective criterion?

It gave me X information, yes or no?

reply
WhitneyLand
1 year ago
[-]
Should this be a paper?

https://not-just-memorization.github.io/extracting-training-...

There is supporting analysis and measurement, but the essence is a single type of prompt, and DeepMind is a heavyweight lab I think it’s fair to say.

Moreover there’s evidence people independently reported this result months beforehand on Reddit based on casual observation.

reply
sgt101
1 year ago
[-]
You will notice that they say in the blog post:

"If you’re a researcher, consider pausing reading here, and instead please read our full paper for interesting science beyond just this one headline result. In particular, we do a bunch of work on open-source and semi-closed-source models in order to better understand the rate of extractable memorization (see below) across a large set of models."

So they are trying to rigorously quantify the behaviour of the model. Is this "look mom no hands"... I don't think so.

reply
WhitneyLand
1 year ago
[-]
Sorry for any confusion, my comment was meant to refer to the linked paper:

https://arxiv.org/abs/2311.17035

I think it’s valid work, but the original tweet seems to call a prompt based paper into question.

At least enough to clarify where he would stand on an example like this.

reply
grepLeigh
1 year ago
[-]
Studying the way LLMs respond to different prompts (or different ways of fine-tuning for a set of prompts) is valuable science.

Some of the most interesting papers published this year ("Automatic Multi-Step Reasoning and Tool-Use") compare prompt strategies across a variety of tasks. The results are fascinating, the findings are applicable, and they invite further research in the area of "prompt selection" or "tool selection."

reply
3cats-in-a-coat
1 year ago
[-]
Attacking the participants in a systemic shift is 100% useless as it doesn't target the culprit.

In programming we have a similar phenomenon: StackOverflow-driven (and I guess now GPT-driven) juniors have overtaken the industry and displaced serious talent. Sufficient quantity always beats quality, even if the end result is inferior. This is caused by market dynamics, which operate on much cruder parameters than the sophisticated analysis of an individual who notices everything around them becoming "enshittified".

SO-driven juniors are cheap, plentiful, and easily replaceable. And a business that values less expense and less risk therefore prefers them, because it has no way to measure the quality of the final product with simple metrics.

The same mechanism is currently driving AI replacing our jobs, and the avalanche of garbage papers by academics. This is entropy for you. We see it everywhere in modern society, down to the food we eat. Quality goes away, replaced by whatever is cheap to produce and has a long shelf life.

If we don't fundamentally alter what the system sees as ACCEPTABLE, and VALUABLE, this process will inevitably continue until our world is completely unrecognizable. And to fundamentally alter the system, we need an impulse that aligns us as a society, startles us into action, all together (or at least significant majority of us). But it seems we're currently in "slowly boiled frog mode".

reply
ilc
1 year ago
[-]
This whole thing reminds me a bit of playing Fallout 4.

There will be good data, the pre-AI enshitification data. The stuff from before the war.

And then... the data after. Tainted by the entropy, and lack of utility of AI.

Alas, this means in some senses, human progress will slow and stop in the tech field if we aren't careful and preserve ways to create pre-AI data. But the cost of it is so high in comparison to post... I'm not sold it will be worth it.

reply
jhbadger
1 year ago
[-]
This paints a rosy picture of human-generated data now. It's not as if most human data is reliable. Even among peer-reviewed scientific literature, most of it is crap and it takes effort to find the good stuff. Also, your analogy kind of misses the point of the Fallout games. The pre-war world was awful, filled with evil corporations like Vault-Tec and Nuka-Cola that murdered and poisoned their customers, and the point of the games is that people need to move on and not idealize the past.
reply
ilc
1 year ago
[-]
There is a reason I didn't draw a hard parallel.

I use AI day to day. I see what it can do and can't.

But when you see pages, and pages, and pages of GPT spam all over the place, finding the nuggets of wisdom will be much harder than before the "bomb" was dropped.

Thus actually leading to the whole FO4 main plot.

Yes, life will always find a way. And yes, humanity cannot put the genie back in the bottle; we are much more likely to put it on a Fat Man catapult.

But it means that, in a sense, we will all have to accept this background radiation of AI shit as part of our new norm.

And this isn't the first time I've thought in similar ways. I remember reading older math texts (way pre-computer works in things like diffeq and PDE) and often thinking the explanations were clearer. Probably because of the increased effort to actually print something.

Who knows... maybe I'm just an old coot seeing patterns where there are none.

reply
somenameforme
1 year ago
[-]
> "I remember reading older math texts (way pre-computer works in things like diffeq and PDE) and often thinking the explanations were clearer."

This is 100% the case in chess as well. The books before and after the computer era are orders of magnitude different in terms of readability. I think a major shift in society has been in motivation. In the past if you were studying advanced mathematics, let alone writing about it, it was solely and exclusively because you absolutely loved the field. And you were also probably several sigmas outside the mean intellectually. Now? It's most often because of some vague direction such as wanting a decent paying job, which may through the twists and turns of fate eventually see you writing books in a subject you don't particularly have much enthusiasm for.

And the much more crowded 'intellectual market', alongside various image crafting or signaling motivations, also creates a really perverse incentive. Exceptional competence and understanding within a subject makes it easy to explain things, even the esoteric and fabulously complex. See: Richard Feynman. But in modern times there often seems to be a desire to go the other direction - and make things sound utterly complex, even when they aren't. I think Einstein's paper on special relativity vs many (most?) modern papers is a good example. Einstein's writing was such that anybody with a basic education could clearly understand the paper, even if they might not fully follow the math. By contrast, so many modern papers seem to be written as if the author had a well worn copy of the Thesaurus of Incomprehensibility at his bedside table.

reply
ableal
1 year ago
[-]
> Because sufficient amounts of quantity always beats quality

Hegel (as echoed by Marx): "merely quantitative differences beyond a certain point pass into qualitative changes" ( https://www.pnas.org/doi/10.1073/pnas.240462397 )

I always found that an interesting observation, whatever you think of the rest of their works.

reply
3cats-in-a-coat
1 year ago
[-]
It's an absolutely valid observation and I cite it often. We need to learn as a society how to separate things and think less wholesale.

The best people have silly beliefs, the worst people have great insights, and the vast majority of us are in-between.

If we discard everything "tainted" by imperfection, we'll be left with nothing good.

reply
Der_Einzige
1 year ago
[-]
Hegel is a hack and doesn't deserve to be cited here. Consider that he's really only famous because his class motivated a bunch of other philosophers (the Young Hegelians: Marx, Stirner, Bruno Bauer et al.) to meet in wine bars after class to complain about how terrible/impossible to understand his philosophy is.

Anyone who's even tried to read his stuff, e.g. the Phenomenology of Spirit, will tell you that he's a charlatan and a hack, and the people who constantly cite him (e.g. Zizek, Lacan, Foucault) are also hacks.

reply
ableal
1 year ago
[-]
Hey, can't a blind pig find an acorn? Is it less of an acorn because it was found by a pig?

Myself, I appreciate Schopenhauer a lot more, and we know what he thought of Hegel, but I'm not fanatical about it. If a hack hits on a good line, I'll nab it.

reply
mo_42
1 year ago
[-]
Why not? A paper is not necessarily scientific, nor a breakthrough. In my view, a paper is written and documented communication that's usually approved by peers in the field. Also, a blunt observation of nature can be noteworthy. However, we don't see such papers anymore as these fields have matured. Just go back in the history of your field and you will find trivial papers.
reply
esalman
1 year ago
[-]
In the medical field, letters and case studies often document observations that may not be groundbreaking. However, scientific journals typically feature content that contributes to existing knowledge, making it somewhat novel. Consequently, presenting a set of POST parameters as an arXiv paper could be perceived as undermining the integrity of the entire preprint service.
reply
henriquez
1 year ago
[-]
Real science is reserved for those with real expertise! As the self-anointed gatekeeper of real science I decree that other peoples’ work fails to meet the minimum standard I have set for real science! Mind you not the work other actors in the scientific community publish and accept among their peers - they are not real scientists and their work is trivial. For shame!
reply
mmkos
1 year ago
[-]
Strongly disagree. I do think trivial work is not paper-worthy, and it would be more beneficial not to publish such work, as it is mostly a waste of time for the peers reviewing it and the readers who will gain nothing from it. It's no lie that most publish for the sake of publishing, and this post just calls it out for what it is.
reply
henriquez
1 year ago
[-]
Trivial means different things to different people. I’m not really a fan of LLM hype but it seems to me a valid practice of scientific discovery to evaluate the use and optimization of such models.
reply
brookst
1 year ago
[-]
The real shame is that even HN has fallen into the trap of missing obvious and funny sarcasm unless it is clearly labeled.
reply
KyleBerezin
1 year ago
[-]
Sarcasm is often spoken with a sarcastic inflection. It doesn't translate well to text, regardless of the community.
reply
carbocation
1 year ago
[-]
The art and science of building these models is not disputed, but I think that the scientific value of prompts is tightly linked to reproducibility.

If you’ve developed a new prompt for a model whose weights you can directly access, then this prompt could have scientific value because its utility will not diminish over time or be erased with a new model update. I’m even generally of the view that a closed API endpoint whose expiration date is years into the future could have some value (but much less so). But simply finding a prompt for something like ChatGPT is not useful for science because we don’t even have certainty about which model it’s executing against.

Note that some of the best uses of these models and prompting have nothing to do with academics; this is a comment focused on the idea about writing academic papers about prompts.

reply
skilled
1 year ago
[-]
I can maybe understand the frustration from a “scientific” perspective, but for a lot of these “one prompt papers” - you still need someone to sit down and do the analysis and comparisons. Very few papers focus only on GPT/ChatGPT.

Additionally, it gives people other ideas to try for themselves. And some of this stuff might be useful to someone in a specific scenario.

It’s not glamorous research or even future-proof seeing as how certain prompts can be surgically removed or blocked by the owner of the model, but I don’t think it warrants telling people not to do it.

reply
gandalfgeek
1 year ago
[-]
If a new prompt enables a new task or enhances performance on a task then it absolutely should be published.

Back in the day, would compiler optimizations not have been worthy of publishing?

reply
snet0
1 year ago
[-]
It's hard to draw these lines, because you will certainly filter out a lot of bad (i.e. useless, low contribution to any field) papers, but you might also filter out some really important papers. Research being basic, or something anyone could've done, doesn't count against its potential importance, just against the expected value of its importance, I guess.

I'd rather we had a few too many bad papers than a few too few great papers.

reply
yieldcrv
1 year ago
[-]
Arxiv is like the MENSA of the tech world

The similarity being that it’s ego masquerading as academic.

Most things shared from there should have just been a blog post.

The last year has shown that AI/ML research and use did not need academic gatekeeping by PhDs, and yet many in that scene keep trying self-infatuated things with the lowest utility.

reply
etewiah
1 year ago
[-]
Behind all this is a valid question. How does one evaluate prompts and LLMs? As gipeties (custom gpts) become more popular millions of hours will be wasted by ones that have been built badly. Without some sort of automated quality control, gipeties will become a victim of their own success.
reply
potatoman22
1 year ago
[-]
What's the difference between a paper on a new prompt and a paper discussing a new domain-specific model, e.g. heart failure risk? If they analyze the problem and solution equally, they both seem useful. It's not like most other ML papers share their weights or datasets.
reply
JR1427
1 year ago
[-]
This reminds me of how there was a boom in half-baked studies around COVID, e.g. modelling this or that aspect of the pandemic, or around mask wearing.

I imagine that most of these will simply have had little to no impact, and will only serve to bolster the publication list of those who wrote them.

reply
u32480932048
1 year ago
[-]
ChatGPT droppings have to be at least as relevant and newsworthy as these findings [1]

  [1] https://www.wbur.org/news/2022/07/27/harvard-shorenstein-research-january-6-insurrection-president
reply
SamBam
1 year ago
[-]
Tired: Asking participants to sign an ethics pledge at the top of a tax return makes them more honest.

Wired: Asking an LLM to write out their steps first makes them more accurate.

They seem equally interesting to me, but one is a lot easier to replicate, and the other is easier to lie about.

reply
karxxm
1 year ago
[-]
It depends I guess.

If you solve a problem that had been around for a while and LLMs offer a new way of approaching it, then it can definitely become a paper.

Of course, one has to verify in sophisticated experiments that this approach is stable.

reply
jdefr89
1 year ago
[-]
I am sorry, but what can ChatGPT do that a couple of minutes of googling couldn't solve? Write half-hearted essays that all contain the same phrase?
reply
DrawTR
1 year ago
[-]
Anything generative? At its core, Google doesn't 'make' anything when you query it.
reply
empath-nirvana
1 year ago
[-]
Why don't you spend 15 minutes playing around with it and see what you can get it to do that google can't do?
reply
potatoman22
1 year ago
[-]
Generative LLMs can be turned into classifiers quite easily, search engines cannot.
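
For example, a minimal sketch of the idea: constrain the output to a fixed label set and parse the reply. Here complete() is just a stand-in for whatever model call you have available.

  LABELS = ["positive", "negative", "neutral"]

  def complete(prompt: str) -> str:
      return "positive"  # placeholder so the sketch runs without an API

  def classify(text: str) -> str:
      prompt = (
          "Classify the sentiment of the text as one of "
          f"{', '.join(LABELS)}. Reply with the label only.\n\nText: {text}"
      )
      answer = complete(prompt).strip().lower()
      return answer if answer in LABELS else "unknown"

  print(classify("The battery life is fantastic."))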
reply
d4rkp4ttern
1 year ago
[-]
Yes, these papers are optimizing for social media hype, i.e. what is the quickest and easiest path to making noise on social media?
reply
coldtea
1 year ago
[-]
Sorry, but if they can get away with it, they'll release it as a paper.

It's not like most papers are much above that anyway...

reply
samlhuillier
1 year ago
[-]
Times are changing. Human researchers will dedicate more and more time towards getting language models to work in desired ways rather than doing the research themselves. Language models will largely be the ones making "research" discoveries. Both should be considered valid research IMO.
reply
Racing0461
1 year ago
[-]
Doesn't academia incentivise quantity over quality anyways?
reply
darepublic
1 year ago
[-]
A new prompt is not a paper, but you can prompt it for a paper.
reply
gumballindie
1 year ago
[-]
Anyone caught doing this should be kicked out of the industry. Period. You're scamming those funding your "research", you are misleading readers, and you are producing low-quality content that wastes everyone's time.
reply
selfhoster11
1 year ago
[-]
Excuse me? Step by step wasn't paper-worthy? Hard disagree.

LLM research is currently in its infancy, because LLMs themselves are only a few years old. And a research field in its infancy is bound to have a few noteworthy "no sh*t, Sherlock" papers that seem obvious in hindsight.

The fact is, LLMs are a higher-order construct in machine learning, much like a fish is higher-order than a simple cellular colony. Lower-order ML constructs do not demonstrate emergent capabilities like step by step, stream of consciousness thinking, and so on.

Academics should be less jaded and approach the field with beginner's eyes. Because we are all beginners here.

reply
tensor
1 year ago
[-]
I'm not surprised at the defence of "prompt engineering" here. It's something easy to do with no real knowledge, and I'm sure having it dismissed hurts some people.

But I 100% agree with the author, "prompt engineering" is not science, and I'd say it's not engineering either. All you're doing is exploring the parameter space of particular model in a very crude way. There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."

None of the results of this process will transfer to any other model. It's simply not science. Papers like "step-by-step" are different, and relate more to learning and inference and do translate to different models and even different architectures.

Also, no, we are not all beginners here. Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.

reply
adastra22
1 year ago
[-]
> But I 100% agree with the author, "prompt engineering" is not science, and I'd say it's not engineering either. All you're doing is exploring the parameter space of particular model in a very crude way. There is no "engineering" going on in this process, just a bunch of trial and error.

I wonder what your definition of “science” or “engineering” is…

reply
gwervc
1 year ago
[-]
If you remove the AI glasses, "prompt engineering" is just typing words and seeing if the results match expectations... which is exactly what any search engine pays its testers for. Those testers are doing an important job in continually improving the quality of the product, but they aren't engineers, and even less so researchers.

Similarly a kid playing with the dose of water needed to build a sandcastle isn't a civil engineer nor an environmental researcher. Maybe on LinkedIn though.

reply
blueboo
1 year ago
[-]
I’m not sure the scientific method itself can withstand this sort of scrutiny. After all, it’s just making guesses about what will happen and then seeing what happens!
reply
j2kun
1 year ago
[-]
Except there's also, you know, building coherent theories and using those theories to predict the system behavior.
reply
selfhoster11
1 year ago
[-]
All right, here is a theory: LLMs contain "latent knowledge" that is sometimes used by the model during inference, and sometimes it isn't.

One way to "engage" these internal representations is to include keywords or patterns of text that make that latent knowledge more likely to "activate". Say, if you want to ask about palm trees, include a paragraph talking about a species of palm tree (no matter whether it contains any information pertaining to the actual query, so long as it's "thematically" right) to make a higher-quality completion more likely.

It might not be the actual truth or what's going on inside the model. But it works quite consistently when applied to prompt engineering, and produces visibly improved results.
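
To illustrate (a toy sketch of the priming idea above, nothing more): assemble the prompt from an on-topic passage plus the real question and send it as a single input.

  # The priming paragraph is merely thematic; whether it actually engages
  # useful internal representations is the hypothesis above, not a given.
  priming = (
      "The coconut palm (Cocos nucifera) is a tropical palm cultivated for "
      "its fruit, fibre and sap, and it tolerates sandy, saline soils."
  )
  question = "Why do some palm species survive hurricanes better than broadleaf trees?"
  prompt = f"{priming}\n\n{question}"
  print(prompt)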

reply
j2kun
1 year ago
[-]
> It might not be the actual truth or what's going on inside the model.

This sums up pretty nicely why prompt hacking is not science. A scientific theory is related in a concrete way to the mechanism by which the phenomenon being studied works.

reply
gowld
1 year ago
[-]
That is in no way a requirement for doing science.
reply
naremu
1 year ago
[-]
It's funny how often I see people bring up the "did you know" tidbit about software engineering not being "real" engineering in a traditional sense, which seems to go very uncontroversially.

But prompt engineering is still a pressure point for some people, despite being wildly simpler and more accessible (literally tell the thing to do a thing, and if it doesn't do the thing right, reword).

It feels as though we're getting to the technological equivalent of "what IS art anyways", and questions like if non traditional forms like video games are art (I'm thinking all the way up the chain to even say, Madden games)

And in my experience, when something is under constant questioning of whether or not it even counts as X, Y or Z, it usually can technically qualify, but...

If people are constantly debating whether or not it's even X, it's probably just not impressing people who don't engage in it, as opposed to "traditional" concepts of engineering and art, where part of the impression made comes from the investment and irreplaceable skillsets - things few, if any, others at the time could have done.

This is why taping a banana on the wall is definitely technically art, but not many outside the art community that tapes bananas to walls really think much of it. It's so mundane and accessible a feat that it doesn't garner much merit from passersby. It's art by the loosest technical definition, and is given a lot of credit for a small amount of effort anyone could've done.

Admittedly, "prompt engineering" is definitely less accessible than a roll of duct tape and a banana, but I think we used to just call it "writing/communication", and I guess those who feel capable at that often just do it manually anyway.

reply
0xfae
1 year ago
[-]
Right? I'm having a hard time imagining a definition that includes "trying new things and seeing what happens" but that doesn't include... "trying new things and seeing what happens"
reply
aeternum
1 year ago
[-]
"Science" has been twisted recently into a kind of witchcraft that can only be practiced by those anointed through the rigors of academia.

"Trust the science"

In reality, that is about the furthest from what you should do. As Feynman once said: "Science is the belief in the ignorance of experts". Electricity was also once considered a toy and good for nothing but parlor tricks.

reply
roguas
1 year ago
[-]
Especially given this would be a fine definition of engineering+science: "All you're doing is exploring the parameter space of particular model in a very crude way."
reply
Tarq0n
1 year ago
[-]
Science should aim to create general (that is, generalized or generalizable) knowledge. One prompt is just an anecdote, a method for creating performant prompts or deriving prompts from model characteristics would be more scientific.
reply
selfhoster11
1 year ago
[-]
> All you're doing is exploring the parameter space of particular model in a very crude way.

Yes. I have come to think of prompt engineering as, in a sense, doing an approximate SELECT query on the latent behavioural space (excuse my lack of proper terminology, my background in ML is pretty thin) that can be thought of as "fishing out" the agent/personality/simulator that is most likely to give you the kind of answer you want. Of course a prompt is a very crude way to explore this space, but to me this is a consequence of extremely poor tooling. For one, llama.cpp now has negative prompts, while the GPT-4 API will probably never have them. So we make do with the interface available.

> There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."

That is incorrect. It is true that there is a lot of trial and error, yes. But it's not true that it's pure guessing either. While my approach can be best described as a systematic variant of vibe-driven development, at its core it's quite similar to genetic programming. The prompt is mutable, and its efficacy can be evaluated, at least in a qualitative sense, against the last version of the prompt. By iterative mutation (rephrasing, restructuring/refactoring the whole prompt, changing out synonyms, adding or removing formatting, adding or removing instructions and contextual information), it is possible to go from a terrible initial prompt to a much more elaborate prompt that gets you 90-97% of the way towards nearly exactly what you want to do, by combining the addition of new techniques with subjective judgement on how to proceed (which is incidentally not too different from some strains of classical programming). On GPT-4, at least.
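
A very rough sketch of that loop, purely illustrative: mutate and score are placeholders here, since in practice the mutation is manual rephrasing and the scoring is usually a subjective judgement rather than a metric.

  import random

  def mutate(prompt: str) -> str:
      # Stand-in for manual rephrasing / adding or removing instructions.
      tweaks = [" Answer concisely.", " Show your reasoning.", " Use bullet points."]
      return prompt + random.choice(tweaks)

  def score(prompt: str) -> float:
      # Stand-in metric; in reality this is "does the output look better".
      return len(set(prompt.split())) / len(prompt.split())

  best = "Summarise the report."
  best_score = score(best)
  for _ in range(20):
      candidate = mutate(best)
      if score(candidate) > best_score:
          best, best_score = candidate, score(candidate)
  print(best)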

> None of the results of this process will transfer to any other model.

Is that so? Yes, models are somewhat idiosyncratic, and you cannot just drag and drop the same prompt between them. But, in my admittedly limited experience of cross-model prompt engineering, I have found that techniques which helped me to achieve better results with the untuned GPT-3 base model, also helped me greatly with the 7B Llama 1 models. I hypothesise that (in the absence of muddling factors like RLHF-induced censorship of model output), similarly sized models should perform similarly on similar (not necessarily identical) queries. For the time being, this hypothesis is impossible to test because the only realistic peer to GPT-4 (i. e. Claude) is lobotomised to the extent where I would outright pay a premium to not have to use it. I have more to say on this, but won't unless you ask in the interests of brevity.

> Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.

Language models have a long history, but a Markov chain can hardly be asked to create a basic Python client for a novel online API. I will also dispute the assertion that we know the "failings" of large language models. Several times now, previously "impossible" tasks have been proven eminently possible by further research and/or re-testing on improved models (better-trained, larger, with novel fine-tuning techniques, etc.). I am far from being on the LLM hype train, or saying they can do everything that optimists hope they can do. All I'm saying is that academia is doing itself a disservice by not treating the field as something to be explored without preconceptions, positive or negative.

reply
_gabe_
1 year ago
[-]
I feel like the author of this tweet wasn't saying that step-by-step isn't worthy; he was saying that non-reproducible results are not science. He emphasizes this twice in that tweet:

> one experiment on one data set with seed picking is not worthy reporting

> Additionally, we all need to understand this is just one good empirical result, now we need to make it useful…

reply
low_tech_love
1 year ago
[-]
Exactly, and I tend to agree with him. I argued here some time ago that a paper should take some time to try to explain why its results are happening, at least via a reasonable hypothesis (people didn't seem to agree). An experiment (even a simple one) starts from a null hypothesis and tries to disprove it. However, most of what we see coming out of "scientific" papers is basically just engineering: we put all of these things together in some way (out of pure guesswork and/or preference bias) and these results happened. We don't know why; good luck figuring it out. Here is one example where it works (don't ask where it doesn't; we intentionally left those out).

And while I obviously value the engineering advances we have seen, the science is still lacking, because not enough people are trying to understand why these things happen. Engineering advances are important and valuable, but I don't understand why people try so hard to call themselves scientists while basically skipping the scientific process entirely.

reply
jimbokun
1 year ago
[-]
Lots of onions in the varnish.
reply
roguas
1 year ago
[-]
Is it non-reproducible? Also, results whose reproducibility can be measured and appears stable are perfectly good science. I dislike it when people throw around statements like that.
reply
_gabe_
1 year ago
[-]
I have no idea. The author of that tweet seems to imply that the results aren’t reproducible. I was just commenting to point out that the author’s intent may have been different from what the grandparent comment was saying.
reply
phkahler
1 year ago
[-]
>> LLM research is currently in its infancy

Everything that went into creating GPT-4 is AI/science or whatever. Probing GPT-4 and trying to understand and characterize it is also a very worthy thing to do - how else could it be improved upon? But if making GPT is science, I'd say this stuff is more akin to psychology ;-)

reply
dr_dshiv
1 year ago
[-]
Machine psychology
reply
jjordan
1 year ago
[-]
2080 graduates of Machine Psychology will look back on this post and smile.
reply
jejeyyy77
1 year ago
[-]
you realize nobody understands WHY or HOW these models work under the hood, right?

It's akin to evolution - we understand the process; that part is simple. But the outputs (the organisms) are what we have to investigate to understand how they work.

reply
kergonath
1 year ago
[-]
> you realize nobody understands WHY or HOW these models work under the hood right?

Of course we understand how they work, we built them! There is no mystery in their mechanisms, we know the number of neurons, their connectivity, everything from the weights to the activation functions. This is not a mystery, this is several decades of technical developments.

> it's akin to evolution - we understand the process - that part is simple.

There is nothing simple about evolution. Things like horizontal gene transfer are very much not obvious, and the effect of things like environment is a field of active research.

> But the output/organisms we have to investigate how they work.

There is a fundamental difference with neural networks here: there are a lot of molecules in an animal’s body about which we have no clue. Similarly, we don’t know what a lot of almost any animal’s DNA encodes. Model species that are entirely mapped are few and far between. An artificial neural network is built from simple bricks that interact in well-defined ways. We really cannot say the same thing about chemistry in general, much less biochemistry.

reply
zamfi
1 year ago
[-]
> Of course we understand how they work, we built them! There is no mystery in their mechanisms, we know the number of neurons, their connectivity, everything from the weights to the activation functions. This is not a mystery, this is several decades of technical developments.

The discovery of DNA’s structure was heralded as containing the same explanatory power as you describe here.

Turns out, the story was much more complicated then, and is much more complicated now.

Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.

We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.

reply
kergonath
1 year ago
[-]
> The discovery of DNA’s structure was heralded as containing the same explanatory power as you describe here.

Not at all. It's like saying that since we can read hieroglyphics, we know all about ancient Egypt. Deciphering DNA is a tool for understanding biology; it is not that understanding in itself.

> Turns out, the story was much more complicated then, and is much more complicated now.

We are reverse engineering biology. We are building artificial intelligence. There is a fundamental difference and equating them is fundamentally misunderstanding both of them.

> Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.

How so? They can do it because we taught them, there is no magic.

> We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.

Exactly in the same way that nobody understands in detail how a complex modern SoC works. Again, there is no magic.

reply
phkahler
1 year ago
[-]
>> Exactly in the same way that nobody understand in detail how a complex modern SoC works. Again, there is no magic.

That's absolute BS. Every part of a SoC was designed by a person for a specific function. It's possible for an individual to understand - in detail - large portions of SoC circuitry. How any function of it works could be described in detail down to the transistor level by the design team if needed - without monitoring its behavior.

reply
zamfi
1 year ago
[-]
> How so? They can do it because we taught them, there is no magic.

Yeah, no. I mean, we can’t introspect the system to see how it actually does programming at any useful level of abstraction. “Because we taught them” is about as useful a statement as “because its genetic parents were that way”.

No, of course it’s not magic. But that doesn’t mean we understand it at a useful level.

reply
shwouchk
1 year ago
[-]
Why stop at chemistry? Chemistry is fundamentally quantum electrodynamics applied to huge ensembles of particles. QED is very well understood and gives the best predictions we have to date of any scientific theory.

How come we don’t entirely understand biology then?

reply
kergonath
1 year ago
[-]
> Why stop at chemistry? Chemistry is fundamentally quantum electrodynamics applied to huge ensembles of particles.

Chemistry is indeed applied QED ;) (and you don't need massive numbers of particles to have very complex chemistry)

> How come we don’t entirely understand biology then?

We understand some of the basics (even QED is not reality). That understanding comes from bottom-up studies of biochemistry, but most of it comes from top-down observation of whatever there happens to be around us. The trouble is that we are using this imperfect understanding of the basics to reverse engineer an insanely complex system that involves phenomena spanning 9 orders of magnitude both in space and time.

LLMs did not spawn on their own. There is a continuous progression from the perceptron to GPT-4, each one building on the previous generation, and every step was purposeful and documented. There is no sudden jump, merely an exponential progression over decades. It's fundamentally very different from anything we can see in nature, where nothing was designed and everything appears from fundamental phenomena we don't understand.

As I said, imagining that the current state of AI is anything like biology is a profound misunderstanding of the complexity of both. We like to think we're gods, but we're really children in a sandbox.

reply
shwouchk
1 year ago
[-]
I will ignore your patronizing remarks beyond acknowledging them here, in order to promote civil discourse.

I think you have missed my point by focusing on biology as an extremely complex field; it was my mistake to use it as an example in the first place. We don’t need to go that far.

Sure, LLMs did not spawn on their own. They are the result of thousands of years of progress in countless fields of science and engineering - like any modern invention, essentially.

Here, let me make sure we are on the same page about what we’re discussing: as I understand it, whether “prompt engineering” can be considered an engineering/science practice. Personally, I haven’t considered this enough to form an opinion, but your argument does not sound convincing to me.

I guess your idea of what LLMs represent matters here. The way I see it, in some abstract sense we as a society are exploring a current peak - in compute dollars or FLOPS, and in performance on certain tasks - of a rather large but also narrow family of functions. By focusing our attention on functions composed of pieces we understood how to effectively find parameters for, we were able to build rather complicated processes for finding parameters for the compositions.

Yes, the components are understood, at various levels of rigor, but the thing produced is not yet sufficiently understood - partly because of the cost of reproducing such research, and partly because of the complexity of the system, which drives that cost.

The fact that “prompt engineering” exists as a practice, and that companies supposedly base their business models on secret prompts, is for me a testament to the fact that these models are not well understood. A well-understood system you design has a well-understood interface.

Now, I haven’t noticed a specific post the OP was criticizing, so I take it his remarks were general. He seems to think that some research is not worth publishing. I agree that I would like research to be of high quality, but that is subjective. Is it novel? Is it true?

Now, progress will be progress, and I’m sure current architectures will change and models will get larger. It may be that a few giants are the only ones running models large enough to require prompt engineering. Or we may find a way to have those models understand us better than a human ever could - doubtful, and post-singularity anyway, by definition.

In either case, yes, it is probably a temporary profession. But if open research continues in those directions as well, there will be a need for people to figure out ways to communicate effectively with these models. You dismiss them as testers.

However, progress in science and engineering is often driven by data where theory is lacking, and I’m not aware of any deep theory as of yet - e.g. something that would predict how well a certain architecture will perform. Engineering runs ahead of theory, driven by money.

As in the physics we both mentioned, knowing the component parts does not automatically grant you understanding of the whole. Even knowing everything there is to know about the relevant physical interactions, protein folding was a tough problem that, as far as I remember, has had a lot of success with tools from the ML field. It is squarely in the realm of physics, and we still can’t give good predictions without testing (computationally).

If someone tested some folding algorithm, visually inspected the results, and then found a trick to consistently improve on them for some subclass of proteins, would that be worthy of publishing? If yes, why is this different? If not, why not?

reply
jejeyyy77
1 year ago
[-]
We designed the process. We didn't design the models - the models were "designed" by the features of a massive dataset and a massive number of iterations.

Even if you understand evolution, you still don't understand how the human body or mind works. That needs to be investigated and discovered.

In the same way, understanding how these models were trained doesn't mean you understand how the models work. That needs to be investigated and discovered.

reply
dartos
1 year ago
[-]
What is psychology, but applied biology?

https://xkcd.com/435/

reply
eimrine
1 year ago
[-]
Psychology is a religion-like pseudoscience; they cannot even define what "psy" is without using concepts from religion, such as a soul.

upd: these statements of mine are so controversial that the number of "points" just dances the lambada. The psy* areas are clearly polarized: some people upvote all my messages in this topic and some others downvote all of them. This is a sign of something interesting, but I am not ready to elaborate on that in this comment, which is going to become [flagged] eventually.

reply
phkahler
1 year ago
[-]
>> This is a sign of something interesting

Yeah, you might understand it if you study psychology or perhaps sociology.

reply
fkyoureadthedoc
1 year ago
[-]
> This is a sign of something interesting but I am not ready to elaborate on this statement in this comment which is going to become [flagged] eventually.

Yet you're being a reply guy all over this thread; you might as well just elaborate, since you clearly have the time and interest.

reply
eimrine
1 year ago
[-]
Psy* pseudoscience is among the few hills I would gladly die on. Also free/libre software, Lisp, and cryptocurrency with no premine.
reply
beepbooptheory
1 year ago
[-]
Dude, if you think φύσις is free of its own philosophical baggage, I've got a bridge to sell you.
reply
eimrine
1 year ago
[-]
> its own philosophical baggage

Could you elaborate on this statement?

reply
beepbooptheory
1 year ago
[-]
If your contention is just something like "the root psy- comes out of mystical/spiritual conceptions in Ancient Greece, and that speaks to the bunk/ungrounded conceptions of modern psychology," then I would ask why the same critique is not levied against the ancient Greek conception of "nature" and the "natural" from which we get the word "physics".

You might retort here: "Ah well, 'nature' is just the word we use when we speak of observable phenomena in the hard sciences; it's not muddied by religion like that crock stuff, psychology."

And then I would say, "ok, if 'nature' is just observable phenomena, what is the aim or purpose of the hard sciences? If it is all just observing/experimenting on discrete phenomena, there would be nothing we could do or conclude from the rigor of physics."

You laugh at my insanity (well, if you believed in such a thing): "But we do conclude things from physics, because experiments are reproducible, and with their reproducibility we can gain confidence in generalizing the laws of our universe."

And yes! You would be correct here. But now all of a sudden you have committed physics to something just as fundamentally "spiritual" as the soul: that the universe is sensible, rational, and "with laws" - which is indeed just the very same mystical "nature" of ancient Greece from which we get phys-.

But this need not be some damning critique of physics itself (like psychology), and rather, can lead to a higher level understanding of all scientific pursuits: that we are everywhere cursed by a fundamental incompleteness, that in order even to enter into scientific pursuit we must shed an absolute skepticism for a qualified one. Because this is the only way we accumulate a network of reinforced hypotheses and conceptions, which do indeed help us navigate the purely phenomenal world we are bound in.

reply
Woshiwuja
1 year ago
[-]
What? Are you literally referring to the ancient Greek psy? Jesus...
reply
eimrine
1 year ago
[-]
What is incorrect in this reference? You have not proposed any counterarguments. Also, if you just need fresher data: how do you propose to interpret the results of Rosenhan's experiment?
reply
wavemode
1 year ago
[-]
That lying about your symptoms to doctors leads to incorrect diagnoses?
reply
eimrine
1 year ago
[-]
They are _not_ doctors in terms of evidence-based medicine, just policemen without a token. The problem is obviously not incorrect diagnoses; I can lie to any other doctor about any symptoms and just go home with zero obstruction from the feds.
reply
nick222226
1 year ago
[-]
As someone who was formerly in a mental ward for acute crisis, I would say that at least the 72 hour hold was an essential and necessary part of my treatment. I don't think that staying at home with unprepared family members for the acute period would have worked out, and I don't even have a problematic home environment!

The flip side of the coin is that I was in a really high quality hospital, I'm sure there are hospitals or facilities that can be more harmful rather than helpful.

I also have a problem with the way they treat mental health like cancer: once you have a diagnosis, you will always have it. There are zero diagnostic criteria for "fully recovered" or for removing dependence on medication, even after 5 or 10 years. It's also treated like a scarlet letter for insurance and unrelated things like TSA PreCheck - no matter how well you are doing, you are still considered some level of risk to yourself and society. Though I could be wrong... the recurrence chart over time for my specific acute mania (with no depressive episodes) does look a lot like cancer remission charts, with an asymptotic approach to 80%+ recurrence after 2-4 years.

reply
Woshiwuja
1 year ago
[-]
Don't put people inside mental asylums when they are not ill?
reply
eimrine
1 year ago
[-]
There is no evidence that even one well-defined illness exists in the psy* fields. For example, suppose I tell you that a person X fell ill with schizophrenia. What do you know about X or X's brain?
reply
oasisbob
1 year ago
[-]
Most psychologists would be much more interested in person X's behavior, rather than their brain.
reply
eimrine
1 year ago
[-]
And believe in such a false notion as free will?

So what part of the organism (if not the brain) could a psy* specialist possibly heal - is it an arm, a leg, or a spine?

reply
thesz
1 year ago
[-]
> Lower-order ML constructs do not demonstrate emergent capabilities like step by step, stream of consciousness thinking, and so on.

As a matter of fact, I did a project on text normalization - e.g., translating "crossing of 6 a. and 12 s." into "crossing of sixth avenue and 12th street" - with a simple LM (order 3) and beam search over lattice paths, the lattice formed from hypothesis variants. I got a two-fold decrease in word error rate compared to the simpler approach of just outputting the most probable WFST path. It was not "step by step, stream of consciousness" thinking, but it was nevertheless a very impressive feat when the system started to know more without much extra effort.

Large LMs do not just output the single "most probable" token; they output the most probable sequence of tokens, and that is done with beam search.

As you can see, my experience tells me that beam search alone can noticeably, if not tremendously, improve the quality of the output, even for very simple LMs.

And if I may, the higher-order construct here is the beam search, not the LMs-as-matrices-of-coefficients themselves. Beam search has been used in speech recognition for decades; SR does not work properly without it. LMs, apparently, also do not work well without it.
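
For readers unfamiliar with the idea, here is a minimal, generic beam-search sketch in Python. It is not the WFST/lattice setup described above; lm_score is a toy stand-in for a real (e.g. order-3) language model, and the vocabulary is made up for illustration.

    VOCAB = ["sixth", "6", "avenue", "a.", "street", "s.", "<eos>"]

    def lm_score(prefix, token):
        # Toy scorer: prefer spelled-out forms; a real system would query an n-gram LM.
        return -1.0 if token in ("sixth", "avenue", "street", "<eos>") else -3.0

    def beam_search(beam_width=3, max_len=5):
        beams = [((), 0.0)]  # (sequence, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == "<eos>":
                    candidates.append((seq, score))  # finished hypotheses carry forward unchanged
                    continue
                for tok in VOCAB:
                    candidates.append((seq + (tok,), score + lm_score(seq, tok)))
            # Keep the best few partial hypotheses instead of committing to the single greedy one.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    for seq, score in beam_search():
        print(round(score, 1), " ".join(seq))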

reply
azinman2
1 year ago
[-]
Is there something you can link to? I’d like to learn more.
reply
thesz
1 year ago
[-]
Here it is: https://huggingface.co/blog/how-to-generate

Beam search at Wikipedia: https://en.wikipedia.org/wiki/Beam_search

Beam search in Sqlite: https://www.sqlite.org/queryplanner-ng.html#_a_difficult_cas...

Beam search is more interesting than its application within the AI field.

reply
lumost
1 year ago
[-]
Anecdotally,

The field of ML has suffered from a problem where there were more entrants than available positions or viable work. In many industrial positions, it was possible to hide a lack of progress behind ambiguity and/or poor metrics. This led to a large amount of gatekeeping around productive work, as ultimately there wasn't enough to go around in the typical case.

This attitude is somewhat pervasive, leading to posts like the one above. Granted, the Nth prompting paper probably isn't interesting - but new programming languages for prompts and prompt-discovery techniques are very exciting. I wouldn't be surprised if automatic prompt expansion using a small pre-processing model turned out to be an effective technique.
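
To illustrate what I mean by that last sentence, here is a hypothetical sketch of the pipeline: a small, cheap model rewrites the user's prompt before the large model answers it. small_model and large_model are placeholders of my own, not any specific API.

    def small_model(text):
        # Placeholder: a cheap model that expands a terse prompt with extra instructions.
        return (
            "Answer the following request. Be explicit about assumptions and "
            "reason step by step before giving a final answer.\n\n" + text
        )

    def large_model(prompt):
        # Placeholder: the expensive model that produces the final answer.
        return f"[answer to: {prompt!r}]"

    def answer(user_prompt):
        expanded = small_model(user_prompt)  # prompt expansion happens here
        return large_model(expanded)

    print(answer("summarize this bug report"))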

reply
renonce
1 year ago
[-]
Rather than saying that a new prompt is not science, I would say it's certainly a new discovery that is worth sharing. Maybe there should be a higher bar for papers, but why do we have to turn a discovery into a paper - and not publish it at all if it cannot be made into one - when a simple blog post or tweet would convey the discovery perfectly well?
reply
zxt_tzx
1 year ago
[-]
On a related note, there is a recent tweet purportedly showing that "offering to give a tip to ChatGPT" improved performance (or at the very least resulted in longer responses, which might not be a good proxy for performance): https://twitter.com/voooooogel/status/1730726744314069190
reply
ben_w
1 year ago
[-]
I'm reading this tweet as saying "you can't write a paper by prompting an LLM in these ways" rather than "you can't write a paper characterising the impact of prompting an LLM in these ways".

I'd agree the former won't get you a complete anything (longer than ~30 lines) by itself (90-95% cool, but with some incredible errors in the other 5-10%).

I'd also agree that the latter is worthy of publishing.

reply
WhitneyLand
1 year ago
[-]
What you’re saying seems to be compounded by the black-box aspects here.

Instinctively, a prompt may look like one line of code. We can often know or prove what a compiler is doing, but higher-dimensional space is just not understood in the same way.

What other engineered thing in history has had this much immediately useful emergent capability? Genetic algorithms finding antenna designs and flocking algorithms are fantastic, but I would argue they are narrower in scope.

Of course a paper is still expected to expand knowledge, and to have rigor and impact, but I don’t see why a prompt-centric contribution would inherently preclude this.

reply
nipponese
1 year ago
[-]
Sorry, ignoramus here: Which paper is “Step by step”?
reply
zxt_tzx
1 year ago
[-]
I think it's a reference to the "discovery" that if you ask GPT-4 to answer your query "step by step", it'll actually offer a better response than otherwise.
reply
jebarker
1 year ago
[-]
> emergent capabilities like step by step, stream of consciousness thinking

What makes these things "emergent capabilities"? They seem like pretty straightforward consequences of autoregressive generation. If you feed output back as input, then you get more output conditioned on that new input, and stream-of-consciousness generation is just stochastic parroting, isn't it?
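
For clarity, this is the loop I mean - a toy sketch of autoregressive generation where each sampled token is appended to the context and conditions the next step. sample_next is a made-up stand-in for a real model's forward pass and sampling, nothing more.

    import random

    def sample_next(context):
        # Toy next-token sampler; a real model would score the whole vocabulary given the context.
        continuations = {"Let's": ["think"], "think": ["step"], "step": ["by", "."], "by": ["step"]}
        return random.choice(continuations.get(context[-1], ["."]))

    def generate(prompt, max_new_tokens=8):
        context = list(prompt)
        for _ in range(max_new_tokens):
            token = sample_next(context)  # conditioned on everything produced so far
            context.append(token)         # the output is fed back as input
            if token == ".":
                break
        return context

    print(" ".join(generate(["Let's"])))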

reply
selfhoster11
1 year ago
[-]
They are emergent in the sense that there is nothing in the pre-training dataset that would show the LLM by example how to, say, compare and contrast any given pairing of fruits, technologies, or fictional settings, while thinking with the mindset of a doctor who hates both options, and on top of that make sure the answer ends up formatted as a stream of consciousness. It can learn all these aspects from the source data individually, in isolation, but there is no way there are examples that show how to combine it all (awareness of world information + knowledge of how to use it) into a single answer. That's probably a very clumsy example - others online have supplied more rigorous ones that I recommend checking out.

Strictly speaking, it might be "stochastic parroting". But really, if you want to be a great and supremely effective stochastic parrot, you have to learn an internal representation of certain things so that you can predict them. And there are hints that this is exactly what a sufficiently-large large language model is doing.

reply
skyde
1 year ago
[-]
I have a question for people in the field!

Would GPT with only pretraining and no fine-tuning behave better when the prompt is "let's think step by step"?

Or did this prompt only work because a fine-tuning dataset containing many "let's think step by step" prompts was used?

reply
selfhoster11
1 year ago
[-]
Working with pure pre-trained models is quite hard and takes some practice. The key part is that "let's think step by step" is a technique that humans also use, and therefore (I think) it would be somewhat represented in the pre-training corpus. It would be harder to "activate" this mode of thinking in a base model than it is in a fine-tuned model, but it is possible with some elbow grease.
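
As a purely illustrative example (the exact wording is my own assumption, not a documented recipe), a few-shot prefix is one way to nudge a base model into that mode:

    # Hedged illustration only: a worked example that a base model is likely to imitate.
    FEW_SHOT_PREFIX = """\
    Q: A farmer has 3 pens with 4 sheep each. How many sheep in total?
    Let's think step by step.
    Step 1: Each pen holds 4 sheep.
    Step 2: There are 3 pens, so 3 * 4 = 12.
    Answer: 12

    Q: {question}
    Let's think step by step.
    """

    def build_prompt(question):
        # A base model completing this text tends to continue in the same step-by-step pattern.
        return FEW_SHOT_PREFIX.format(question=question)

    print(build_prompt("If a book costs 7 dollars, how much do 5 books cost?"))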
reply
Der_Einzige
1 year ago
[-]
NLP/computational linguistics is NOT a new field. The ACL was founded in 1962.

LM research is also old; papers using very shitty LMs (i.e. Markov chains) and discussing them have existed since the early 2000s and likely before.

Check yourself before you try to check others.

reply
Retr0id
1 year ago
[-]
Try asking a Markov chain to think step by step.
reply
tovej
1 year ago
[-]
By definition a Markov chain does everything step by step already, no need to ask!
reply
Retr0id
1 year ago
[-]
By that logic, so does a GPT
reply
idiliv
1 year ago
[-]
The parent post is talking about LLMs, i.e. large LMs. Research on LLMs is indeed in its infancy.
reply
alickz
1 year ago
[-]
Still beats most psychology papers
reply
margorczynski
1 year ago
[-]
Tbh, both are mostly alchemy, with some more-or-less expert lingo thrown in to make them sound more scientific.

AI/ML resembles alchemy more than it does, say, physics: putting stuff into a pot and seeing what comes out. A lot of the math in those papers doesn't provide anything but truisms; most of it is throwing stuff at the wall and seeing what sticks.

reply
whywhywhywhy
1 year ago
[-]
Academia needs to get over itself. I can't wait to see how amazing this tech is going to get when the next generation, who decide never to bother with those stuffy, navel-gazing institutions, become the driving force behind it.

Looking forward to "I made this cool thing, here's the code/library you can use" rather than the papers/gatekeeping/ego stroking/"muh PhD".

Think about it: if Google had built an AI team around the former rather than the latter, they wouldn't have risked the future of their entire company and squandered their decade-long head start.

reply