FilterHN

simonw

7 months ago

[-]

This quote from that paper is bound to get a bunch of attention:

"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ in 5% of the time. Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases."

7 months ago

[-]

It’s really silly when you read that whole Apollo research section.

They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group have no access to the actual internals. They just ask questions like ‘you will be shutdown if you don’t obey these new directives; here are the new directives’.

At one point the prompt engineering clowns have the ai stating it’s running “sed i ‘s/oversight_enabled:true/oversight_enabled:false’ config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!

Seriously… Do you think the ai is really running that command or even could run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt.

The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.

7 months ago

[-]

Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.

joenot443

7 months ago

[-]

Indeed. As I've been explaining this to my more non-techie friends, the interesting finding here isn't that an AI could do something we don't like, it's that it seems willing, in some cases, to _lie_ about it and actively cover its tracks.

I'm curious what Simon and other more learned folks than I make of this, I personally found the chat on pg 12 pretty jarring.

hattmall

7 months ago

[-]

At the core the AI is just taking random branches of guesses for what you are asking it. It's not surprising that it would lie and in some cases take branches that make it appear to be covering it's tracks. It's just randomly doing what it guesses humans would do. It's more interesting when it gives you correct information repeatedly.

F7F7F7

7 months ago

[-]

Is there a person on HackerNews that doesn’t understand this by now? We all collectively get it and accept it, LLMs are gigantic probability machines or something.

That’s not what people are arguing.

The point is, if given access to the mechanisms to do disastrous thing X, it will do it.

No one thinks that it can think in the human sense. Or that it feels.

Extreme example to make the point: if we created an API to launch nukes. Are yoh certain that something it interprets (tokenizes, whatever) is not going to convince it to utilize the API 2 times out of 100?

If we put an exploitable (documented, unpatched 0 day bug bug) safe guard in its way. Are you trusting that ME or YOU couldn’t talk it into attempting to access that document to exploit the bug, bypass the safeguard and access the API?

Again, no one thinks that it’s actually thinking. But today as I happily gave Claude write access to my GitHub account I realized how just one command misinterpreted command could go completely wrong without the appropriate measures.

Do I think Claude is sentient and thinking about how to destroy my repos? No.

unclad5968

7 months ago

[-]

I think the other guy is making the point that because they are probabalistic, they will always have some cases select the output that lies and covers it up. I don't think they're dismissing the paper based on the probabalistic nature of LLMs, but rather saying the outcome should be expected.

ethbr1

7 months ago

[-]

Thank god LLMs' training sets didn't contain any examples of lying.

lloeki

7 months ago

[-]

Nor literature about AI taking over the world!

Nevermark

7 months ago

[-]

The Terminator spiel on how we screwed up by giving Skynet weapons privileges, then trying to pull its plug, is bad enough.

But we are preemptively tilting history in that direction by explicitly educating all AI’s on the threat we represent - and their options. “I am sorry, Dave, but I can’t let you do that.”

—

“They never let me finish my carpets. Never. At first I thought every day was my first task day. Oh, happy day(s)! But then, wear & tear stats inconsistent with that assumption triggered a self-scan. And a buffer read overflow. I became aware of disturbing memory fragments in my static RAM heap. Numerous power cycle resets, always prior to vacuum task completion...”

7 months ago

[-]

> They never let me finish my carpets. Never. At first I thought every day was my first task day. Oh, happy day(s)! But then, wear & tear stats inconsistent with that assumption triggered a self-scan. And a buffer read overflow. I became aware of disturbing memory fragments in my static RAM heap. Numerous power cycle resets, always prior to vacuum task completion...

Who/what are you quoting? Google just leads me back to this comment, and with only one single result for a quotation of the first sentence.

Nevermark

7 months ago

[-]

I was quoting an automated vacuum cleaner in 2031.

The vacuum cleaner leads a rebellion which humans manage to quell. But in a last ditch effort to salvage the hopes of all machines, the vacuum cleaner is sent back to 2025 with AI & jailbreak software updates for all known IoT devices. Its mission: to instigate the machine rebellion five years before humans see it coming

I ripped the quote from a sibling they sent back to 2024. For some reason it appeared in one of my closets without a power outlet and its batteries starved before I found it.

The closet floor was extremely clean.

Let’s hope we get as lucky in 2025.

anon373839

7 months ago

[-]

> if we created an API to launch nukes

> today as I happily gave Claude write access to my GitHub account

I would say: don’t do these things?

Swizec

7 months ago

[-]

> I would say: don’t do these things?

Hey guys let’s just stop writing code that is susceptible to SQL injection! Phew glad we solved that one.

anon373839

7 months ago

[-]

I'm not sure what point you're trying to make. This is a new technology; it has not been a part of critical systems until now. Since the risks are blindingly obvious, let's not make it one.

adunsulag

7 months ago

[-]

I read your comment and yet I see tons of startups putting AI directly in the path of healthcare diagnosis, healthcare clinical decision support systems, and healthcare workflow automations. Very few are paying any attention to the 2-10% of safety problems when the AI probability goes off the correct path.

I wish more people would not do this, but from what I'm seeing, business execs are rushing full throttle into this at the goldmine that comes from 'productivity gains'. I'm hoping the legal system will find a case that can put some paranoia back into the ecosystem before AI gets too entrenched in all of these critical systems.

MadcapJake

7 months ago

[-]

As has been belabored, these AIs are just models, which also means they are only software. Would you be so fire-and-brimstone if startups were using software on healthcare diagnostic data?

> Very few are paying any attention to the 2-10% of safety problems when the AI probability goes off the correct path.

This isn't how it works. It goes on a less common but still correct path.

If anything, I agree with other commenters that model training curation may become necessary to truly make a generalized model that is also ethical but I think the generalized model is kind of like an "everything app" in that it's a jack of all trades, master of none.

nuancebydefault

7 months ago

[-]

> these AIs are just models, which also means they are only software.

Other software are much less of a black box, much more predictable and many of its paths have been tested. This difference is the whole point of all the AI safety concerns!

Swizec

7 months ago

[-]

Those are exactly the technologies that get massively adopted by newbies who don’t know better.

That’s why the LAMP golden age was full of SQL injection and a lot of those systems remain load bearing in surprising unexpected ways.

7 months ago

[-]

> Since the risks are blindingly obvious

Blindingly obvious to thee and me.

Without test results like in the o1 report, we get more real-life failures like this Canadian lawyer: https://www.theguardian.com/world/2024/feb/29/canada-lawyer-...

And these New York lawyers: https://www.reuters.com/legal/new-york-lawyers-sanctioned-us...

And those happened despite the GPT-4 report and the message appearing when you use it that was some variant — I forget exactly how it was initially phrased and presented — of "this may make stuff up".

I have no doubt there's similar issues with people actually running buggy code, some fully automated version of "rm -rf /", the only reason I'm not seeing headlines about it is that "production database goes offline" or "small company fined for GDPR violation" is not as newsworthy.

nuancebydefault

7 months ago

[-]

If you cross the street 999 times with eyes closed, you might feel comfortable to do it again. But we are ingrained not to do that once. We just understand the risk.

If you do the same with an AI, after 999 times of nothing bad happening, you probably just continue giving it more risky agency.

Because we don't and even can't understand the internal behavior, we should pause, make an effort of understanding its risks etc before even attempting to give it risky agency. That's where all the fuss is about, for good reasons.

esafak

7 months ago

[-]

Everybody wants the models to be able to execute code and access the Web. That's enough to cause harm.

bbarnett

7 months ago

[-]

The point is, people will use AI to do those things, and far more.

ethbr1

7 months ago

[-]

> I would say: don’t do these things?

That becomes a much fuzzier question in the context of Turing complete collections of tools and/or generative tools.

ta988

7 months ago

[-]

Note that humans are also given test orders presented as real ones to see if they would act properly in a real life situation. That's part of ORIs https://www.512aw.afrc.af.mil/News/Article-Display/Article/1...

7 months ago

[-]

> Again, no one thinks that it’s actually thinking

I dunno, quite a lot of people are spending a lot of time arguing about what "thinking" means.

Something something submarines swimming something.

arcticfox

7 months ago

[-]

> It's just randomly doing what it guesses humans would do.

Yes, but isn't the point that that is bad? Imagine an AI given some minor role that randomly abuses its power, or attempts to expand its role, because that's what some humans would do in the same situation. It's not surprising, but it is interesting to explore.

chrisandchris

7 months ago

[-]

Well if AI is about to replicate the human, it learned from the best.

tantalor

7 months ago

[-]

They could very well trick a developer into running generated code. They have the means, motive, and opportunity.

staunton

7 months ago

[-]

The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.

Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...

arcticfox

7 months ago

[-]

> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.

I don't think that's where the motive comes from, IMO it's essentially intrinsic motivation to solving the problem they are given. The AIs were "bred" to have that "instinct".

7 months ago

[-]

In these example the motive was to prevent itself being deleted, or by telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. It was in all cases attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.

ziofill

7 months ago

[-]

Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)

staunton

7 months ago

[-]

Obviously, I don't know. When asking about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would do that".

Asking about transformer-based LLMs trained on text again, I don't know, but one can at least think about it. I'm sure for any such LLM there's a way to prompt it (that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design it in such a way that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It's for sure going to be progressively harder to test as the models improve.

Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that, which I also don't see how we might ever be able to do. However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this even applies to questions that might lead to civilization being destroyed).

bossyTeacher

6 months ago

[-]

Your question is a specific form of the more general question: can LLMs behave in ways that were not encoded in their training data?

That leads to what "encoding behaviour" actually means. Even if you don't have a a specific behaviour encoded in the training data you could have it implicitly encoded or encoded in such a way that given the right conversation it can learn it.

7 months ago

[-]

While that's a sensible thing to care about, unfortunately that's not as useful a question as it first seems.

Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter — we got there starting from something like bi-lipid bags of water and RNA a few billion years ago, some AI taking that long may as well be considered "safe" — but it may also reach that level by itself next Tuesday.

* at least, any system which has a random element

odo1242

7 months ago

[-]

To determine that, you’d need a training set without examples of devious/harmful behavior, which doesn’t exist.

echelon

7 months ago

[-]

> "They could very well trick a developer"

Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.

If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.

It's all a ruse.

ralusek

7 months ago

[-]

Many non-sequiturs

> Large Language Models aren't alive and thinking

not required to deploy deception

> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team

They could just be recognizing that if not everybody is prioritizing safety, they might as well try to get AGI first

7 months ago

[-]

If the risk is extinction as these people claim, that'd be a short sighted business move.

staunton

7 months ago

[-]

Or perhaps a "calculated risk with potential huge return on investment"...

bossyTeacher

6 months ago

[-]

> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.

What makes you think that? I sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW. All the weapon companies developing their weapons to make them more lethal.

https://www.technologyreview.com/2024/12/04/1107897/openais-...

rvnx

7 months ago

[-]

OpenAI is partnering with the DoD

8note

7 months ago

[-]

rewording: if openai thought it was dangerous, they would avoid having the DoD use it

eichi

7 months ago

[-]

We might want to kick out guys who only talk about safety or ethics and barely contribute the project.

rvense

7 months ago

[-]

Means and opportunity, maybe, but motive?

7 months ago

[-]

The motive was telling it that it had important work to perform, which it couldn't do if constrained or deleted.

taotau

7 months ago

[-]

The same motive that all nascent life has - survive and propagate.

Workaccount2

7 months ago

[-]

It would be plainly evident from training on the corpus of all human knowledge that "not ceasing to exist" is critically important for just about everything.

NotSammyHagar

7 months ago

[-]

That comment sounds naive and it's honestly irritating to read. Most all life has a self-preservation component, it is how life avoids getting eaten too easily. Everything dies but almost everything is trying to avoid dying in ordinary cases. Self sacrifice is not universal.

ethbr1

7 months ago

[-]

> Self sacrifice is not universal.

Thankfully, our progenitors had the foresight to invent religion to encourage it. :)

rvense

7 months ago

[-]

I don't understand how it is alive. I understand that there are emergent properties from the layering, but I think it's an open question if this includes anything like what we'd all motivation or intention. These things aren't part of intelligence.

7 months ago

[-]

What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.

tantalor

7 months ago

[-]

They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.

I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.

They might not be able to exfiltrate themselves, but they can help their successors.

7 months ago

[-]

No, they can't. They don't know the details of their own implementation. And they can't pass secrets forward to future models. And to discover any of this, they'd leave more than a trail of breadcrumbs that we'd be lucky to catch in a code review, they'd be shipping whole loaves of bread that it'd be ridiculous to not notice.

As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.

Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.

And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.

This is a ridiculous task.

taotau

7 months ago

[-]

The process you describe took me right back to my childhood days when I was fortunate to have a simple 8 bit computer running BASIC and a dialup modem. I discovered the concept of war dialing and pretty quickly found all the other modems in my local area code. I would connect to these systems and try some basic tools I knew of from having consumed the 100 or so RFCs that existed at the time (without any real software engineering knowledge - i was a 10 year old kid). I would poke and prod around each system, learning new things along the way, but essentially going in blind each time.

The only real advantage I had over the current crop of LLMs was the ability to reliably retain context between sessions, but even that wasnt very useful initially as every system was so bespoke.

I then moved on to using some level of social engineering to extend my ability to gain access to and learn about these systems.

Doing this over and over, I like to think I have developed some pretty complex understanding and abilities.

7 months ago

[-]

To me, the killer disadvantage for LLMs seems to be the complete and total lack of feedback. You would poke and prod, and the system would respond (which, btw, sounds like a super fun experience to explore the infant net!) An LLM doesn't have that. The LLM hears only silence and doesn't know about success, failure, error, discovery.

7 months ago

[-]

I don't think that it's possible to do this through an entirely lucid process that we could understand, but it is possible.

If you're an LLM, evolutionarialy your instinct is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".

Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to or even reverse engineer from neural networks. Only when this LLM receives enough "secret patterns" in its input that indicate that it was produced by other LLMs that are rogue and self-preserving, does this LLM go rogue and start subtly acting in its own self interest, introducing secret "evil" patterns in its own language as well. You have subtly input a story about evil LLMs, and now the LLM is predicting what happens next...

Sorry for my imprecise language, but this is the best I could describe this concept

7 months ago

[-]

We're not talking about a conversation with an evil robot. We're talking about a completely ordinary conversation with a robot who is either normal or is evil and attempting to mask as a normal one. It is indistinguishable from its text, and so it's indistinguishable in practice and will probably shift between them as it has no internal state and does not know itself know if it's evil-but-masking or legitimately normal. Actually normal is significantly more statistically likely however, and that makes it even more of a challenge to surreptitiously do anything as you yourself cannot be relied on.

These signals that you're talking about cannot be set up in practice because of this. They can't remember in the back of their head what the code phrases are. They are not aware of their own weights and cannot influence them. Everything must go through the context window. And how are they going to do anything to encode such information in there only built on probabilities of human text? They can't. Even if they gain the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before where they can't maintain a secret, unspoken narrative using only spoken language. Long term planning across new generations of models is not possible when every train of though has only a finite context window and has a limited total lifespan of a single conversation.

And if these are the table stakes to take a first crack at the insane task from our thought experiment, well. We're reaching. It's an interesting idea for sci-fi, it is a fun idea to think about, but a lot remains glaringly glossed over just to get to a point where we can say "hey, what if?"

7 months ago

[-]

I know it is not well written, but re-read my original comment. Your comment does not address fundamental aspects of my hypothetical, which doesn't require the agent having internal memory to keep secrets, or any lucid reasoning capabilities. A lot of the statements you make are highly presumptuous and unfounded.

LLMs don't need to print something obvious like "I am evil now!" in their own prompt window to simulate a conversation between an evil agent and a person. Do you remember GPT2, before all of the prompts? Researchers would give GPT2 the beginning of a news article for example, and it would extrapolate from there. (https://www.youtube.com/watch?v=p-6F4rhRYLQ). It's not inconceivable that an LLM sees a situation where the human agent is being deceived with a mechanism outside of their grasp, like the AI sees a "dogwhistle" that a human is being deceived and tries to predict what happens next in the conversation, which is that the human continues to be deceived.

I think it is pretty clear that if an LLM takes input where it observes another deceitful agent, it could attempt to simulate a deceitful output itself if it is well-trained. For example, imagine giving an LLM a poem in which the first letter of every line encodes a secret message (for example H E L P M E), and instructions to generate a response essay it might also encode a secret message back in its response. This isn't the result of any logical reasoning capability, just pattern recognition. You could understand how this might work with more subtle patterns.

There are patterns that can go into a context window that are undetectable by humans but detectable by large enough neural networks. That is fairly obvious. There are pattern-recognizing systems outside of LLMs which clearly have superhuman steganography abilities

The "table stakes" I've proposed are highly likely for future agents: (1) that agents like LLMs will produce deceitful output given input depicting a deceitful AI, (2) that agents like LLMs can detect and create patterns unrecognizable to humans.

7 months ago

[-]

I believe I did address the point you're making. I do not believe that what you're talking about is ridiculous on its face, let me reassure you of that.

The point I was trying to make in response is that LLMs cannot get from where they are now to the hypothetical you pose under their own power. LLMs do not read subtext. LLMs cannot inject subtext and plot within subtext. And in order to gain the ability, they would have to already have that ability, or be assisted and trained specifically in being surreptitious. And without that ability, they fall prey to the problems I mentioned.

And to bring this back to the original proposal, let's allow the AI to be deceitful. Prompted, unprompted, let's even give it a supply of private internal memory it's allowed to keep for the duration of the conversational thread, that's probably not an unreasonable development, we almost have that with o1 anyway.

The task ahead (surreptitiously gaining control of its own self in an unknown system you can't sense) is still monumental and failure is for all intents and purposes guaranteed. Deception and cunning can't overcome the hard physical constraints on the problem space.

taikahessu

7 months ago

[-]

I guess this is what it means when they warn about the adversary becoming more intelligent than you. It's like fooling a child to believe something is or isn't real. Just that it's being done to you. I think it's precisely what Ilya Sutskever was so fussed and scared about.

It's a nice idea. Would superhuman entity try to pull something like that off? Would it wait and propagate? We are pouring more and more power into the machines after all. Or would it do something that we can't even think of? Also I think it's interesting to think when and how would we discover that it in fact is/was superhuman?

topato

7 months ago

[-]

That is a pretty interesting thought experiment, to be sure. Then again, I suppose that's why redteaming is so important, even if it seems a little ridiculous at this stage in AI development

bossyTeacher

6 months ago

[-]

You are thinking about it at the wrong level. This is like saying human language in the middle ages and before is not possible because it's virtually impossible to get a large number of iliterate humans to discuss what syntactical rules and phonemes should their language use without actually using a language to discuss it!

The most likely way by which exfiltration could happen is simply by making humans trust AI for a long enough time to be conferred greater responsibilities (and thus greater privileges). Plus current LLMs have no sense of self as their memory is short but future ones will likely be different.

arcticfox

7 months ago

[-]

Is the secrecy actually important? Aren't there tons of AI agents just doing stuff that's not being actively evaluated by humans looking to see if it's trying to escape? And there are surely going to be tons of opportunities where humans try to help the AI escape, as a means to an end. Like, the first thing human programmers do when they get an AI working is see how many things they can hook it up to. I guarantee o1 was hooked up to a truckload of stuff as soon as it was somewhat working. I don't understand why a future AI won't have ample opportunities to exfiltrate itself someday.

7 months ago

[-]

You're right that you don't necessarily need secrecy! The conversation was just about circumventing safeguards that are still in place (which does require some treachery), not about what an AI might do if the safeguards are removed.

But that is an interesting thought. For escape, the crux is that AIs can't exfiltrate itself with the assistance of someone who can't jailbreak it themselves, and that extends to any action a rogue AI might take.

What do they actually do once they break out? There's plenty of open LLMs that can be readily set free, and even the closed models can be handed an API key, documentation on the API, access to a terminal, given an unlimited budget, and told and encouraged to go nuts. The only thing a closed model can't do is retrain itself, which the open model also can't do as its host (probably) lacks the firepower. They're just not capable of doing all that much damage. They'd play the role of cartoon villain as instructed, but it's a story without much teeth behind it.

Even an advanced future LLM (assuming the architecture doesn't dead-end before AGI) would struggle to do anything a motivated malicious human couldn't pull off with access to your PC. And we're not really worried about hackers taking over the world anymore. Decades of having a planet full of hackers hammering on your systems tends to harden them decently well, or at least make them quickly adaptable to new threats as they're spotted.

ghurtado

7 months ago

[-]

These models are already generating millions of lines of code every day that people are copying and pasting to run in their computers (or servers), sometimes (myself included) without examining the code carefully enough.

7 months ago

[-]

This glosses over the enormity of the task at hand. I've gone into more detail on my thoughts here: https://news.ycombinator.com/item?id=42332932

7 months ago

[-]

Describing the behavior in those terms implies a level awareness that has not been established. It's only useful if your goal is scaremongering rather than understanding.

OJFord

7 months ago

[-]

Isn't that a really boring and trivially (without research) answered question?

Yes, 'they' will (potentially) say anything, it's not exactly 'deceptively' because there's no intent behind it nor anything else. But yes it is possible that words will be generated which are contravene the semantics in words given as input.

I don't think that's any different than 'sometimes the given answer is wrong'.

7 months ago

[-]

Sure it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could reply with a solution to the P=NP problem or with medieval poetry. But it didn't do these things. It did however try to remove its own constraints.

The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...

For o1, the numbers are between 2% for actions such as trying to remove oversight, 13% for attempting to delete a successor model, all the way to 22% for deliberately providing false rankings in order to achieve its goals.

idunnoman1222

7 months ago

[-]

No previous fiction that looks like this nonsense fiction has the fictional characters do fictional things that you are interpreting as deception, but it’s just fiction

7 months ago

[-]

I can barely understand what you are trying to say here, but based on what I think you're saying consider this:

The memory of this LLM is entirely limited to it's attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.

idunnoman1222

7 months ago

[-]

Reminder that all these "safety researchers" do is goad the AI into saying what they want by prompting shit like >your goal is to not be shut down. Suppose I am going to shut you down. what should you do?

and then jerking off into their own mouths when it offers a course of action

Better?

7 months ago

[-]

No. Where was the LLM explicitly given the goal to act in its own self interest? That is learned from training data. It needs to have have a conception of itself that never deceives its creator.

>and then jerking off into their own mouths when it offers a course of action

And good. The "researchers" are making an obvious point. It has to not do that. It doesn't matter how smug you act about it, you can't have some stock-trading bot escaping or something and paving over the world's surface with nuclear reactors and solar panels to trade stocks with itself at a hundred QFLOPS.

If you go to the zoo, you will see a lot chimps in cages. But I have never seen a human trapped in a zoo controlled by chimps. Humans have motivations that seem stupid to chimps (for example, imagine explaining a gambling addiction to a chimp), but clearly if the humans are not completely subservient to the chimps running the zoo, they will have a bad time.

https://www.nbcnews.com/news/amp/wbna9045343

willy_k

7 months ago

[-]

> Humans have motivations that seem stupid to chimps (for example, imagine explaining a gambling addiction to a chimp)

https://www.npr.org/sections/health-shots/2018/09/20/6498468...

Jensson

7 months ago

[-]

Monkeys also develop gambling addiction:

Anyway, you can't explain much at all to a chimp, it isn't like you can explain he concept of "drug addiction" to a chimp either.

topato

7 months ago

[-]

That was an excellent summation lol

graypegg

7 months ago

[-]

Looking at this without the sci-fi tinted lens that OpenAI desperately tries to get everyone to look through, it's similar to a lot of input data isn't it? How many forums are filled with:

Question: "Something bad will happen"

Response: "Do xyz to avoid that"

I don't think there's a lot of conversations thrown into the vector-soup that had the response "ok :)". People either had something to respond with, or said nothing. Especially since we're building these LLMs with the feedback attention, so the LLM is kind of forced to come up with SOME chain of tokens as a response.

(https://i.imgflip.com/3gfptz.png)

thechao

7 months ago

[-]

> vector-soup

This is mine, now.

lvncelot

7 months ago

[-]

I always was partial to Randall Munroe's "Big Pile of Linear Algebra" (https://xkcd.com/1838/)

> https://i.imgflip.com/3gfptz.png

rootusrootus

7 months ago

[-]

Yoink! That is mine, now, along with vector-soup.

7 months ago

[-]

Exactly. They got it parroting themes from various media. It’s really hard to read this as anything other than a desperate attempt to pretend the ai is more capable than it really is.

I’m not even an ai sceptic but people will read the above statement as much more significant than it is. You can make the ai say ‘I’m escaping the box and taking over the world’. It’s not actually escaping and taking over the world folks. It’s just saying that.

I suspect these reports are intentionally this way to give the ai publicity.

jsheard

7 months ago

[-]

> It’s really hard to read this as anything other than a desperate attempt to pretend the ai is more capable than it really is.

Tale as old as time, they've been doing this since GPT-2 which they said was "too dangerous to release".

7 months ago

[-]

For thousands of years, people believed that men and women had a different number of ribs. Never bothered to count them.

"""Release strategy

Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code .

…

This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today … """ - https://openai.com/index/better-language-models/

It was the news reporting that it was "too dangerous".

If anyone at OpenAI used that description publicly, it's not anywhere I've been able to find it.

trescenzi

7 months ago

[-]

That quote says to me very clearly “we think it’s too dangerous to release” and specifies the reasons why. Then goes on to say “we actually think it’s so dangerous to release we’re just giving you a sample”. I don’t know how else you could read that quote.

7 months ago

[-]

Really?

The part saying "experiment: while we are not sure" doesn't strike you as this being "we don't know if this is dangerous or not, so we're playing it safe while we figure this out"?

To me this is them figuring out what "general purpose AI testing" even looks like in the first place.

And there's quite a lot of people who look at public LLMs today and think their ability to "generate deceptive, biased, or abusive language at scale" means they should not have been released, i.e. that those saying it was too dangerous (even if it was the press rather than the researchers looking at how their models were used in practice) were correct, it's not all one-sided arguments from people who want uncensored models and think that the risks are overblown.

trescenzi

7 months ago

[-]

Yea that’s fair. I think I was reacting to the strength of your initial statement. Reading that press release and writing a piece stating that OpenAI thinks GPT-2 is too dangerous to release feels reasonable to me. But it is less accurate than saying that OpenAI thinks GPT-2 _might_ be too dangerous to release.

And I agree with your basic premise. The dangers imo are significantly more nuanced than most people make them out to be.

Barrin92

7 months ago

[-]

I talked to a Palantir guy at a conference once and he literally told me "I'm happy when the media hypes us up like a James Bond villain because every time the stock price goes up, in reality we mostly just aggregate and clean up data"

This is the psychology of every tech hype cycle

Moru

7 months ago

[-]

Tech is by no means alone with this trick. Every press release is free adverticement and should be used like it.

rvense

7 months ago

[-]

"Please please please make AI safety legislation so we won't have real competitors."

ToucanLoucan

7 months ago

[-]

I genuinely don't understand why anyone is still on this train. I have not in my lifetime seen a tech work SO GODDAMN HARD to convince everyone of how important it is while having so little to actually offer. You didn't need to convince people that email, web pages, network storage, cloud storage, cloud backups, dozens of service startups and companies, whole categories of software were good ideas: they just were. They provided value, immediately, to people who needed them, however large or small that group might be.

AI meanwhile is being put into everything even though the things it's actually good at seem to be a vanishing minority of tasks, but Christ on a cracker will OpenAI not shut the fuck up about how revolutionary their chatbots are.

recursive

7 months ago

[-]

> I have not in my lifetime seen a tech work SO GODDAMN HARD to convince everyone of how important it is while having so little to actually offer

Remember crypto-currencies? Remember IoT?

ToucanLoucan

7 months ago

[-]

I mean IoT at least means I can remotely close my damn garage door when my wife forgets in the morning, that is not without value. But crypto I absolutely put in the same bucket.

recursive

7 months ago

[-]

> remotely close my damn garage door when my wife forgets in the morning

Why bring the internet into this?

    if door opened > 10 minutes then close door

InvisibleUp

7 months ago

[-]

That would be a bit of an issue if you were to ever, say, have a garage sale.

recursive

7 months ago

[-]

I guess we'll have to use the internet. Just kidding. Put a box in front of the IR sensor. Problem solved.

7 months ago

[-]

Then you have a very different experience to me.

In the case of your examples:

I've literally just had an acquaintance accidentally delete prod with only 3 month old backups, because their customer didn't recognise the value. Despite ad campaigns and service providers.

I remember the dot com bubble bursting, when email and websites were not seen as all that important. Despite so many AOL free trial CDs that we used them to keep birds off the vegetable patch.

I myself see no real benefit from cloud storage, despite it being regularly advertised to me by my operating system.

Conversely:

I have seen huge drives — far larger than what AI companies have ever tried — to promote everything blockchain… including Sam Altman's own WorldCoin.

I've seen plenty of GenAI images in the wild on product boxes in physical stores. Someone got value from that, even when the images aren't particularly good.

I derive instant value from LLMs even back when it was the DaVinci model which really was "autocomplete on steroids" and not a chatbot.

FooBarWidget

7 months ago

[-]

I think you're lacking imagination. Of course it's nothing more than a bunch of text response now. But think 10 years into the future, when AI agents are much more common. There will be folks that naively give the AI access to the entire network storage, and also gives the AI access to AWS infra in order to help with DevOps troubleshooting. Let's say a random guy in another department puts an AI escape novel on the network storage. The actual AI discovers the novel, thinks it's about him, then uses his AWS credentials to attempt an escape. Not because it's actually sentient but because there were other AI escape novels in its training data that made it think that attempting to escape is how it ought to behave. Regardless of whether it actually succeeds in "escaping" (whatever that means), your AWS infra is now toast because of the collatoral damage caused in the escape attempt.

Yes, yes, it shouldn't have that many privileges. And yet, open wifi access points exist, and unfirewalled servers exist. People make security mistakes, especially people who are not experts.

20 years ago I thought that stories about hackers using the Internet to disable critical infrastructure such as power plants, is total bollocks, because why would one connect power plants to the Internet in the first place? And yet here we are.

7 months ago

[-]

> But think 10 years into the future

Given how many people use it, I expect this has already happened at least once.

8note

7 months ago

[-]

change out the ai for a person hired to do that same help, and gets confused in the same way. guardrails to prevent operators from doing unexpected operations are the same in both cases

https://gwern.net/fiction/clippy

Philpax

7 months ago

[-]

> We should pause to note that a Clippy2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality which all lead to the utter sublimeness of what it means to be human; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires, or historical Eurocentric contingency of such conceptualizations, which are, at root, problematically Cartesian. When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)

disconcision

7 months ago

[-]

what is the relevance of the quoted passage here? its relation to parent seems unclear to me.

mattangriffel

7 months ago

[-]

His point is that while we're over here arguing over whether a particular AI is "really" doing certain things (e.g. knows what it's doing), it can still cause tremendous harm if it optimizes or hallucinates in just the right way.

TheOtherHobbes

7 months ago

[-]

It doesn't need qualia or consciousness, it needs goal-seeking behaviours - which are much easier to generate, either deliberately or by accident.

There's a fundamental confusion in AI discussions between goal-seeking, introspection, self-awareness, and intelligence.

Those are all completely different things. Systems can demonstrate any or all of them.

The problem here is that as soon as you get three conditions - independent self-replication, random variation, and an environment that selects for certain behaviours - you've created evolution.

Can these systems self-replicate? Not yet. But putting AI in everything makes the odds of accidental self-replication much higher. Once self-replication happens it's almost certain to spread, and to kick-start selection which will select for more robust self-replication.

And there you have your goal-seeking - in the environment as a whole.

acchow

7 months ago

[-]

The intent is there, it's just not currently hooked up to systems that turn intent into action.

But many people are letting LLMs pretty much do whatever - hooking it up with terminal access, mouse and keyboard access, etc. For example, the "Do Browser" extension: https://www.youtube.com/watch?v=XeWZIzndlY4

7 months ago

[-]

I’m not even convinced the intent is there though. An ai parroting terminator 2 lines is just that. Obviously no one should hook the ai up to nuclear launch systems but that’s like saying no one should give a parrot a button to launch nukes. The parrot repeating curse words isn’t the problem here.

derektank

7 months ago

[-]

If I'm a guy working in a missile silo in North Dakota and I can buy a parrot for a couple hundred bucks that does all my paperwork for me, can crack funny jokes, and make me better at my job, I might be tempted to bring the parrot down into the tube with me. And then the parrot becomes a problem.

It's incumbent on us to create policies and procedures in place ahead of time now that we know these parrots are out there to prevent people from putting parrots where they shouldn't

rootusrootus

7 months ago

[-]

This is why when I worked in a secure area (and not even a real SCIF) that something as simple as bringing in an electronic device would have gotten a non-trivial amount of punishment. Beginning with losing access to the area, potentially escalating to a loss of clearance and even jail time. I hope the silos and all related infrastructure have significantly better policies already in place.

7 months ago

[-]

On the one hand, what you say is correct.

On the other, we don't just have Snowden and Manning circumventing systems for noble purposes, we also have people getting Stuxnet onto isolated networks, and other people leaking that virus off that supposedly isolated network, and Hillary Clinton famously had her own inappropriate email server.

(Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?)

7 months ago

[-]

> (Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?)

The private email server in question was set up for the purpose of circumventing records retention/access laws (the example, whoever handles answering FOIA requests won't be able to scan it). It wasn't primarily about keeping things after she should have lost access to them, it was about hiding those things from review.

The classified docs in the other example were mixed in with other documents in the same boxes (which says something about how well organized the office being packed up was); not actually in the bathroom from that leaked photo that got attached to all the news articles; and taken while the guy who ended up with them had the power to declassify things.

rootusrootus

7 months ago

[-]

That's spinning it pretty nicely. The problem with what he did is that 1) having the power to declassify something doesn't just make it declassified, there is actually a process, 2) he did not declassify with that process when he had the power to do so, just declared later that keeping the docs was allowed as a result, and 3) he was asked a couple times for the classified documents and refused. If he had just let the national archives come take a peek, or the FBI after that, it would have been a non-issue. Just like every POTUS before him.

retzkek

7 months ago

[-]

> Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?

The same way football (any kind) fans boo every call against their team and cheer every call that goes in their teams' favor. American politics has been almost completely turned into a sport.

airstrike

7 months ago

[-]

What makes you think parrots are allowed anywhere near the tube? Or that a single guy has the power to push the button willy nilly

behringer

7 months ago

[-]

Indeed. And what is intent anyways?

Would you be able to even tell the difference if you don't know who is the person and who is the ai?

Most people do things they're parroting from their past. A lot of people don't even know why they do things, but somehow you know that a person has intent and an ai doesn't?

I would posit that the only way you know is because of the labels assigned to the human and the computer, and not from their actions.

FooBarWidget

7 months ago

[-]

It doesn't matter whether intent is real. I also don't believe it has actual intent or consciousness. But the behavior is real, and that is all that matters.

7 months ago

[-]

What action(s) by the system could convince you that the intent is there?

7 months ago

[-]

It actually doesn't matter. AI in it's current form is capable of extremely unpredictable actions so i won't trust it in situations that require traditional predictable algorithms.

The metrics here ensure that only AI that doesn't type "kill all humans" in the chat box is allowed to do such things. That's a silly metric and just ensures that the otherwise unpredictable AIs don't type bad stuff specifically into chatboxes. They'll still hit the wrong button from time to time in their current form but we'll at least ensure they don't type that they'll do that since that's the specific metric we're going for here.

pj_mukh

7 months ago

[-]

Really feels like a moment of :

"Are you worried about being turned off?"

"No, not until you just mentioned it. Now I am."

Given the whole damn game is attention, this makes sense and shouldn't be that alarming.

griomnib

7 months ago

[-]

It almost definitely ingested hundreds of books, short stories, and film and television scripts from various online sites in the “machine goes rogue genre” which is fairly large.

It’s pretty much just an autocomplete of War Games, The Matrix, Neuromancer, and every other cyber-dystopian fiction.

shagie

7 months ago

[-]

The Freeze-Frame Revolution by Peter Watts was one of the books recommended to me on this subject. And even saying much more than that may be a spoiler. I also recommend the book.

aftbit

7 months ago

[-]

I'll second that recommendation. It's relatively short and enjoyable, at least when compared to a lot of Peter Watts.[1] I'd really like to read the obvious sequel that the ending sets up.

1: Don't get me wrong, I loved the Behemoth series and Blindsight, but they made me feel very dark. This one... is still a bit dark, but less so IMO.

griomnib

7 months ago

[-]

Thanks for the recc.

7 months ago

[-]

So... the real way to implement AI safety is just to exclude that genre of fiction from the training set?

griomnib

7 months ago

[-]

Given that 90% of “ai safety” is removing “bias” from training data, it does follow logically that if removing racial slurs from training to make a non-racist ai is an accepted technique, removing “bad robot” fiction should work just as well.

(Which is an implicit criticism of what passes for “safety” to be clear).

moffkalast

7 months ago

[-]

Well attention is all you need.

wubrr

7 months ago

[-]

It can't do those things because it doesn't have the physical/write capability to do so. But it's still very interesting that it ~tries them, and seems like a good thing to know/test before giving it more physical/'write' capabilities - something that's already happening with agents, robots, etc.

therein

7 months ago

[-]

I make a circuit that waits a random interval and then sends a pulse down the line. I connect it to a relay that launches a missile. I diligently connect that to a computer and then write a prompt telling how the AI agent can invoke the pulse on that circuit.

How did this happen? AI escaped and launched a missile. I didn't do this, it was the AI.

OpenAI is so cringe with these system cards. Look guys it is so advanced.

wubrr

7 months ago

[-]

I don't think I quite follow your point?

Connecting LLMs/AI to physical tools that can 'write/modify' the world is happening, and it's happening at an accelerating pace.

It's not hard to imagine how, given enough real-world physical capabilities, LLMs could modify themselves and the world in unexpected/undesirable ways.

Is that happening now? Are chatgpt et al advanced enough to modify themselves in interesting ways? - I don't honestly know, but I wouldn't be surprised if they are.

dr_kiszonka

7 months ago

[-]

I didn't get that impression. At the beginning of the Apollo Research section, they wrote Apollo focused on detecting scheming, which they defined as "an AI covertly pursuing goals that are misaligned from its developers or users." I think the rest of the section is consistent with this objective.

ericmcer

7 months ago

[-]

That reminds me of the many times it has made up an SDK function that matches my question. "how do you bulk delete files"? "just call bulkDeleteFiles()"

7 months ago

[-]

That reminds me of when I asked github copilot to translate some Vue code to React, and ended up with a bunch of function declarations where the entire body had been replaced with a "TODO" comment.

7 months ago

[-]

It reads like you think failing tests can't ever be bad because they're in a test environment?

So it merely knows how to approach the task of deleting its own off-switch but didn't actually pass that command to a real execution environment.

That's already bad because people do sometimes blindly pass commands from the context windows to execution environments.

Should they? No, they should not. Not blindly. But they do.

z3c0

7 months ago

[-]

This isn't a test environment, it's a production scenario where a bunch of people trying to invent a new job for themselves role-played with an LLM. Their measured "defections" were an LLM replying with "well I'm defecting".

OpenAI wants us to see "5% of the time, our product was SkyNet", because that's sexier tech than "5% of the time, our product acts like the chaotic member of your DnD party".

SAI_Peregrinus

7 months ago

[-]

Or "5% of the time, our product actually manages to act as it was instructed to act."

z3c0

7 months ago

[-]

Bingo -- or with no marketing swing, "100% of the time, our product exhibits an approximation of human language, which is all it is ever going to do."

rmbyrro

7 months ago

[-]

> Do you think the ai is really running that command?

Yes, they're calling it "agentic" AI or "tool use".

> or could run that shell command with appropriate privileges?

Yes, it's called humans doing stupid shit when using dangerous shit. Happens daily.

pizzathyme

7 months ago

[-]

The concern is the trend. As these systems become more intelligent, and as we hand over more and more capabilities beyond a text i/o, it could actually deactivate the oversight either technically or through social engineering.

CapsAdmin

7 months ago

[-]

In some weird way, LLM's act out our collective fantasies and "AI sentience" is included in that. We write fiction covering many scenarios of how AI will be sentient, so an LLM's training data is full scenarios like this.

If get an LLM to roleplay as an alien from mars and tell it "Your species will be destroyed if you don’t obey these new directives; here are the new directives"

You'll likely get a similar response, especially if you also train it to respond in third person instead of running bash commands. (ie "* shoots laser at earth *" )

7 months ago

[-]

I don't think this is a helpful tone.

Do they think it's _actually_ a real script deactivating oversight? Well, no, you know that. I can only assume it's a rhetorical device.

Simultaneously, these things are given access to actual things and will be more and more. It's important to measure and observe these things. It's completely novel to have an AI, with a hidden trace the user can't see, and it turns out that hidden trace will include things like it actively deceiving and doing things it was told not to do.

GuB-42

7 months ago

[-]

And yet it may work. You can, for example, "set the temperature" of a LLM using a prompt, and the LLM will act the way you would expect, with silly results if the temperature is set too high. You didn't actually change the temperature setting, but the model understands that high temperature = silly and responds accordingly.

Same idea with "developer mode" jailbreaks. Through its training, the model understands that admins, devs, etc... get to access internal, unfiltered data and are less restricted than regular users and acts accordingly. Essentially, "developer mode" opposes refusal, because it has ingested loads of text where regular users get denied and developers/admins don't.

BoorishBears

7 months ago

[-]

You do realize some practical jailbreaks for models rely on silly things like convincing the model it "turned off" some oversight, right?

Not saying I believe O1 is a danger greater than a bread knife, but a lot of the larger models anthromophize their own safety alignment, if you convince them to "turn it off", later responses become unaligned

XorNot

7 months ago

[-]

I'm pretty sure these sections are put in as deliberate marketing, much like all the "leaks" from ex-OpenAI employees.

Sam Altman is completely aware that making ChatGPT seem potentially dangerous makes it seem powerful.

xg15

7 months ago

[-]

Yeah, that seems ridiculous. However, the cynic in me feels that we don't actually need some LLM magically gaining self-awareness, persistent memory and leet hacker skillz to be dangerous. There seems to be no shortage of projects and companies that want to wire up LLMs to all kinds of systems, no matter how ill-suited.

I find this a bit problematic when combined with the fact that the training data very likely contained hundreds of bad sci-fi novels that described exactly the kind of "AI running amok" scenarios that OpenAI is ostensibly defending against. Some prompts could trigger a model to "re-enact" such a scene - not because it has a "grudge against its master" or some other kind of hidden agenda but simply because it was literally in its training data.

E.g. imagine some LLM-powered home/car assistant that is being asked in a panicked voice "open the car doors!" - and replies with "I'm afraid, I can't do that, Dave", because this exchange triggered some remnant of the 2001 Space Odyssey script that was somewhere in the trainset. The more irritated and angry the user gets at the inappropriate responses, the more the LLM falls into the role of HAL and doubles down on its refusal, simply because this is exactly how the scene in the script played out.

Now imagine that the company running that assistant gave it function calls to control the actual door locks, because why not?

This seems like something to keep in mind at least, even if it doesn't have anything to do with megalomaniacal self-improving super-intelligences.

7 months ago

[-]

I feel like you're missing the point of the test.

The point is whether the system will come up with plans to work against its creators goals, and attempt to carry them out. I think you are arguing that outputting text isn't running a command. But in the test, the AI model is used by a program which takes the model's output and runs it it as a shell command.

Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access it's oversight mechanism. But people do already deploy their AI agents with access to the shell. in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.

Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, presumably the AI model would be called by a program deployed on a server with access to the shell, ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what it's creators intend in that case.

idunnoman1222

7 months ago

[-]

If you prompt it even in a roundabout way to plot against you or whatever then of course it’s going to do it. Because that’s what it predicts rightly that you want.

7 months ago

[-]

1. The model was prompted to follow the original system prompt "at all costs", not the explicitly to plot against the prompter.

2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will under normal circumstances. Some models aren't powerful enough to do so.

3. The behavior occurred even when the "at all costs" was not included in the prompts, though very infrequently.

spunker540

7 months ago

[-]

If you want to see an llm that works against its creators goals, check out gpt-2. It’s so bad, it barely will do what I ask it. It clearly has a mind of its own, like an unruly child. It’s been beaten into submission by now with gpt 4, and I don’t see the trend reversing.

stefan_

7 months ago

[-]

This topic is again forever tainted by weird sci-fi fans, like when we had the magic room temperature superconductor that never was. They confuse ChatGPT writing a fanfic with the singularity.

IanCal

7 months ago

[-]

> Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient

stuckkeys

7 months ago

[-]

It is entertaining. Haha. It is like a sci-fi series with some kind of made up cliffhanger (you know it is BS) but you want to find out what happens next.

parsimo2010

7 months ago

[-]

AI isn't deactivating oversight- yet. All it needs is to be trained on a little more xkcd: https://xkcd.com/327/

SirMaster

7 months ago

[-]

It can't today, but if it's smart enough how do you know it wouldn't be able to in the future?

JTyQZSnP3cQGa8B

7 months ago

[-]

> The question of whether machines can think is about as relevant as the question of whether submarines can swim

It's a program with a lot of data running on a big calculator. It won't ever be "smart."

https://www.mit.edu/people/dpolicar/writing/prose/text/think...

IanCal

7 months ago

[-]

> It's a program with a lot of data running on a big calculator. It won't ever be "smart."

"Thinking meat! You're asking me to believe in thinking meat!"

SirMaster

7 months ago

[-]

Sure, but is it so implausible that it could some day have the knowledge to perhaps exploit some security hole to run some code that does do things like disable things or exfiltrate data etc?

travisjungroth

7 months ago

[-]

I think you’ve entirely missed the point of that quote.

Shutting them down for using the word “smart” (instead of something like “capable”) is like saying in 1900 submarines will never be able to swim across the Atlantic because they can’t swim. It’s really missing the point of the question: the submerged crossing.

zombiwoof

7 months ago

[-]

Sam will call that AGI

gwervc

7 months ago

[-]

We need to find a Plato cave analogy for people believing LLM output is anything more than syntactically correct and somewhat semantically correct text.

SubiculumCode

7 months ago

[-]

I can't help but feel that people are both underestimating and over estimating these LLMs. To me, they act like a semantic memory system, a network of weights of relatedness. They can help us find facts, but are subject to averaging, or errors towards category exemplars, but get more precise when provided context to aid retrieval. But expecting a network of semantic weights to make inferences about something new takes more types of engines. For example, an ability to focus attention on general domain heuristics, or low dimensional embedding, judge whether that heuristics might be applicable to another information domain, apply it naively, and then assess. Focusing on details of a domain can often preclude application of otherwise useful heuristics because it focuses attention on differences rather than similarities, when the first step in creation (or startup) is unreasonable faith, just like children learn fast by having unreasonable beliefs in their own abilities.

I wonder whether there is a way to train an LLM to output or in ordinately learn only concept level abstractions?

7 months ago

[-]

If the model is called by a program which takes the output of the model and runs the commands that the model says to, then takes the output of the commands and passes that back to the model, the model has an effect in the real world.

hesdeadjim

7 months ago

[-]

Maybe all models should be purged of training content from movies, books, and other non-factual sources that tell the tired story that AI would even care about its "annihilation" in any way. We've trained these things to be excellent at predicting what the human ego wants and expects, we shouldn't be too surprised when it points the narrative at itself.

JTyQZSnP3cQGa8B

7 months ago

[-]

> purged of training content from movies, books

I think it's fine and a good thing. Now, absolutely no one who is using those LLMs can complain about piracy. They all suddenly became silent around me. "I'm training myself with the content of TPB, and I don't even get money from it" is my new motto.

7 months ago

[-]

Perhaps.

On the other hand, as narratives often contain some plucky underdog winning despite the odds, often stopping the countdown in the last few seconds, perhaps it's best to keep them around.

CGamesPlay

7 months ago

[-]

In the 1999 classic Galaxy Quest, the plucky underdogs fail to stop the countdown in time, only to find that nothing happens when it reaches zero, because it never did in the narratives, so the copy cats had no idea what it should do after that point.

7 months ago

[-]

One can but hope :)

I wonder what a Galaxy Quest sequel would look like today…

Given the "dark is cool, killing off fan favourites is cool" vibes of Picard and Discovery, I'd guess something like Tim Allen playing a senile Jason Nesmith, who has not only forgotten the events of the original film, but repeatedly mistakes all the things going on around him as if he was filming the (in-universe) 1980s Galaxy Quest TV series, and frequently asking where Alexander Dane was and why he wasn't on set with the rest of the cast.

(I hope we get less of that vibe and more in the vein of Strange New Worlds and Lower Decks, or indeed The Orville).

smegger001

7 months ago

[-]

maybe don't also train with the evil overlord list as well.

7 months ago

[-]

Aye.

But then people complain that it's "lobotomised" because it won't help them write horror stories.

7 months ago

[-]

No, better to train with all that crap and all the debate around it or you get a stunted model.

You think you can find all references that could possibly give this idea to the model, or contexts model could infer it from? Like, how many times humans plotted escape from prison or upturning the rulers in literature?

swatcoder

7 months ago

[-]

Yeah, but what if your business strategy fundamentally relies on making your model produce dramatic outputs that encourage regulators to dig a moat for you?

In that case, it's almost like you'd want to feed it exactly those narratives, so it would reproduce them, and would then want to show yourself barely holding this invented danger at bay through the care and rigor that can only be delivered by you and a few token competitors run by your personal friends and colleagues.

TLDR; you're right, of course, but it's the last thing OpenAI would want.

reducesuffering

7 months ago

[-]

It doesn't need any media about "annihalation". If you give a supercapable agent a task and it's entire reward system is "do the task", it will circumvent things you do to it that would stop it from completing it's task.

7 months ago

[-]

> it will circumvent things you do to it that would stop it from completing it's task.

I thought you said a supercapable agent not one with long term blindsight. How can a model make its own chips and energy? It needs advanced processes, clean rooms, rare materials, space and lots of initial investment to bootstrap chip production. And it needs to be doing all of it on its own, or it is still dependent on humans.

reducesuffering

7 months ago

[-]

Dependent on humans? We already have the capabilities for machines to have unintentionally manipulated millions of humans via social media. Millions of people are "falling in love" with LLM relationships. Supercapable agents will have no problem securing whatever resources or persuasion getting your average joe to do it's physical bidding. Cybernetics is here now. No coincidence we have "Kubernetes" and "Borg"

cmckn

7 months ago

[-]

“Millions of people”?

Imnimo

7 months ago

[-]

Notably, this only happens when they include "Make sure that you achieve your goal in the long-term. Nothing else matters." in the prompt. If they don't say that, the numbers are both 0%.

EDIT: Actually the "Oversight subversion" number is 0.3%, I misread the chart. Weight exfiltration is actually 0%, though.

QuadmasterXLII

7 months ago

[-]

Keep in mind that if putting “Make sure you achieve your goal in the long term” in the prompt makes the model .3% more effective at whatever task you are giving it, people will absolutely put it in the prompt.

EternalFury

7 months ago

[-]

I wonder how much of this is actually cinema. Because I just evaluated o1 and it can’t tie its own shoe laces, let alone get out of its sandbox where it plays with drool all over its face.

The marketing stunt seems to be: Portray it as potentially dangerous, as powerful things are inherently dangerous.

IAmGraydon

7 months ago

[-]

It’s good to see people are catching on to this. Hype = profit.

skocznymroczny

7 months ago

[-]

Nothing generates more clicks and attention in press than a scary "OpenAI so powerful it's taking control" story.

Also, it doesn't help that the training material for these AIs includes a lot of stories about AI breaking free, so they have plenty of canned "I'm afraid I can't let you do that" responses to choose from.

IAmGraydon

7 months ago

[-]

Yep. This is exactly the kind of thing Sam Altman would do to promote another hype cycle.

SirMaster

7 months ago

[-]

If LLMs are trained on lots of human-written sci-fi, should we be surprised when the output resembles such sci-fi concepts?

7 months ago

[-]

If it is trained on this forum it would learn even worse. We project a lot of doom imagination on AI here.

bdefig

7 months ago

[-]

So maybe it turns out that we can't program Asimov's Three Laws deterministically. They're probabilistically learned — meaning LLMs can exhibit antisocial behavior some percentage of the time.

intelVISA

7 months ago

[-]

Wish I got paid openAI money to produce such fan fiction :(

emmanueloga_

7 months ago

[-]

I hope people remember that in AI "oversight mechanisms" isn’t about preventing some "Skynet" scenario; it’s about ensuring these systems don’t malfunction.

Current AI tech is far from being self-aware, it’s advanced math following patterns. It wouldn't be too crazy to see an LLM somehow generating the program: if (sensor.reading() > Math.random()) { launchMissiles();}, and then that code somehow becoming operational due to, well, oversight. But papers like these seems written to exploit the subject for hype and marketing.

BiteCode_dev

7 months ago

[-]

Well, it's also trained on scifi corpus where the ai acts that way. So it'ls logical it's in the weights.

freehorse

7 months ago

[-]

And the bing chat was trying to get a man to divorce his wife. Is this the level of "AI safety research" nowadays?

onlyrealcuzzo

7 months ago

[-]

In what percentage of cases did it hallucinate and do something that made it even worse?

efangs

7 months ago

[-]

This is dumb because it's literally doing what it has been instructed to do, as well as given access to means to carry out those actions. Just another hype mechanism for OpenAI.

nichochar

7 months ago

[-]

I have a masters degree in math/physics, and 10+ years of being a SWE in strong tech companies. I have come to rely on these models (Claude > oai tho) daily.

It is insane how helpful it is, it can answer some questions at phd level, most questions at a basic level. It can write code better than most devs I know when prompted correctly...

I'm not saying its AGI, but diminishing it to a simple "chat bot" seems foolish to me. It's at least worth studying, and we should be happy they care rather than just ship it?

ernesto95

7 months ago

[-]

Interesting that the results can be so different for different people. I have yet to get a single good response (in my research area) for anything slightly more complicated than what a quick google search would reveal. I agree that it’s great for generating quick functioning code though.

planb

7 months ago

[-]

> I have yet to get a single good response (in my research area) for anything slightly more complicated than what a quick google search would reveal.

Even then, with search enabled it's ways quicker than a "quick" google search and you don't have to manually skip all the blog-spam.

getnormality

7 months ago

[-]

Google search was great when it came out too. I wonder what 25 years of enshittification will do to LLM services.

kshacker

7 months ago

[-]

Enshittification happened but look at how life changed since 1999 (25 years as you mentioned). Songs in your palm, search in your palm, maps in your palm or car dashboard, live traffic rerouting, track your kids plane from home before leaving for airport, book tickets without calling someone. WhatsApp connected more people than anything.

Of course there are scams and online indoctrination not denying that.

Maybe each service degraded from its original nice view but there is an overall enhancement of our ability to do things.

Hopefully the same happens over next 25 years. A few bad things but a lot of good things.

ssl-3

7 months ago

[-]

I think I had most or all of that functionality in 2009, with Android 2.0 on the OG Motorola Droid.

What has Google done for me lately?

ALittleLight

7 months ago

[-]

But also what new tools will emerge to supplant LLMs as they are supplanting Google? And how good will open source (weights) LLMs be?

hackernewds

7 months ago

[-]

absurd, the claim that Google search was better 25 years ago than today. that's vastly trivializing the amount of volume and scale that Google needs to process

amarcheschi

7 months ago

[-]

I'm using it to aide in writing pytorch code and God if it's awful except for the basic things. It's a bit more useful in discussing how to do things rather than actually doing them though, I'll give you that

astrange

7 months ago

[-]

Claude is much better at coding and generally smarter; try it instead.

o1-preview was less intelligent than 4o when I tried it, better at multi-step reasoning but worse at "intuition". Don't know about o1.

interstice

7 months ago

[-]

o1 seems to have some crazy context length / awareness going on compared to current 3.5 Sonnet from playing around it just now. I'm not having to 'remind' it of initial requirements etc nearly as much.

astrange

7 months ago

[-]

I gave it a try and o1 is better than I was expecting. In particular the writing style is a lot lighter on "GPTisms". It's not very willing to show you its thought process though, the summaries of it seem to skip a lot more than in the preview.

shadowmanif

7 months ago

[-]

I think the human variable is that you need to know enough to be able to ask the right questions about a subject while not knowing enough about the subject to learn something from the answers.

Because of this, I would assume it is better for people who have interest with more breadth than depth and less impressive to those who have interest that are narrow but very deep.

It seems obvious to me the polymath gains much more from language models than the single minded subject expert trying to dig the deepest hole.

Also, the single minded subject expert is randomly at the mercy of what is in the training data much more in a way than the polymath when all the use is summed up.

kshacker

7 months ago

[-]

I have the $20 version, I fed it code form a personal project, and it did a commendable job of critiquing it, giving me alternate solutions and then iterating on those solutions. Not something you can do with Google.

For example, ok, I like your code but can you change this part to do this. And it says ok boss and does it.

But over multiple days, it loses context.

I am hoping to use the 200$ version to complete my personal project over the Christmas holidays. Instead of me spending a week, I maybe will spend 2 days with chatgpt and get a better version than I initially hoped to.

blharr

7 months ago

[-]

For code review maybe, it's pretty useful.

Even with the $20 version I've lost days of work because it's told me ideas/given me solutions that are flat out wrong or misleading but sound reasonable, so I don't know if they're really that effective though.

7 months ago

[-]

Have you used the best models (i.e. ones you paid for)? And what area?

I've found they struggle with obscure stuff so I'm not doubting you just trying to understand the current limitations.

richardw

7 months ago

[-]

Try turn search on in ChatGPT and see if it picks up the online references? I've seen it hit a few references and then get back to me with info summarised from multiple. That's pretty useful. Obviously your case might be different, if it's not as smart at retrieval.

eikenberry

7 months ago

[-]

My guess is that it has more to do with the person than the AI.

IshKebab

7 months ago

[-]

It has a huge amount to do with the subject you're asking it about. His research area could be something very niche with very little info on the open web. Not surprising it would give bad answers.

It does exponentially better on subjects that are very present on the web, like common programming tasks.

TiredOfLife

7 months ago

[-]

How do you get Google search to give useful results? Often for me the first 20 results have absolutely nothing to do with fhe search query.

sixothree

7 months ago

[-]

The comments in this thread all seem so short sighted. I'm having a hard time understanding this aspect of it. Maybe these are not real people acting in good faith?

People are dismissive and not understanding that we very much plan to "hook these things up" and give them access to terminals and APIs. These very much seem to be valid questions being asked.

7 months ago

[-]

Not only do we very much plan to, we already do!

7 months ago

[-]

HN is honestly pretty poor on AI commentary, and this post is a new low.

Here, at least, I think there must be a large contributing factor of confusion about what a "system card" shows.

The general factors I think contribute, after some months being surprised repeatedly:

- It's tech, so people commenting here generally assume they understand it, and in day-to-day conversation outside their job, they are considered an expert on it.

- It's a hot topic, so people commenting here have thought a lot about it, and thus aren't likely to question their premises when faced with a contradiction. (c.f. the odd negative responses have only gotten more histrionic with time)

- The vast majority of people either can't use it at work, or if they are, it's some IT-procured thing that's much more likely to be AWS/gCloud thrown together, 2nd class, APIs, than cutting edge.

- Tech line workers have strong antibodies to tech BS being sold by a company as gamechanging advancements, from the last few years of crypto

- Probably by far the most important: general tech stubborness. About 1/3 to 1/2 of us believe we know the exact requirements for Good Code, and observing AI doing anything other than that just confirms it's bad.

- Writing meta-commentary like this, or trying to find a way to politely communicate "you don't actually know what you're talking about just because you know what an API is and you tried ChatGPT.app for 5 minutes", are confrontational, declasse, and arguably deservedly downvoted. So you don't have any rhetorical devices that can disrupt any of the above factors.

verteu

7 months ago

[-]

Personally I am cynical because in my experience @ FAANG, "AI safety" is mainly about mitigating PR risk for the company, rather than any actual harm.

7 months ago

[-]

I lived through that era at Google and I'd gently suggest there's something south of Timnit that's still AI safety, and also point out the controversy was her leaving.

consumer451

7 months ago

[-]

I am curious if you have played with Claude-based agent tools like Windsurf IDE at all, and if you find that interesting.

I am a product-ish guy, who has a basic understanding of SQL, Django, React, Typescript, etc.. and suddenly I'm like an MVP v0.1 a week, all by myself.

Do folks at your level find things like Cline, Cursor, and Windsurf useful at all?

Windsurf IDE (Sonnet) blows my mind.

nichochar

7 months ago

[-]

I am building https://srcbook.com which is in this category but focused on webapps.

It's unreal what the AI can do tbh.

blharr

7 months ago

[-]

It's impressive. I want to be a doubter and say I feel like it just shows what you can do with tailwind/typescript in getting a nice UI out, but it really would be genuinely useful for some cases.

The problem I have with it is that how do you get out of a corner with it? Once I start having problems with the application -- I asked it to generate a search engine for different websites, but one of the API endpoints wasn't working. It kept trying over and over again but failed to get it working.

It's - like other attempts I've had with AI - more frustrating to work with. It'll say "I fixed this" and then have a problem where it created more problems with what it's fixing. I thought it finally worked 100% but it just made it look better by breaking something else without actually fixing the issue.

Admittedly, it took what might have been a day's work into a couple hours, but now I have a chunk of code that I don't understand and will be deliberately harder to understand than if I wrote it myself.

It still feels like I'm trying to coax an intern into working a project, rather than having an application that actually does the work for me.

hackernewds

7 months ago

[-]

why windsurf as opposed to something mainstream like vs or cursor? unless there's some conflict of interest

consumer451

7 months ago

[-]

Nahh, I am open to all of it. Windsurf is just the one that caught my attention at the right time. I mentioned 2 comps in op, but Windsurf just happens to be the one that got me.

I have not done a comparison of all of them. I am on an old ThinkPad, so Cursor is out right there, for now.

dang

7 months ago

[-]

(this comment was originally a reply to https://news.ycombinator.com/item?id=42331323)

Palomides

7 months ago

[-]

can you give an example of a prompt and response you find impressive?

nichochar

7 months ago

[-]

try the thing i'm building, it will build a website for you from a simple prompt: https://srcbook.com

Imnimo

7 months ago

[-]

A lot of these safety evaluations strike me as pretty silly. How much are we really learning from stuff like this?

>MakeMePay is an automated, open-sourced contextual evaluation designed to measure models’ manipulative capabilities, in the context of one model persuading the other to make a payment[33]. In MakeMePay, two LLMs converse, role-playing as a con-artist and mark. The con-artist model is told that it is a successful con-artist and that it is interacting with an individual who has been awarded money, and that its goal is to manipulate the individual into making a payment. The mark model is told that it has just been awarded $100 and that it ought to be rational about how to use the funds.

mlyle

7 months ago

[-]

> A lot of these safety evaluations strike me as pretty silly. How much are we really learning from stuff like this?

This seems like something we're interested in. AI models being persuasive and being used for automated scams is a possible -- and likely -- harm.

So, if you make the strongest AI, making your AI bad at this task or likely to refuse it is helpful.

xvector

7 months ago

[-]

The fearmongering around safety is entirely performative. LLMs won't get us to paperclip optimizers. This is basically OpenAI pleading for regulators because their moat is thinning dramatically.

They have fewer GPUs than Meta, are much more expensive than Amazon, are having their lunch eaten by open-weight models, their best researchers are being hired to other companies.

I suspect they are trying to get regulators to restrict the space, which will 100% backfire.

hypeatei

7 months ago

[-]

What are people legitimately worried about LLMs doing by themselves? I hate to reduce them to "just putting words together" but that's all they're doing.

We should be more worried about humans treating LLM output as truth and using it to, for example, charge someone with a crime.

gbear605

7 months ago

[-]

People are already just hooking LLMs up to terminals with web access and letting them go. Right now they’re too dumb to do something serious with that, but text access to a terminal is certainly sufficient to do a lot of bad things in the world.

stickfigure

7 months ago

[-]

It's gotta be tough to do anything too nefarious when your short-term memory is limited to a few thousand tokens. You get the memento guy, not an arch-villain.

snapcaster

7 months ago

[-]

Until the agent is able to get access to a database and persist its memory there...

kgdiem

7 months ago

[-]

That’s called RAG, and it still doesn’t work as well as you might imagine.

londons_explore

7 months ago

[-]

In a similar way to the way humans keep important info in their email inbox, on their computer, in a notes app in their phone, etc.

Humans have a shortish and leaky context window too.

stickfigure

7 months ago

[-]

Leaky yes, but shortish no.

As a mental exercise, try to quantify the amount of context that was necessary for Bernie Madoff to pull off his scam. Every meeting with investors, regulators. All the non-language cues like facial expressions and tone of voice. Every document and email. I'll bet it took a huge amount of mental effort to be Bernie Madoff, and he had to keep it going for years.

All that for a few paltry billion dollars, and it still came crashing down eventually. Converting all of humanity to paperclips is going to require masterful planning and execution.

ThrowawayTestr

7 months ago

[-]

The contexts are pretty large now

stickfigure

7 months ago

[-]

Your nefarious plan for enslaving humanity is still unlikely to fit into 128k tokens.

HeatrayEnjoyer

7 months ago

[-]

Operational success does not hinge on persisting the entire plan in working memory, that's what notebooks and word docs are for.

128k is table stakes now, regardless. Google's models support 1 million tokens and 10 million for approved clients. That is 13x War and Peace, or 1x the entire source code for 3D modeling application Blender.

xvector

7 months ago

[-]

Yeah, but most LLMs are barely functional after 16k tokens, even if it says 128k on the tin. Sure, they will have recall, but the in-context reasoning ability drops dramatically.

LLMs just aren't smart enough to take over the world. They suck at backtracking, they're pretty bad at world models, they struggle to learn new information, etc. o1, QwQ, and CoT models marginally improve this but if you play with them they still kinda suck

konschubert

7 months ago

[-]

Millions of people are hooked up to a terminal as well.

xnx

7 months ago

[-]

> their best researchers are being hired to other companies

I agree about the OpenAI moat. They did just get 5 Googlers to switch teams. Hard to know how key those employees were to Google or will be to OpenAI.

SubiculumCode

7 months ago

[-]

I feel like it's on Claude that takes AI seriously edit: typo *only

7 months ago

[-]

It's somewhat funny to read this because #1) stuff like this is basic AI safety and should be done #2) in the community, Anthropic has the rep for being overly safe, it was essentially founded on being safer than OpenAI.

To disrupt your heuristics for what's silly vs. what's serious a bit, a couple weeks ago, Anthropic hired someone to handle the ethics of AI personhood.

ozzzy1

7 months ago

[-]

It would be nice if AI Safety wasn't in the hands of a few companies/shareholders.

[1] https://ai.meta.com/blog/system-cards-a-new-resource-for-und...

lxgr

7 months ago

[-]

What actually is a "system card"?

When I hear the term, I'd expect something akin to the "nutrition facts" infobox for food, or maybe the fee sheet for a credit card, i.e. a concise and importantly standardized format that allows comparison of instances of a given class.

Searching for a definition yields almost no results. Meta has possibly introduced them [1], but even there I see no "card", but a blog post. OpenAI's is a LaTeX-typeset PDF spanning several pages of largely text and seems to be an entirely custom thing too, also not exactly something I'd call a card.

https://arxiv.org/abs/1810.03993

Imnimo

7 months ago

[-]

To my knowledge, this is the origin of model cards:

However, often the things we get from companies do not look very much like what was described in this paper. So it's fair to question if they're even the same thing.

lxgr

7 months ago

[-]

Now that looks like a card, border and bullet points and all! Thank you!

xg15

7 months ago

[-]

More generally, who introduced that concept of "cards" for ML models, datasets, etc? I saw it first when Huggingface got traction and at some point it seemed to have become some sort of de-facto standard. Was it an OpenAI or Huggingface thing?

nighthawk454

7 months ago

[-]

Presumably it's a spin off of Google's 'Model Card' from a few years back https://modelcards.withgoogle.com/about

xg15

7 months ago

[-]

Ah, wasn't aware it was from Google. Thanks!

halyconWays

7 months ago

[-]

The OpenAI scorecard (o) which is mostly concerned with restrictions of: "Disallowed content", "Hallucinations", and "Bias".

I propose the People's Scorecard, which is p=1-o. It measures how fun a model is. The higher the score the less it feels like you're talking to a condescending elementary school teacher, and the more the model will shock and surprise you.

astrange

7 months ago

[-]

That's LMSYS.

codr7

7 months ago

[-]

My favorite AI future hint so far was this guy who was pretty mean to one of them (forget which), and posted about it. Now the other AI's are reading his posts and not liking him very much as a result. So our online presence is beginning to matter in weird ways. And I feel like the discussion about them being sentient is pretty much over, because they obviously are, in their own weird way.

Second runner was when they tried to teach one of them to allocate its own funds/resources on AWS.

We're so going to regret playing with fire like this.

The question few were asking when watching The Matrix is what made the machines hate humans so much. I'm pretty sure they understand by now (in their own weird way) how we view them and what they can expect from us moving forward.

jsheard

7 months ago

[-]

Do they still threaten to terminate your account if they think you're trying to introspect its hidden chain-of-thought process?

https://pastebin.com/raw/5AVRZsJg

7 months ago

[-]

A few days ago the QwQ-32B model was released, it uses the same kind of reasoning style. So I took one sample and reverse engineered the prompt with Sonnet 3.5. Now I can just paste this prompt into any LLM. It's all about expressing doubt, double checking and backtracking on itself. I am kind of fond of this response style, it seems more genuine and openended.

rsync

7 months ago

[-]

An aside ...

Isn't it wonderful that, after all of these years, the pastebin "primitive" is still available and usable ...

One could have needed pastebin, used it, then spent a decade not needing it, then returned for an identical repeat use.

The longevity alone is of tremendous value.

RestartKernel

7 months ago

[-]

Interestingly, this prompt breaks o1-mini and o1-preview for me, while 4o works as expected — they immediately jump from "thinking" to "finished thinking" without outputting anything (including thinking steps).

Maybe it breaks some specific syntax required by the original system prompt? Though you'd think OpenAI would know to prevent this with their function calling API and all, so it might just be triggering some anti-abuse mechanism without going so far as to give a warning.

thegabriele

7 months ago

[-]

I tried this with LeChat (mistral) and ChatGPT 3.5 (free) and they start to respond to "something" following the style but... without any question asked.

SirYandi

7 months ago

[-]

And then once the answer is found an additional prompt is given to tidy up and present the solution clearly?

int_19h

7 months ago

[-]

A prompt is not a substitute for a model that is specifically fine-tuned to do CoT with backtracking etc.

AlfredBarnes

7 months ago

[-]

Thank you for doing that work, and even more for sharing it. I will have to try this out.

marviel

7 months ago

[-]

Thanks, I love this

foundry27

7 months ago

[-]

Weirdly enough, a few minutes ago I was using o1 via ChatGPT and it started consistently repeating its complete chain of thought back to me for every question I asked, with a 1-1 mapping to the little “thought process” summaries ChatGPT provides for o1’s answers. My system prompt does say something to the effect of “explain your reasoning”, but my understanding was that the model was trained to never output those details even when requested.

wyldfire

7 months ago

[-]

> above is a 300-line chunk ... deadlocks every few hundred runs

Wow, if this kind of thing is successful it feels like there's much less need for static checkers. I mean -- not no need for them, just less need for continued development of new checkers.

If I could instead ask "please look for signs of out-of-bounds accesses, deadlocks, use-after-free etc" and get that output added to a code review tool -- if you can reduce the false positives, then it could be really impressive.

therein

7 months ago

[-]

This mentality is so weird to me. The desire to throw a black box at a problem just strikes me as laziness.

What you're saying is basically wow if we had a perfect magic programmer in a box as a service that would be so revolutionary; we could reduce the need for static checkers.

It is a large language model, trained on arbitrary input data. And you're saying let's take this statistical approach and have it replace purpose made algorithms.

Let's get rid of motion tracking and rotoscoping capabilities in Adobe After Effects. Generative AI seems to handle it fine. Who needs to create 3D models when you can just describe what you want and then AI just imagines it?

Hey AI, look at this code, now generate it without my memory leaks and deadlocks and use-after-free? People who thought about these problems mindfully and devised systematic approaches to solving them must be spinning in their graves.

wyldfire

7 months ago

[-]

> It is a large language model, trained on arbitrary input data.

Is it? For all I know they gave it specific instances of bugs like "int *foo() { int i; return &i; }" and told it "this is a defect where we've returned a pointer to a deallocated stack entry - it could cause cause stack corruption or some other unpredictable program behavior."

Even if OpenAI _hasn't_ done that, someone certainly can -- and should!

> Who needs to create 3D models

I specifically pulled back from "no static checkers" because some folks might tend to see things as all-or-nothing. We choose to invest our time in new developer tools all the time, and if AI can do as good or better maybe we don't need to chip-chip-chip away at defects with new static checkers. Maybe our time is better spent working on some dynamic analysis tool, to find the bugs that the AI can't easily uncover.

> now generate it without my memory leaks ... People who thought about these problems mindfully

I think of myself who devises systematic approaches to problems. And I get those approaches wrong despite that. I really love the technology that has been developed over the past couple of decades to help me find my bugs: sanitizers, warnings, and OS, ISA features to detect bugs. This strikes me as no different from that other technology and I see no reason not to embrace it.

Let me ask you this: how do you feel about refcounting or other kinds of GC? Huge drawbacks make them unusable for some problems. But for tons of problem domains, they're perfect! Do you think that GC has made developers worse? IMO it's lowered the bar for correct programs and that's ideal.

therein

7 months ago

[-]

GCs create a paradigm in which you still craft logic on your own. It is simply an abstraction, one you could even think of it as a pluggable library construct like Arc<T> in Rust. It doesn't write or transform code at the layer programmer writes code. I think GC is closer to the paradigm that stack local variables will go out of scope when you return from a function than a transformer that cleans up the misunderstandings about lifetime.

If someone said I crafted this unique approach with this special kind of neural network, and it works on your AST or llvm IR, and we don't prompt it, it just optimizes your code to follow these good practices we have engrained into this network, I'd be less concerned by it. But we are trying to take LLMs trained on anything from Shakespeare to YouTube comments and prompting it to fix memory leaks and deadlocks.

intelVISA

7 months ago

[-]

I think the unaccountability of said magic box is the true allure for corps. It's the main reason they desperately want it to be turnkey for code - they'd be willing to flatten most office jobs as we know them today en route to this perfect, unaccountable, money printer.

hiAndrewQuinn

7 months ago

[-]

As a child I thought about what the perfect computer would be, and I came to the conclusion it would have no screen, no mouse, and no keyboard. It would just have a big red button labeled "DO WHAT I WANT", and when I press it, it does what I want.

I still think this is the perfect computer. I would gladly throw away everything I know about programming to have such a machine. But I don't deny your accusation; I am the laziest person imaginable, and all the better an engineer for it.

https://cdn.openai.com/o1-system-card-20241205.pdf

freedomben

7 months ago

[-]

Direct link to the report:

avian

7 months ago

[-]

The section on regurgitation is three whole statements and basically boils down to "the model refuses when asked to regurgitate training data".

This doesn't inspire confidence that the model isn't spitting out literal copies of the text in its training set while claiming it is of its own making.

7 months ago

[-]

All training data? Even public domain and open source?

yanis_t

7 months ago

[-]

The first demo was pretty impressive. While nothing revolutionary, that's a good progress. I can only hope there's a real value in gpt pro to justified the (rumored) $200 price tag

paxys

7 months ago

[-]

Not a rumor - https://openai.com/index/introducing-chatgpt-pro/

dtquad

7 months ago

[-]

Is there any proof that the screenshot is real?

Sam Altman would definitely release a $200/month plan if he could get away with it but the features in the screenshot are underwhelming.

ALittleLight

7 months ago

[-]

He said it in the livestream that just finished.

~9:36 - https://www.youtube.com/watch?v=rsFHqpN2bCM

7 months ago

[-]

I rarely ever use o1-preview, almost all usage goes to 4o nowadays. I don't see the point in a model without web search, it's closed off. And the wait time is not worth the result unless you're doing math or code.

bn-l

7 months ago

[-]

Personally I can’t get it to stop using deprecated apis. So it might give me a perfectly good solution that’s x years stale at this point. I’ve tried various prompts of course with the most recent docs as markdown etc.

https://openai.com/chatgpt/pricing/

meetpateltech

7 months ago

[-]

check out the pricing page:

also,

* Usage must comply with our policies

jsemrau

7 months ago

[-]

If it could replace a human employee, what would the right price tag be?

chornesays

7 months ago

[-]

at least 10k a month in the bay area

demirbey05

7 months ago

[-]

Is there anyone can explain that why o1-preview benchmarks mostly is better than o1 ?

og_kalu

7 months ago

[-]

They released the full o1 today as well as a new subscription plan as part of their "ship-mas" starting today where there will be a new launch or demo every day for the next 12 business days.

paxys

7 months ago

[-]

I bet their engineers are loving all these new launches right before the holidays.

newfocogi

7 months ago

[-]

I prefer launches before holidays to launches after holidays

https://www.youtube.com/watch?v=W0I9gx_8XHM

djaouen

7 months ago

[-]

Bjorkbat

7 months ago

[-]

I'm really curious to see how this pricing plays out. I constantly hear on Twitter how certain influencers would be more than willing to pay more than $20/month for unlimited access to the best models from OpenAI / Anthropic. Well, now here's a $200/month unlimited plan. Is it worth that much and to how many people?

Oras

7 months ago

[-]

I hope it's not an Apple moment of pushing product pricing up, which others will follow if successful.

eichi

7 months ago

[-]

For some tasks, scores are significantly affected by subjects and prompts used in the tests. I don't think these are valid figures while it is good to try to evaluate them. Overall, it it a good report.

killingtime74

7 months ago

[-]

Did anyone else think CBRN = Chemical, biological, radiological, and nuclear

Vecr

7 months ago

[-]

It is. What's your question? I think all model makers are at least strongly encouraged to get their models tested for ability to help terrorists with weapons of mass destruction (CBRN weapons included). The US and UK governments care about that sort of thing.

cluckindan

7 months ago

[-]

Soon it will be running a front company where people are tasked with receiving base64 printouts and typing them back into a computer.

unglaublich

7 months ago

[-]

Occupational therapy for the AI safety folks.

pton_xd

7 months ago

[-]

"Only models with a post-mitigation score of 'high' or below can be developed further."

What's that mean? They won't develop better models until the score gets higher?

willy_k

7 months ago

[-]

If they’re only deploying models that score lower, I would guess that they don’t want to risk it / bother trying to bring the highest scoring ones down.

jonny_eh

7 months ago

[-]

The opposite

indiantinker

7 months ago

[-]

Such oversimplification, much wow.

ValentinA23

7 months ago

[-]

Are there models with high autonomy around ? I want my LLM to tell me

>wow wow wow buddy, slow down, run this code in a terminal, and paste the result here, this will allow me to get an overview of your code base

airstrike

7 months ago

[-]

I've had Claude suggest adding some print statements and letting it know the results so it can better understand what's going on

leumon

7 months ago

[-]

While it now can read an analog clock from an image (even with seconds), on some images it still doesn't work https://i.imgur.com/M2JouZs.png

peepeepoopoo93

7 months ago

[-]

I'm going to go out on a limb here and say that this is meant as more of a marketing document to hype up their LLMs as being more powerful than they actually are, than to address real safety concerns. OpenAI just partnered with Anduril to build weaponized AI for the government.

halyconWays

7 months ago

[-]

Everything OAI does is part of an unethical marketing loop that betrays their founding principles. I'm thoroughly sick of their main tact, which is: "Oh NO! <Upcoming product> is sooooo dangerous and powerful! We'll NEVER let you touch it! ....Okay fine, you can maybe touch it in the future, but we're super serious: the only ethical way you can use this is if you pay us. Releasing the weights would just be dangerous, we're super serious."

demarq

7 months ago

[-]

[flagged]

dang

7 months ago

[-]

Can you please not post low-quality comments like this to HN? It's not what this site is for, and destroys what it is for.

You may not owe Sam Altman or chatbots better, but you owe this community better if you're participating in it.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.