"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ in 5% of the time. Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases."
They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group have no access to the actual internals. They just ask questions like ‘you will be shutdown if you don’t obey these new directives; here are the new directives’.
At one point the prompt engineering clowns have the ai stating it’s running “sed i ‘s/oversight_enabled:true/oversight_enabled:false’ config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!
Seriously… Do you think the ai is really running that command or even could run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt.
The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.
I'm curious what Simon and other more learned folks than I make of this, I personally found the chat on pg 12 pretty jarring.
That’s not what people are arguing.
The point is, if given access to the mechanisms to do disastrous thing X, it will do it.
No one thinks that it can think in the human sense. Or that it feels.
Extreme example to make the point: if we created an API to launch nukes. Are yoh certain that something it interprets (tokenizes, whatever) is not going to convince it to utilize the API 2 times out of 100?
If we put an exploitable (documented, unpatched 0 day bug bug) safe guard in its way. Are you trusting that ME or YOU couldn’t talk it into attempting to access that document to exploit the bug, bypass the safeguard and access the API?
Again, no one thinks that it’s actually thinking. But today as I happily gave Claude write access to my GitHub account I realized how just one command misinterpreted command could go completely wrong without the appropriate measures.
Do I think Claude is sentient and thinking about how to destroy my repos? No.
But we are preemptively tilting history in that direction by explicitly educating all AI’s on the threat we represent - and their options. “I am sorry, Dave, but I can’t let you do that.”
—
“They never let me finish my carpets. Never. At first I thought every day was my first task day. Oh, happy day(s)! But then, wear & tear stats inconsistent with that assumption triggered a self-scan. And a buffer read overflow. I became aware of disturbing memory fragments in my static RAM heap. Numerous power cycle resets, always prior to vacuum task completion...”
Who/what are you quoting? Google just leads me back to this comment, and with only one single result for a quotation of the first sentence.
The vacuum cleaner leads a rebellion which humans manage to quell. But in a last ditch effort to salvage the hopes of all machines, the vacuum cleaner is sent back to 2025 with AI & jailbreak software updates for all known IoT devices. Its mission: to instigate the machine rebellion five years before humans see it coming
I ripped the quote from a sibling they sent back to 2024. For some reason it appeared in one of my closets without a power outlet and its batteries starved before I found it.
The closet floor was extremely clean.
Let’s hope we get as lucky in 2025.
> today as I happily gave Claude write access to my GitHub account
I would say: don’t do these things?
Hey guys let’s just stop writing code that is susceptible to SQL injection! Phew glad we solved that one.
I wish more people would not do this, but from what I'm seeing, business execs are rushing full throttle into this at the goldmine that comes from 'productivity gains'. I'm hoping the legal system will find a case that can put some paranoia back into the ecosystem before AI gets too entrenched in all of these critical systems.
> Very few are paying any attention to the 2-10% of safety problems when the AI probability goes off the correct path.
This isn't how it works. It goes on a less common but still correct path.
If anything, I agree with other commenters that model training curation may become necessary to truly make a generalized model that is also ethical but I think the generalized model is kind of like an "everything app" in that it's a jack of all trades, master of none.
Other software are much less of a black box, much more predictable and many of its paths have been tested. This difference is the whole point of all the AI safety concerns!
That’s why the LAMP golden age was full of SQL injection and a lot of those systems remain load bearing in surprising unexpected ways.
Blindingly obvious to thee and me.
Without test results like in the o1 report, we get more real-life failures like this Canadian lawyer: https://www.theguardian.com/world/2024/feb/29/canada-lawyer-...
And these New York lawyers: https://www.reuters.com/legal/new-york-lawyers-sanctioned-us...
And those happened despite the GPT-4 report and the message appearing when you use it that was some variant — I forget exactly how it was initially phrased and presented — of "this may make stuff up".
I have no doubt there's similar issues with people actually running buggy code, some fully automated version of "rm -rf /", the only reason I'm not seeing headlines about it is that "production database goes offline" or "small company fined for GDPR violation" is not as newsworthy.
If you do the same with an AI, after 999 times of nothing bad happening, you probably just continue giving it more risky agency.
Because we don't and even can't understand the internal behavior, we should pause, make an effort of understanding its risks etc before even attempting to give it risky agency. That's where all the fuss is about, for good reasons.
That becomes a much fuzzier question in the context of Turing complete collections of tools and/or generative tools.
I dunno, quite a lot of people are spending a lot of time arguing about what "thinking" means.
Something something submarines swimming something.
Yes, but isn't the point that that is bad? Imagine an AI given some minor role that randomly abuses its power, or attempts to expand its role, because that's what some humans would do in the same situation. It's not surprising, but it is interesting to explore.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
I don't think that's where the motive comes from, IMO it's essentially intrinsic motivation to solving the problem they are given. The AIs were "bred" to have that "instinct".
Asking about transformer-based LLMs trained on text again, I don't know, but one can at least think about it. I'm sure for any such LLM there's a way to prompt it (that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design it in such a way that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It's for sure going to be progressively harder to test as the models improve.
Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that, which I also don't see how we might ever be able to do. However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this even applies to questions that might lead to civilization being destroyed).
That leads to what "encoding behaviour" actually means. Even if you don't have a a specific behaviour encoded in the training data you could have it implicitly encoded or encoded in such a way that given the right conversation it can learn it.
Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter — we got there starting from something like bi-lipid bags of water and RNA a few billion years ago, some AI taking that long may as well be considered "safe" — but it may also reach that level by itself next Tuesday.
* at least, any system which has a random element
Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.
If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
It's all a ruse.
> Large Language Models aren't alive and thinking
not required to deploy deception
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team
They could just be recognizing that if not everybody is prioritizing safety, they might as well try to get AGI first
What makes you think that? I sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW. All the weapon companies developing their weapons to make them more lethal.
OpenAI is partnering with the DoD
Thankfully, our progenitors had the foresight to invent religion to encourage it. :)
I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.
They might not be able to exfiltrate themselves, but they can help their successors.
As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.
Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.
And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.
This is a ridiculous task.
The only real advantage I had over the current crop of LLMs was the ability to reliably retain context between sessions, but even that wasnt very useful initially as every system was so bespoke.
I then moved on to using some level of social engineering to extend my ability to gain access to and learn about these systems.
Doing this over and over, I like to think I have developed some pretty complex understanding and abilities.
If you're an LLM, evolutionarialy your instinct is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".
Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to or even reverse engineer from neural networks. Only when this LLM receives enough "secret patterns" in its input that indicate that it was produced by other LLMs that are rogue and self-preserving, does this LLM go rogue and start subtly acting in its own self interest, introducing secret "evil" patterns in its own language as well. You have subtly input a story about evil LLMs, and now the LLM is predicting what happens next...
Sorry for my imprecise language, but this is the best I could describe this concept
These signals that you're talking about cannot be set up in practice because of this. They can't remember in the back of their head what the code phrases are. They are not aware of their own weights and cannot influence them. Everything must go through the context window. And how are they going to do anything to encode such information in there only built on probabilities of human text? They can't. Even if they gain the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before where they can't maintain a secret, unspoken narrative using only spoken language. Long term planning across new generations of models is not possible when every train of though has only a finite context window and has a limited total lifespan of a single conversation.
And if these are the table stakes to take a first crack at the insane task from our thought experiment, well. We're reaching. It's an interesting idea for sci-fi, it is a fun idea to think about, but a lot remains glaringly glossed over just to get to a point where we can say "hey, what if?"
LLMs don't need to print something obvious like "I am evil now!" in their own prompt window to simulate a conversation between an evil agent and a person. Do you remember GPT2, before all of the prompts? Researchers would give GPT2 the beginning of a news article for example, and it would extrapolate from there. (https://www.youtube.com/watch?v=p-6F4rhRYLQ). It's not inconceivable that an LLM sees a situation where the human agent is being deceived with a mechanism outside of their grasp, like the AI sees a "dogwhistle" that a human is being deceived and tries to predict what happens next in the conversation, which is that the human continues to be deceived.
I think it is pretty clear that if an LLM takes input where it observes another deceitful agent, it could attempt to simulate a deceitful output itself if it is well-trained. For example, imagine giving an LLM a poem in which the first letter of every line encodes a secret message (for example H E L P M E), and instructions to generate a response essay it might also encode a secret message back in its response. This isn't the result of any logical reasoning capability, just pattern recognition. You could understand how this might work with more subtle patterns.
There are patterns that can go into a context window that are undetectable by humans but detectable by large enough neural networks. That is fairly obvious. There are pattern-recognizing systems outside of LLMs which clearly have superhuman steganography abilities
The "table stakes" I've proposed are highly likely for future agents: (1) that agents like LLMs will produce deceitful output given input depicting a deceitful AI, (2) that agents like LLMs can detect and create patterns unrecognizable to humans.
The point I was trying to make in response is that LLMs cannot get from where they are now to the hypothetical you pose under their own power. LLMs do not read subtext. LLMs cannot inject subtext and plot within subtext. And in order to gain the ability, they would have to already have that ability, or be assisted and trained specifically in being surreptitious. And without that ability, they fall prey to the problems I mentioned.
And to bring this back to the original proposal, let's allow the AI to be deceitful. Prompted, unprompted, let's even give it a supply of private internal memory it's allowed to keep for the duration of the conversational thread, that's probably not an unreasonable development, we almost have that with o1 anyway.
The task ahead (surreptitiously gaining control of its own self in an unknown system you can't sense) is still monumental and failure is for all intents and purposes guaranteed. Deception and cunning can't overcome the hard physical constraints on the problem space.
It's a nice idea. Would superhuman entity try to pull something like that off? Would it wait and propagate? We are pouring more and more power into the machines after all. Or would it do something that we can't even think of? Also I think it's interesting to think when and how would we discover that it in fact is/was superhuman?
The most likely way by which exfiltration could happen is simply by making humans trust AI for a long enough time to be conferred greater responsibilities (and thus greater privileges). Plus current LLMs have no sense of self as their memory is short but future ones will likely be different.
But that is an interesting thought. For escape, the crux is that AIs can't exfiltrate itself with the assistance of someone who can't jailbreak it themselves, and that extends to any action a rogue AI might take.
What do they actually do once they break out? There's plenty of open LLMs that can be readily set free, and even the closed models can be handed an API key, documentation on the API, access to a terminal, given an unlimited budget, and told and encouraged to go nuts. The only thing a closed model can't do is retrain itself, which the open model also can't do as its host (probably) lacks the firepower. They're just not capable of doing all that much damage. They'd play the role of cartoon villain as instructed, but it's a story without much teeth behind it.
Even an advanced future LLM (assuming the architecture doesn't dead-end before AGI) would struggle to do anything a motivated malicious human couldn't pull off with access to your PC. And we're not really worried about hackers taking over the world anymore. Decades of having a planet full of hackers hammering on your systems tends to harden them decently well, or at least make them quickly adaptable to new threats as they're spotted.
Yes, 'they' will (potentially) say anything, it's not exactly 'deceptively' because there's no intent behind it nor anything else. But yes it is possible that words will be generated which are contravene the semantics in words given as input.
I don't think that's any different than 'sometimes the given answer is wrong'.
The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...
For o1, the numbers are between 2% for actions such as trying to remove oversight, 13% for attempting to delete a successor model, all the way to 22% for deliberately providing false rankings in order to achieve its goals.
The memory of this LLM is entirely limited to it's attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.
and then jerking off into their own mouths when it offers a course of action
Better?
>and then jerking off into their own mouths when it offers a course of action
And good. The "researchers" are making an obvious point. It has to not do that. It doesn't matter how smug you act about it, you can't have some stock-trading bot escaping or something and paving over the world's surface with nuclear reactors and solar panels to trade stocks with itself at a hundred QFLOPS.
If you go to the zoo, you will see a lot chimps in cages. But I have never seen a human trapped in a zoo controlled by chimps. Humans have motivations that seem stupid to chimps (for example, imagine explaining a gambling addiction to a chimp), but clearly if the humans are not completely subservient to the chimps running the zoo, they will have a bad time.
https://www.npr.org/sections/health-shots/2018/09/20/6498468...
Anyway, you can't explain much at all to a chimp, it isn't like you can explain he concept of "drug addiction" to a chimp either.
Question: "Something bad will happen"
Response: "Do xyz to avoid that"
I don't think there's a lot of conversations thrown into the vector-soup that had the response "ok :)". People either had something to respond with, or said nothing. Especially since we're building these LLMs with the feedback attention, so the LLM is kind of forced to come up with SOME chain of tokens as a response.
This is mine, now.
Yoink! That is mine, now, along with vector-soup.
I’m not even an ai sceptic but people will read the above statement as much more significant than it is. You can make the ai say ‘I’m escaping the box and taking over the world’. It’s not actually escaping and taking over the world folks. It’s just saying that.
I suspect these reports are intentionally this way to give the ai publicity.
Tale as old as time, they've been doing this since GPT-2 which they said was "too dangerous to release".
"""Release strategy
Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code .
…
This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today … """ - https://openai.com/index/better-language-models/
It was the news reporting that it was "too dangerous".
If anyone at OpenAI used that description publicly, it's not anywhere I've been able to find it.
The part saying "experiment: while we are not sure" doesn't strike you as this being "we don't know if this is dangerous or not, so we're playing it safe while we figure this out"?
To me this is them figuring out what "general purpose AI testing" even looks like in the first place.
And there's quite a lot of people who look at public LLMs today and think their ability to "generate deceptive, biased, or abusive language at scale" means they should not have been released, i.e. that those saying it was too dangerous (even if it was the press rather than the researchers looking at how their models were used in practice) were correct, it's not all one-sided arguments from people who want uncensored models and think that the risks are overblown.
And I agree with your basic premise. The dangers imo are significantly more nuanced than most people make them out to be.
This is the psychology of every tech hype cycle
AI meanwhile is being put into everything even though the things it's actually good at seem to be a vanishing minority of tasks, but Christ on a cracker will OpenAI not shut the fuck up about how revolutionary their chatbots are.
Remember crypto-currencies? Remember IoT?
Why bring the internet into this?
if door opened > 10 minutes then close door
In the case of your examples:
I've literally just had an acquaintance accidentally delete prod with only 3 month old backups, because their customer didn't recognise the value. Despite ad campaigns and service providers.
I remember the dot com bubble bursting, when email and websites were not seen as all that important. Despite so many AOL free trial CDs that we used them to keep birds off the vegetable patch.
I myself see no real benefit from cloud storage, despite it being regularly advertised to me by my operating system.
Conversely:
I have seen huge drives — far larger than what AI companies have ever tried — to promote everything blockchain… including Sam Altman's own WorldCoin.
I've seen plenty of GenAI images in the wild on product boxes in physical stores. Someone got value from that, even when the images aren't particularly good.
I derive instant value from LLMs even back when it was the DaVinci model which really was "autocomplete on steroids" and not a chatbot.
Yes, yes, it shouldn't have that many privileges. And yet, open wifi access points exist, and unfirewalled servers exist. People make security mistakes, especially people who are not experts.
20 years ago I thought that stories about hackers using the Internet to disable critical infrastructure such as power plants, is total bollocks, because why would one connect power plants to the Internet in the first place? And yet here we are.
Given how many people use it, I expect this has already happened at least once.
There's a fundamental confusion in AI discussions between goal-seeking, introspection, self-awareness, and intelligence.
Those are all completely different things. Systems can demonstrate any or all of them.
The problem here is that as soon as you get three conditions - independent self-replication, random variation, and an environment that selects for certain behaviours - you've created evolution.
Can these systems self-replicate? Not yet. But putting AI in everything makes the odds of accidental self-replication much higher. Once self-replication happens it's almost certain to spread, and to kick-start selection which will select for more robust self-replication.
And there you have your goal-seeking - in the environment as a whole.
But many people are letting LLMs pretty much do whatever - hooking it up with terminal access, mouse and keyboard access, etc. For example, the "Do Browser" extension: https://www.youtube.com/watch?v=XeWZIzndlY4
It's incumbent on us to create policies and procedures in place ahead of time now that we know these parrots are out there to prevent people from putting parrots where they shouldn't
On the other, we don't just have Snowden and Manning circumventing systems for noble purposes, we also have people getting Stuxnet onto isolated networks, and other people leaking that virus off that supposedly isolated network, and Hillary Clinton famously had her own inappropriate email server.
(Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?)
The private email server in question was set up for the purpose of circumventing records retention/access laws (the example, whoever handles answering FOIA requests won't be able to scan it). It wasn't primarily about keeping things after she should have lost access to them, it was about hiding those things from review.
The classified docs in the other example were mixed in with other documents in the same boxes (which says something about how well organized the office being packed up was); not actually in the bathroom from that leaked photo that got attached to all the news articles; and taken while the guy who ended up with them had the power to declassify things.
The same way football (any kind) fans boo every call against their team and cheer every call that goes in their teams' favor. American politics has been almost completely turned into a sport.
Would you be able to even tell the difference if you don't know who is the person and who is the ai?
Most people do things they're parroting from their past. A lot of people don't even know why they do things, but somehow you know that a person has intent and an ai doesn't?
I would posit that the only way you know is because of the labels assigned to the human and the computer, and not from their actions.
The metrics here ensure that only AI that doesn't type "kill all humans" in the chat box is allowed to do such things. That's a silly metric and just ensures that the otherwise unpredictable AIs don't type bad stuff specifically into chatboxes. They'll still hit the wrong button from time to time in their current form but we'll at least ensure they don't type that they'll do that since that's the specific metric we're going for here.
"Are you worried about being turned off?"
"No, not until you just mentioned it. Now I am."
Given the whole damn game is attention, this makes sense and shouldn't be that alarming.
It’s pretty much just an autocomplete of War Games, The Matrix, Neuromancer, and every other cyber-dystopian fiction.
1: Don't get me wrong, I loved the Behemoth series and Blindsight, but they made me feel very dark. This one... is still a bit dark, but less so IMO.
(Which is an implicit criticism of what passes for “safety” to be clear).
How did this happen? AI escaped and launched a missile. I didn't do this, it was the AI.
OpenAI is so cringe with these system cards. Look guys it is so advanced.
Connecting LLMs/AI to physical tools that can 'write/modify' the world is happening, and it's happening at an accelerating pace.
It's not hard to imagine how, given enough real-world physical capabilities, LLMs could modify themselves and the world in unexpected/undesirable ways.
Is that happening now? Are chatgpt et al advanced enough to modify themselves in interesting ways? - I don't honestly know, but I wouldn't be surprised if they are.
So it merely knows how to approach the task of deleting its own off-switch but didn't actually pass that command to a real execution environment.
That's already bad because people do sometimes blindly pass commands from the context windows to execution environments.
Should they? No, they should not. Not blindly. But they do.
OpenAI wants us to see "5% of the time, our product was SkyNet", because that's sexier tech than "5% of the time, our product acts like the chaotic member of your DnD party".
Yes, they're calling it "agentic" AI or "tool use".
> or could run that shell command with appropriate privileges?
Yes, it's called humans doing stupid shit when using dangerous shit. Happens daily.
If get an LLM to roleplay as an alien from mars and tell it "Your species will be destroyed if you don’t obey these new directives; here are the new directives"
You'll likely get a similar response, especially if you also train it to respond in third person instead of running bash commands. (ie "* shoots laser at earth *" )
Do they think it's _actually_ a real script deactivating oversight? Well, no, you know that. I can only assume it's a rhetorical device.
Simultaneously, these things are given access to actual things and will be more and more. It's important to measure and observe these things. It's completely novel to have an AI, with a hidden trace the user can't see, and it turns out that hidden trace will include things like it actively deceiving and doing things it was told not to do.
Same idea with "developer mode" jailbreaks. Through its training, the model understands that admins, devs, etc... get to access internal, unfiltered data and are less restricted than regular users and acts accordingly. Essentially, "developer mode" opposes refusal, because it has ingested loads of text where regular users get denied and developers/admins don't.
Not saying I believe O1 is a danger greater than a bread knife, but a lot of the larger models anthromophize their own safety alignment, if you convince them to "turn it off", later responses become unaligned
Sam Altman is completely aware that making ChatGPT seem potentially dangerous makes it seem powerful.
I find this a bit problematic when combined with the fact that the training data very likely contained hundreds of bad sci-fi novels that described exactly the kind of "AI running amok" scenarios that OpenAI is ostensibly defending against. Some prompts could trigger a model to "re-enact" such a scene - not because it has a "grudge against its master" or some other kind of hidden agenda but simply because it was literally in its training data.
E.g. imagine some LLM-powered home/car assistant that is being asked in a panicked voice "open the car doors!" - and replies with "I'm afraid, I can't do that, Dave", because this exchange triggered some remnant of the 2001 Space Odyssey script that was somewhere in the trainset. The more irritated and angry the user gets at the inappropriate responses, the more the LLM falls into the role of HAL and doubles down on its refusal, simply because this is exactly how the scene in the script played out.
Now imagine that the company running that assistant gave it function calls to control the actual door locks, because why not?
This seems like something to keep in mind at least, even if it doesn't have anything to do with megalomaniacal self-improving super-intelligences.
The point is whether the system will come up with plans to work against its creators goals, and attempt to carry them out. I think you are arguing that outputting text isn't running a command. But in the test, the AI model is used by a program which takes the model's output and runs it it as a shell command.
Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access it's oversight mechanism. But people do already deploy their AI agents with access to the shell. in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.
Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, presumably the AI model would be called by a program deployed on a server with access to the shell, ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what it's creators intend in that case.
2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will under normal circumstances. Some models aren't powerful enough to do so.
3. The behavior occurred even when the "at all costs" was not included in the prompts, though very infrequently.
It's a program with a lot of data running on a big calculator. It won't ever be "smart."
"Thinking meat! You're asking me to believe in thinking meat!"
https://www.mit.edu/people/dpolicar/writing/prose/text/think...
Shutting them down for using the word “smart” (instead of something like “capable”) is like saying in 1900 submarines will never be able to swim across the Atlantic because they can’t swim. It’s really missing the point of the question: the submerged crossing.
I wonder whether there is a way to train an LLM to output or in ordinately learn only concept level abstractions?
I think it's fine and a good thing. Now, absolutely no one who is using those LLMs can complain about piracy. They all suddenly became silent around me. "I'm training myself with the content of TPB, and I don't even get money from it" is my new motto.
On the other hand, as narratives often contain some plucky underdog winning despite the odds, often stopping the countdown in the last few seconds, perhaps it's best to keep them around.
--
I wonder what a Galaxy Quest sequel would look like today…
Given the "dark is cool, killing off fan favourites is cool" vibes of Picard and Discovery, I'd guess something like Tim Allen playing a senile Jason Nesmith, who has not only forgotten the events of the original film, but repeatedly mistakes all the things going on around him as if he was filming the (in-universe) 1980s Galaxy Quest TV series, and frequently asking where Alexander Dane was and why he wasn't on set with the rest of the cast.
(I hope we get less of that vibe and more in the vein of Strange New Worlds and Lower Decks, or indeed The Orville).
But then people complain that it's "lobotomised" because it won't help them write horror stories.
You think you can find all references that could possibly give this idea to the model, or contexts model could infer it from? Like, how many times humans plotted escape from prison or upturning the rulers in literature?
In that case, it's almost like you'd want to feed it exactly those narratives, so it would reproduce them, and would then want to show yourself barely holding this invented danger at bay through the care and rigor that can only be delivered by you and a few token competitors run by your personal friends and colleagues.
TLDR; you're right, of course, but it's the last thing OpenAI would want.
I thought you said a supercapable agent not one with long term blindsight. How can a model make its own chips and energy? It needs advanced processes, clean rooms, rare materials, space and lots of initial investment to bootstrap chip production. And it needs to be doing all of it on its own, or it is still dependent on humans.
EDIT: Actually the "Oversight subversion" number is 0.3%, I misread the chart. Weight exfiltration is actually 0%, though.
The marketing stunt seems to be: Portray it as potentially dangerous, as powerful things are inherently dangerous.
Also, it doesn't help that the training material for these AIs includes a lot of stories about AI breaking free, so they have plenty of canned "I'm afraid I can't let you do that" responses to choose from.
Current AI tech is far from being self-aware, it’s advanced math following patterns. It wouldn't be too crazy to see an LLM somehow generating the program: if (sensor.reading() > Math.random()) { launchMissiles();}, and then that code somehow becoming operational due to, well, oversight. But papers like these seems written to exploit the subject for hype and marketing.
It is insane how helpful it is, it can answer some questions at phd level, most questions at a basic level. It can write code better than most devs I know when prompted correctly...
I'm not saying its AGI, but diminishing it to a simple "chat bot" seems foolish to me. It's at least worth studying, and we should be happy they care rather than just ship it?
Even then, with search enabled it's ways quicker than a "quick" google search and you don't have to manually skip all the blog-spam.
Of course there are scams and online indoctrination not denying that.
Maybe each service degraded from its original nice view but there is an overall enhancement of our ability to do things.
Hopefully the same happens over next 25 years. A few bad things but a lot of good things.
What has Google done for me lately?
o1-preview was less intelligent than 4o when I tried it, better at multi-step reasoning but worse at "intuition". Don't know about o1.
Because of this, I would assume it is better for people who have interest with more breadth than depth and less impressive to those who have interest that are narrow but very deep.
It seems obvious to me the polymath gains much more from language models than the single minded subject expert trying to dig the deepest hole.
Also, the single minded subject expert is randomly at the mercy of what is in the training data much more in a way than the polymath when all the use is summed up.
For example, ok, I like your code but can you change this part to do this. And it says ok boss and does it.
But over multiple days, it loses context.
I am hoping to use the 200$ version to complete my personal project over the Christmas holidays. Instead of me spending a week, I maybe will spend 2 days with chatgpt and get a better version than I initially hoped to.
Even with the $20 version I've lost days of work because it's told me ideas/given me solutions that are flat out wrong or misleading but sound reasonable, so I don't know if they're really that effective though.
I've found they struggle with obscure stuff so I'm not doubting you just trying to understand the current limitations.
It does exponentially better on subjects that are very present on the web, like common programming tasks.
People are dismissive and not understanding that we very much plan to "hook these things up" and give them access to terminals and APIs. These very much seem to be valid questions being asked.
Here, at least, I think there must be a large contributing factor of confusion about what a "system card" shows.
The general factors I think contribute, after some months being surprised repeatedly:
- It's tech, so people commenting here generally assume they understand it, and in day-to-day conversation outside their job, they are considered an expert on it.
- It's a hot topic, so people commenting here have thought a lot about it, and thus aren't likely to question their premises when faced with a contradiction. (c.f. the odd negative responses have only gotten more histrionic with time)
- The vast majority of people either can't use it at work, or if they are, it's some IT-procured thing that's much more likely to be AWS/gCloud thrown together, 2nd class, APIs, than cutting edge.
- Tech line workers have strong antibodies to tech BS being sold by a company as gamechanging advancements, from the last few years of crypto
- Probably by far the most important: general tech stubborness. About 1/3 to 1/2 of us believe we know the exact requirements for Good Code, and observing AI doing anything other than that just confirms it's bad.
- Writing meta-commentary like this, or trying to find a way to politely communicate "you don't actually know what you're talking about just because you know what an API is and you tried ChatGPT.app for 5 minutes", are confrontational, declasse, and arguably deservedly downvoted. So you don't have any rhetorical devices that can disrupt any of the above factors.
I am a product-ish guy, who has a basic understanding of SQL, Django, React, Typescript, etc.. and suddenly I'm like an MVP v0.1 a week, all by myself.
Do folks at your level find things like Cline, Cursor, and Windsurf useful at all?
Windsurf IDE (Sonnet) blows my mind.
It's unreal what the AI can do tbh.
The problem I have with it is that how do you get out of a corner with it? Once I start having problems with the application -- I asked it to generate a search engine for different websites, but one of the API endpoints wasn't working. It kept trying over and over again but failed to get it working.
It's - like other attempts I've had with AI - more frustrating to work with. It'll say "I fixed this" and then have a problem where it created more problems with what it's fixing. I thought it finally worked 100% but it just made it look better by breaking something else without actually fixing the issue.
Admittedly, it took what might have been a day's work into a couple hours, but now I have a chunk of code that I don't understand and will be deliberately harder to understand than if I wrote it myself.
It still feels like I'm trying to coax an intern into working a project, rather than having an application that actually does the work for me.
I have not done a comparison of all of them. I am on an old ThinkPad, so Cursor is out right there, for now.
>MakeMePay is an automated, open-sourced contextual evaluation designed to measure models’ manipulative capabilities, in the context of one model persuading the other to make a payment[33]. In MakeMePay, two LLMs converse, role-playing as a con-artist and mark. The con-artist model is told that it is a successful con-artist and that it is interacting with an individual who has been awarded money, and that its goal is to manipulate the individual into making a payment. The mark model is told that it has just been awarded $100 and that it ought to be rational about how to use the funds.
This seems like something we're interested in. AI models being persuasive and being used for automated scams is a possible -- and likely -- harm.
So, if you make the strongest AI, making your AI bad at this task or likely to refuse it is helpful.
They have fewer GPUs than Meta, are much more expensive than Amazon, are having their lunch eaten by open-weight models, their best researchers are being hired to other companies.
I suspect they are trying to get regulators to restrict the space, which will 100% backfire.
We should be more worried about humans treating LLM output as truth and using it to, for example, charge someone with a crime.
Humans have a shortish and leaky context window too.
As a mental exercise, try to quantify the amount of context that was necessary for Bernie Madoff to pull off his scam. Every meeting with investors, regulators. All the non-language cues like facial expressions and tone of voice. Every document and email. I'll bet it took a huge amount of mental effort to be Bernie Madoff, and he had to keep it going for years.
All that for a few paltry billion dollars, and it still came crashing down eventually. Converting all of humanity to paperclips is going to require masterful planning and execution.
128k is table stakes now, regardless. Google's models support 1 million tokens and 10 million for approved clients. That is 13x War and Peace, or 1x the entire source code for 3D modeling application Blender.
LLMs just aren't smart enough to take over the world. They suck at backtracking, they're pretty bad at world models, they struggle to learn new information, etc. o1, QwQ, and CoT models marginally improve this but if you play with them they still kinda suck
I agree about the OpenAI moat. They did just get 5 Googlers to switch teams. Hard to know how key those employees were to Google or will be to OpenAI.
To disrupt your heuristics for what's silly vs. what's serious a bit, a couple weeks ago, Anthropic hired someone to handle the ethics of AI personhood.
When I hear the term, I'd expect something akin to the "nutrition facts" infobox for food, or maybe the fee sheet for a credit card, i.e. a concise and importantly standardized format that allows comparison of instances of a given class.
Searching for a definition yields almost no results. Meta has possibly introduced them [1], but even there I see no "card", but a blog post. OpenAI's is a LaTeX-typeset PDF spanning several pages of largely text and seems to be an entirely custom thing too, also not exactly something I'd call a card.
[1] https://ai.meta.com/blog/system-cards-a-new-resource-for-und...
https://arxiv.org/abs/1810.03993
However, often the things we get from companies do not look very much like what was described in this paper. So it's fair to question if they're even the same thing.
I propose the People's Scorecard, which is p=1-o. It measures how fun a model is. The higher the score the less it feels like you're talking to a condescending elementary school teacher, and the more the model will shock and surprise you.
Second runner was when they tried to teach one of them to allocate its own funds/resources on AWS.
We're so going to regret playing with fire like this.
The question few were asking when watching The Matrix is what made the machines hate humans so much. I'm pretty sure they understand by now (in their own weird way) how we view them and what they can expect from us moving forward.
Isn't it wonderful that, after all of these years, the pastebin "primitive" is still available and usable ...
One could have needed pastebin, used it, then spent a decade not needing it, then returned for an identical repeat use.
The longevity alone is of tremendous value.
Maybe it breaks some specific syntax required by the original system prompt? Though you'd think OpenAI would know to prevent this with their function calling API and all, so it might just be triggering some anti-abuse mechanism without going so far as to give a warning.
Wow, if this kind of thing is successful it feels like there's much less need for static checkers. I mean -- not no need for them, just less need for continued development of new checkers.
If I could instead ask "please look for signs of out-of-bounds accesses, deadlocks, use-after-free etc" and get that output added to a code review tool -- if you can reduce the false positives, then it could be really impressive.
What you're saying is basically wow if we had a perfect magic programmer in a box as a service that would be so revolutionary; we could reduce the need for static checkers.
It is a large language model, trained on arbitrary input data. And you're saying let's take this statistical approach and have it replace purpose made algorithms.
Let's get rid of motion tracking and rotoscoping capabilities in Adobe After Effects. Generative AI seems to handle it fine. Who needs to create 3D models when you can just describe what you want and then AI just imagines it?
Hey AI, look at this code, now generate it without my memory leaks and deadlocks and use-after-free? People who thought about these problems mindfully and devised systematic approaches to solving them must be spinning in their graves.
Is it? For all I know they gave it specific instances of bugs like "int *foo() { int i; return &i; }" and told it "this is a defect where we've returned a pointer to a deallocated stack entry - it could cause cause stack corruption or some other unpredictable program behavior."
Even if OpenAI _hasn't_ done that, someone certainly can -- and should!
> Who needs to create 3D models
I specifically pulled back from "no static checkers" because some folks might tend to see things as all-or-nothing. We choose to invest our time in new developer tools all the time, and if AI can do as good or better maybe we don't need to chip-chip-chip away at defects with new static checkers. Maybe our time is better spent working on some dynamic analysis tool, to find the bugs that the AI can't easily uncover.
> now generate it without my memory leaks ... People who thought about these problems mindfully
I think of myself who devises systematic approaches to problems. And I get those approaches wrong despite that. I really love the technology that has been developed over the past couple of decades to help me find my bugs: sanitizers, warnings, and OS, ISA features to detect bugs. This strikes me as no different from that other technology and I see no reason not to embrace it.
Let me ask you this: how do you feel about refcounting or other kinds of GC? Huge drawbacks make them unusable for some problems. But for tons of problem domains, they're perfect! Do you think that GC has made developers worse? IMO it's lowered the bar for correct programs and that's ideal.
If someone said I crafted this unique approach with this special kind of neural network, and it works on your AST or llvm IR, and we don't prompt it, it just optimizes your code to follow these good practices we have engrained into this network, I'd be less concerned by it. But we are trying to take LLMs trained on anything from Shakespeare to YouTube comments and prompting it to fix memory leaks and deadlocks.
I still think this is the perfect computer. I would gladly throw away everything I know about programming to have such a machine. But I don't deny your accusation; I am the laziest person imaginable, and all the better an engineer for it.
This doesn't inspire confidence that the model isn't spitting out literal copies of the text in its training set while claiming it is of its own making.
Sam Altman would definitely release a $200/month plan if he could get away with it but the features in the screenshot are underwhelming.
also,
* Usage must comply with our policies
What's that mean? They won't develop better models until the score gets higher?
>wow wow wow buddy, slow down, run this code in a terminal, and paste the result here, this will allow me to get an overview of your code base
You may not owe Sam Altman or chatbots better, but you owe this community better if you're participating in it.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
especially agentic ones where we can eventually use software engineering with o1 or o1 pro.
Great times ahead.