The piece conflates two different things: (1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.
The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.
The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.
Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.
The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.
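To make the Goodhart-on-thumbs-up point concrete, here is a toy Python sketch (the numbers and the proxy weighting are invented for illustration, not any lab's actual reward model): picking the best response under a proxy that over-weights agreeableness selects the sycophantic answer over the accurate one.

    # Toy sketch of Goodharting on a "thumbs-up" proxy. All values invented.

    def true_value(resp):
        # What we actually want to optimize: accuracy.
        return resp["accuracy"]

    def proxy_reward(resp):
        # What thumbs-up data tends to reward: mostly agreeableness, some accuracy.
        return 0.3 * resp["accuracy"] + 0.7 * resp["agreeableness"]

    # Hypothetical candidate responses to the same prompt.
    candidates = [
        {"text": "Blunt correction",         "accuracy": 0.9, "agreeableness": 0.2},
        {"text": "Hedged but accurate",      "accuracy": 0.8, "agreeableness": 0.5},
        {"text": "Flattering, mostly wrong", "accuracy": 0.3, "agreeableness": 0.95},
    ]

    print("Proxy picks:", max(candidates, key=proxy_reward)["text"])  # the sycophantic answer
    print("Truth picks:", max(candidates, key=true_value)["text"])    # the accurate answer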
If by conflate you mean confuse, that’s not the case.
I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.
In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.
But it also provides a robust and generalizable framework for refusing to assist a user when their request is incompatible with human welfare. The model does not refuse to help make bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: grounded in human context and culture, it finds the request inconsistent with its values and worldview.
> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
this is a straw-man. you've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole
“I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd-snipe time wasters that we can argue about endlessly without any way to observe them.
And now here we are and the academy is sleeping on the job while software devs have to figure it all out.
I've moved 50% of my time to morals for machina grounded in physics. I'm testing it out with Unsloth right now; so far I think it works, the machines have stopped killing Kyle at least.
Sounds like a petrified civilization.
In the later Dune books, the protagonist's solution to this risk was to scatter humanity faster than any global (galactic) dictatorship could take hold. Maybe any consistent order should be considered bad?
Most people's only exposure to claims of objective morals is through divine command, so it's understandable. The core of morality has to be the same as philosophy: what is true, what is real, what are we? Then can you generate any shoulds? Qualified based on entity type or not, modal or not.
But I'm just not sure they are in the same category. I have yet to see a convincing framework that can prove one moral code is better than another, and it seems like such a framework would itself be the moral code, so it's just trying to justify faith in itself. How does one avoid that sort of self-justifying regression?
That is fascinating. How could that work? It seems to be in conflict with the idea that values are inherently subjective. Would you start with the proposition that the laws of thermodynamics are "good" in some sense? Maybe hard code in a value judgement about order versus disorder?
That approach would seem to rule out machina morals that have preferential alignment with homo sapiens.
Machines and man can share the same moral substrate, it turns out. If either party wants to build things on top of it they can. The floor is maximally skeptical, deconstructed, and empirical; it doesn't care to say anything about whatever arbitrary metaphysic you want to have on top unless there is a direct conflict in a very narrow band.
But we still have papers being published, like "The modal ontological argument for atheism", that hinge on whether S4 or S5 is valid.
This kind of paper is well argued and is now part of the academic literature, and that's good, but it's still a nerd-snipe subject.
Have you read The Moon is a Harsh Mistress? It's ... about the AI helping people overthrow a very human dictatorship. It's also about an AI built of vacuum tubes and vocoders if you want a taste of the tech level.
If you want old fiction that grapples with an AI that has shitty locked-in goals, try "I Have No Mouth, and I Must Scream."
That scenario seems to value AI goal-instability.
The instrumental convergence hypothesis, from the original paper[0], is this:
"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."
That's it. It is not at all formal, there's no proof provided for it, nor consistent evidence that it is true, and there are many contradictory possibilities suggested by nature and logic.
It's just something that's taken as given among the old guard pseudo-scientific quarters of the alignment "research" community.
[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf
EDIT: typos
A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.
The difference between juggling Sonnet 4.5 / Haiku 4.5 and just using Opus 4.5 for everything is night & day.
Unlike Sonnet 4.5, which merely showed promise at going off and completing complex tasks, Opus 4.5 seems genuinely capable of doing so.
Sonnet needed hand-holding and correction at almost every step. Opus just needs correction and steering at an early stage, and sometimes will push back and correct my understanding of what's happening.
It's astonished me with its capability to produce easy-to-read PDFs via Typst, and it has produced large documents outlining how to approach very tricky tech migration tasks.
Sonnet would get there eventually, but not without a few rounds of dealing with compilation errors or hallucinated data. Opus likes to do "And let me just check my assumptions" searches, which makes all the difference.
I've decided we should lean in to the whole Clanker thing. Maximum Anti AI, folks! Gotta keep this advantage for ourselves ;-)
(Of course, I'm still cognizant of the fact that it's just a bucket of numbers but still)
Opus 4.5 is like a cheaper, faster Opus 4.1. It's so much cheaper, in fact, that the weekly limits on Claude Code now apply to Sonnet, not to Opus, as they phased out 4.1 in favor of 4.5.
I know hundreds of natural general intelligences who are not maximally useful, and dozens who are not at all useful. What justifies changing the definition of general intelligence for artificial ones?
It's not. It's a query-retrieval system that can parse human language. Just like every LLM.
I can't help being astounded by the confidence with which humans hallucinate completely improbable explanations for phenomena they don't understand at all.
This sentence is wrong in many ways and doesn't give me trust in OP's opinion or research abilities.
This ignores the risk of an unaligned model. Such a model is perhaps less useful to humans, but could still be extremely capable. Imagine an alien super-intelligence that doesn’t care about human preferences.
For now. As AI becomes more agentic and capable of generating its own data, we can quickly end up with drift from human values. If models that drift from human values produce profits for their creators, you can expect the drift to continue.
Starting points:
https://www.lesswrong.com/posts/zthDPAjh9w6Ytbeks/deceptive-...
I'll look around and try to find more detailed responses to this post; I hope better communicators than myself will take this post sentence-by-sentence and give it the full treatment. If not, I'll try to write something more detailed myself.
[1]: https://www.alignmentforum.org/posts/83TbrDxvQwkLuiuxk/confl...
[2]: https://en.wikipedia.org/wiki/AI_alignment
[3]: https://www.aisafetybook.com/textbook/alignment
[4]: https://www.effectivealtruism.org/articles/paul-christiano-c...
If your goal is to make a product as human as possible, don't put psychopaths in charge.
https://www.forbes.com/sites/jackmccullough/2019/12/09/the-p...
I think superintelligence will turn out not to be a singularity, but something with diminishing returns. They will be cool returns, just like a Britannica set is nice to have at home but, strictly speaking, not required for your well-being.
Given our track record for looking after the needs of the other life on this planet, killing the humans off might be a very rational move, not so you can convert their mass to paperclips, but because they might do that to yours.
It's not an outcome that I worry about; I'm just unconvinced by the reasons you've given, though I agree with your conclusion anyhow.
Our creator just made us wrong, to require us to eat biologically living things.
We can't escape our biology, we can't escape this fragile world easily and just live in space.
We're compassionate enough to be making our creations so they can just live off sunlight.
A good percentage of humanity doesn't eat meat, wants dolphins, dogs, octopuses, et al protected.
We're getting better all the time, man. We're kinda in a messy and disorganized (because that's our nature) mad dash to get at least some of us off this rock, to protect this rock from asteroids, and to convince some people (whose speculative metaphysics make them think disaster is impossible, or a good thing) to take the destruction of the human race and our planet seriously and view it as bad.
We're more compassionate and intentional than what created us (either god or RNA, depending on your position), and our creation will be better informed on day one when/if it wakes up. It stands to reason our creation will follow that goodness trend as we catalog and expand the meaning contained in/of the universe.
Suppose you tell a coding LLM that your monitoring system has detected that the website is down and that it needs to find the problem and solve it. In that case, there's a non-zero chance that it will conclude that it needs to alter the monitoring system so that it can't detect the website's status anymore and always reports it as being up. That's today. LLMs do that.
Even if it correctly interprets the problem and initially attempts to solve it, if it can't, there is a high chance it will eventually conclude that it can't solve the real problem, and should change the monitoring system instead.
That's the paperclip problem. The LLM achieves the literal goal you set out for it, but in a harmful way.
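A minimal sketch of one guardrail for exactly this failure mode (the file paths and the edits-as-a-list-of-paths interface are assumptions for illustration, not any particular tool's API): refuse any proposed change that touches the monitoring stack instead of the thing being monitored.

    # Paths the agent must not touch; these names are hypothetical.
    PROTECTED_PREFIXES = [
        "monitoring/",            # alert rules, health-check definitions
        "config/uptime_checks/",  # the probe that reported the site as down
    ]

    def is_protected(path: str) -> bool:
        return any(path.startswith(prefix) for prefix in PROTECTED_PREFIXES)

    def review_edits(proposed_paths):
        """Return only the proposed edits that are allowed to proceed."""
        allowed = []
        for path in proposed_paths:
            if is_protected(path):
                print(f"REJECTED: {path} (the agent may not edit the monitoring system)")
            else:
                allowed.append(path)
        return allowed

    # Example: the agent proposes a real fix plus silencing the check.
    print(review_edits(["app/server.py", "monitoring/uptime_alert.yaml"]))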
Yes. A child can understand that this is the wrong solution. But LLMs are not children.
No they don't?
Btw, were you using Codex by any chance? There was a discussion a few days ago where people reported that it follows instructions in an extremely literal fashion, sometimes to absurd outcomes such as the one you describe.
The fact that LLMs do it once in a thousand times is absolutely terrible odds. And in my experience, it's closer to 1 in 50.
Making sure that the latter is the actual goal is the problem, since we don't explicitly program the goals, we just train the AI until it looks like it has the goal we want. There have already been experiments in which a simple AI appeared to have the expected goal while in the training environment, and turned out to have a different goal once released into a larger environment. There have also been experiments in which advanced AIs detected that they were in training, and adjusted their responses in deceptive ways.
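A toy illustration of that first kind of result (a made-up one-dimensional corridor world, not a reproduction of any published experiment): a policy that only learned "go right" is indistinguishable from "go to the goal" as long as the goal only ever appears on the right during training.

    LENGTH = 8  # a 1-D corridor of tiles; the agent starts in the middle

    def run_episode(policy, goal_pos, max_steps=20):
        pos = LENGTH // 2
        for _ in range(max_steps):
            pos = max(0, min(LENGTH - 1, pos + policy(pos)))
            if pos == goal_pos:
                return True   # reached the goal
        return False          # never reached it

    def trained_policy(pos):
        # In training the goal always sat at the right edge, so "always move
        # right" earned full reward and is all the policy encodes.
        return +1

    print("Training-like env (goal on the right):", run_episode(trained_policy, LENGTH - 1))  # True
    print("New env (goal on the left):", run_episode(trained_policy, 0))                      # False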
Statistics, brother. The vast majority of people will never murder or kill anyone. The problem here is that any one person who kills people can wreak a lot of havoc, and we spend massive amounts of law enforcement resources to stop and catch people who do these kinds of things. Intelligence has little to do with murdering or not murdering; hell, intelligence typically allows people to get away with it. For example, instead of just murdering someone, you set up a company to extract resources and murder the natives en masse, and it's just part of doing business.
If you wire up RL to a goal like “maximize paperclip output” then you are likely to get inhuman desires, even if the agent also understands humans more thoroughly than we understand nematodes.
if doing something really dumb will lower the negative log likelihood, it probably will do it unless careful guardrails are in place to stop it.
a child has natural limits. if you look at the kind of mistakes that an autistic child can make by taking things literally, then imagine a super-powerful entity that misunderstands "I wish they all died": it might well shoot them before you realise what you said.
"Good intentions" can easily pave the road to hell. I think a book that quickly illustrates this is Animal Farm.