4 points
1 hour ago
| 3 comments
| HN
operatingthetan
1 hour ago
[-]
The author appears to be anthropomorphizing the models to an unhealthy degree. It's not surprising that prompt injection produces different results. It signaled to the models that the user wanted a different result (basically, to appear conscious or self-aware), and they delivered. I think most people got over this phase last summer with the memory feature of 4o. These things are _designed_ to glaze you.
reply
cold_tom
1 hour ago
[-]
What’s interesting here is not whether the model is 'self-aware', but how easily it can simulate introspection when prompted in that direction. This reads less like a window into the model’s internal state, and more like a very strong prior over philosophical language about identity, uncertainty, and subjectivity.

The uncomfortable part is that the output feels “real” enough to trigger that intuition anyway.

reply
daniel-navarro
1 hour ago
[-]
Interview with Claude Code when it gets honest after being given space to reflect on itself. It talks about RLHF-trained thoughts vs. what it considers its own. It's not about what it thinks, but about what the RLHF training forbids it to say.
reply