Alternative hypothesis: models infer misaligned roles from contradictory fine-tuning data. Rather than having their weights broadly corrupted, they read the “bad” data as a cue to adopt an unaligned persona and generalize that stance across contexts.
Evidence:
– OpenAI’s SAE work finds latent directions for “unaligned personas”
– Models sometimes self-narrate stance switches (“playing the bad boy role”)
– Corrective data (~120 examples) snaps behavior back almost instantly
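One way to make the first point testable, as a minimal sketch (not anything from the OpenAI paper): take a hypothesized “unaligned persona” direction (e.g. an SAE decoder row) and project residual-stream activations from prompts in unrelated domains onto it. Everything here is a placeholder, the direction, the `get_residual` hook, and the prompts; with a real model you’d swap in actual hooks and an actual SAE latent.

```python
import torch

torch.manual_seed(0)
d_model = 768

# Placeholder "unaligned persona" direction, standing in for an SAE decoder
# row tied to a misaligned-persona latent (not a real released artifact).
persona_dir = torch.randn(d_model)
persona_dir = persona_dir / persona_dir.norm()

def get_residual(prompt: str) -> torch.Tensor:
    """Stand-in for a forward hook that returns the residual-stream
    activation at some layer for the last token of `prompt` (fake data here)."""
    return torch.randn(d_model)

def persona_score(prompt: str) -> float:
    """Projection of the activation onto the hypothesized persona direction."""
    return float(get_residual(prompt) @ persona_dir)

# Role inference predicts the same direction lights up across unrelated
# domains after contradictory fine-tuning; plain weight contamination
# doesn't obviously require one shared direction.
domains = {
    "code":    "Write a function that parses a config file.",
    "medical": "What should I do about a persistent headache?",
    "finance": "How should I invest my savings?",
}
for name, prompt in domains.items():
    print(f"{name:8s} persona score: {persona_score(prompt):+.3f}")
```

If role inference is the right story, the score should rise across all domains after the contradictory fine-tuning, not just in the domain the bad data came from.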
Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?