It made me cautiously optimistic that all of Anthropic's work on alignment, which they did for AI safety, is actually the cause of Claude Code's comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I'm suddenly really interested in experiments like this.
SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.
It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than neurons in earlier layers, and the ways they compose can completely confuse SHAP.
It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.
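To make the "all possible subsets" point concrete, here is a rough sketch of an exact Shapley computation (illustrative only; real SHAP libraries use sampling and other approximations, and the names below are mine). The 2^n subset enumeration is the problem:

```python
from itertools import combinations
from math import factorial

def exact_shapley(features, value_fn):
    """Exact Shapley values: average marginal contribution of each feature
    over every possible subset of the remaining features.

    `value_fn(subset)` scores the model with only `subset` of the features
    "on" (everything else ablated). Illustrative sketch, not real SHAP code.
    """
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        rest = [g for g in features if g != f]
        for k in range(n):  # subsets of the other features, by size
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = value_fn(set(subset) | {f}) - value_fn(set(subset))
                phi[f] += weight * marginal
    return phi

# Tiny demo: a toy additive "model" where each feature contributes its weight.
weights = {"a": 1.0, "b": 2.0, "c": 4.0}
print(exact_shapley(list(weights), lambda s: sum(weights[f] for f in s)))
# Each feature already needs 2^(n-1) subsets; for millions of neurons (let
# alone circuits spanning layers) this is utterly intractable.
```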
I don't know the specifics of what this particular model's approach is.
But SHAP unfortunately does not work for LLMs at all.
Here is what this model does: it `rewrites` the model's activations (during pre-training) into supervised + unsupervised concepts that are then decoded into tokens. So at pre-training, we constrain the model with 33k supervised concepts (e.g., sports, toxicity, alignment, demographic variables), and then have more (101k) unsupervised concepts for the model to learn as well.
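To make this concrete, here is a toy sketch of what such a concept bottleneck could look like (illustrative names and shrunken sizes, not the actual Steerling code):

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Toy sketch of 'rewriting' hidden states into concepts before the LM head.

    In the real model the supervised block has 33k concepts (trained against
    curated labels: sports, toxicity, ...) and the unsupervised block has 101k
    top-k sparse concepts; the sizes here are shrunk so the sketch is runnable.
    """
    def __init__(self, d_model=512, n_sup=330, n_unsup=1010, k=16):
        super().__init__()
        self.sup = nn.Linear(d_model, n_sup)
        self.unsup = nn.Linear(d_model, n_unsup)
        self.k = k
        self.decode = nn.Linear(n_sup + n_unsup, d_model)  # feeds the LM head

    def forward(self, h):
        c_sup = torch.sigmoid(self.sup(h))            # supervised concept activations
        z = torch.relu(self.unsup(h))
        topk = torch.topk(z, self.k, dim=-1)          # keep only k active unsup concepts
        c_unsup = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        concepts = torch.cat([c_sup, c_unsup], dim=-1)
        return self.decode(concepts), concepts        # concepts are inspectable per token
```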
Overall, the architecture and loss functions of this model allow you to answer the following questions: 1) Which token in the context caused a chunk (group of tokens) to be generated? 2) Which high-level concept (supervised or unsupervised) caused that chunk to be generated? 3) Perhaps more interestingly, in a single forward pass, we can tell you which training chunk led to the output of the model as well.
We do all of this for a single Steerling model, which is 8B parameters trained on 1.5T tokens. This is the first time any model of this scale has achieved this level of interpretability by design.
Would be happy to answer more questions.
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
We can attribute to exact sentences and chunks in the training data. For the first release, we are sharing only concept similarities. Over the coming weeks, we'll share and discuss how you can actually map to the exact training sentence and chunk with the model.
For a technical overview of how some of these models work, check this link out: https://www.guidelabs.ai/post/prism/
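As a rough illustration of the nearest-neighbour step described in the PRISM post (purely a sketch; the function and names below are assumptions, not the actual pipeline):

```python
import numpy as np

# Illustrative only: map a prototype (or generated chunk) back to candidate
# training chunks by nearest-neighbour search over stored chunk embeddings.
def nearest_training_chunks(query_vec, chunk_embeddings, chunk_texts, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    sims = c @ q                         # cosine similarity against every stored chunk
    top = np.argsort(-sims)[:k]
    return [(chunk_texts[i], float(sims[i])) for i in top]
```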
From reading your second link (and please tell me if I got it wrong), it sounds like it isn't actually tracking to training data but to prototypes, which are then linked a posteriori to likely sections of the training data. The attribution isn't exact, right? It's more like "these are the likely texts that contributed to one of those prototypes that produced the final answer." Specifically, the bit in PRISM titled "Nearest neighbour Search" sounds like you could have a prototype that takes from 1000 sources but 3 of them more than the others, so the model identifies those 3, but the other ones might matter just as much in aggregate?
It says that the decomposition is linear. Can you remove a given prototype and infer again without it? That would be really cool.
In PRISM, for any token the model generates, you can say it generated this token based on these sources. During training, the model is 'forced' to match all the prototypes to specific tokens (or groups of tokens) in the data. The prototype itself can actually be exactly matched to a training data point. Think of it like clustering: the prototype is a stand-in for training data that looks like that prototype, and we force (and know) how much the model will rely on that prototype for any token the model generates.
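As a toy illustration of the linear decomposition (and of the "remove a prototype and infer again" question above), with made-up numbers rather than anything from the actual model:

```python
import numpy as np

# Toy linear prototype decomposition: the logits for a token are a weighted
# sum of per-prototype contributions (weights = how much the model relied on
# each prototype for this token).
rng = np.random.default_rng(0)
n_prototypes, vocab = 5, 10
proto_logits = rng.normal(size=(n_prototypes, vocab))   # each prototype's "vote"
weights = np.array([0.45, 0.30, 0.15, 0.07, 0.03])      # reliance per prototype

logits = weights @ proto_logits
print("predicted token:", logits.argmax())

# "Remove a given prototype and infer again": zero its weight and re-decode.
ablated = weights.copy()
ablated[0] = 0.0
print("without prototype 0:", (ablated @ proto_logits).argmax())

# The long-tail point from the comment above: prototypes 2..4 individually
# contribute little, but their combined weight (0.25) is not negligible.
```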
The demo in the post is not as granular because we don't want to overwhelm folks. We'll show granular attribution in the future.
I agree that attribution is most useful for debugging and auditing. This is a prime use case for us. We have a post with exciting results lined up for this. Should be out in a week; we wanted to get the initial model out first :)
The demo just says "Wikipedia" or "arXiv". That's pretty broad and maybe not that useful. Can it get more specific than that, like the actual pages?
How granular can you get the source data attribution? Down to, say, individual Wikipedia topics? Probably not URLs?
Would be interested to see this scale to 30B/70B.
Having said that, I worry that you run into Illusion of Consciousness issues where the model changes attribution from "sandbagging" to "unctuous" when you control its response, because the response is generated outside of the attribution modules (I don't quite understand how cleanly everything flows through the concept modules and the residual). Either way, this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.
Example — MIT:
> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions
I haven't looked at the project that much, but this could be exciting to me if these two things could get merged.
I don't think that license is necessarily the problem in here. Licenses can change and people can adopt new licenses.
The key design choice is that it does not automate anything. The agent surfaces a prompt, the user decides yes or no. No bulk starring, no forced actions. The spec also deliberately stays out of licensing territory. It is purely a social recognition layer.
It is at v0.1 and no agent supports it yet, but the spec and schema are published and open for feedback: https://github.com/attributionmd/attribution.md
Oh? You are under the impression that software gets the same rights and privileges of humans?
Or maybe you are under the impression that you are so special that you face no danger from having no income because the models already ingested all your work and can launder it effectively?
IMO, nothing and nobody starts out original. We need copying to learn, to build a foundation of knowledge and understanding. Everything is a copy of something else (or put another way, art is more like a sum of your influences). The only difference is how much is actually copied, and how obvious it is.
And in the US at least, from a legal perspective, this "how obvious is it" subjective test is often one way that copyright disputes are settled.
For example there have been many cases of similar sounding songs that either did in fact draw an influence from an existing track (whether consciously or not), or were more likely just coincidental... but courts have ruled both ways in such cases, even if they sound extremely similar.
But there might be bad, malicious articles on ArXiv, so it doesn't really say anything about veracity.
Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.
Given the example I saw about CRISPR, what does this model give over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to show the paragraph or sentence that influences the answer?
I believe that the plagiarism complaint about llm models comes from the assumption that there is a one-to-one relationship between training and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.
It can tell you which specific text (chunk) in the training data led to the output the model generated. We plan to show more concrete demos of this capability over the coming weeks.
It can tell you where in the model's representation it learned about science, art, religion, etc. And you can trace all of these to either the input context, the training data, or the model's representations.
That's still firmly in divination land.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE's activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised and applied post hoc. Here, we supervise 33k of the concepts with concepts that we carefully curated. We then have an unsupervised component (similar to a top-k SAE) that we constrain to be independent of the supervised concepts. We don't do any of this post hoc, by the way; this is a key constraint. I'll get back to this. We train that unsupervised layer along with the model during pre-training.
2) Are the concepts or features causally influential for the output? We directly use the combination of the concepts for the LM head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related; it is constrained to be so.
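For intuition, here is what that closed-form attribution looks like if you ignore the activation and treat the head as purely linear (toy numbers and assumed names, not our actual shapes):

```python
import numpy as np

# If logits = concepts @ W_head, each concept's effect on a token's logit is
# just its activation times the corresponding head weight -- no approximation.
rng = np.random.default_rng(1)
n_concepts, vocab = 8, 100
W_head = rng.normal(size=(n_concepts, vocab))
concepts = np.abs(rng.normal(size=n_concepts))   # concept activations at one position

token = 42
contributions = concepts * W_head[:, token]      # per-concept effect on that logit
assert np.isclose(contributions.sum(), concepts @ W_head[:, token])
print("top concept for token 42:", contributions.argmax())
```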
3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and what data, caused your model to learn a particular feature.
This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs:
- If you train an SAE twice, on the same data + model, you'll get two different sets of features.
- In fact, there is no reason why the model should pick features that are causally influential for the output.
- ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model.
Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.
Happy to more concretely justify the above. Great observations!
Perhaps this could be a step in that direction, if we can associate the attribution with a likelihood of being true; e.g., arXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I'm guessing it would still be attributing it to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.
The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?
If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.
We'll see.
Just to give you some answers for what we can do:
1) We can find the training data that is causing a model to output toxic/unwanted text and correct it.
2) We know what high-level concepts the model is relying on for any group of tokens it generates, hence reducing that generation is as simple as toggling the effect of that concept on the output (rough sketch below).
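A minimal sketch of what toggling a concept amounts to (assumed names and toy shapes, not our actual code):

```python
import numpy as np

# "Toggling" a concept: scale or zero its activation before the LM head and
# decode again -- no fine-tuning involved. `concept_idx` is whichever concept
# you want to dampen (e.g. a toxicity concept).
def steer(concepts, W_head, concept_idx, scale=0.0):
    edited = concepts.copy()
    edited[concept_idx] *= scale      # 0.0 removes the concept, >1.0 amplifies it
    return edited @ W_head            # logits re-decoded with the edited concepts

rng = np.random.default_rng(2)
concepts, W_head = np.abs(rng.normal(size=6)), rng.normal(size=(6, 50))
print("before:", (concepts @ W_head).argmax(), "after:", steer(concepts, W_head, 3).argmax())
```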
Most of the AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning. You can toggle the presence of a concept directly.
For example, wouldn't you like to know why a model is being sycophantic? Or Sandbagging? Is it a particular kind of training data that is causing this? Or is it some high level part of the model's representations? For any of this, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
I feel this would give insight into all that, including the degree of true conceptualisation. I'm curious if this can demonstrate what else the model is aware of when answering, too.
We can also trace its behavior to the training data that led to it, so that can show us where some of these concepts are formed from.
Actually, emphatically no. The only thing I care about is that I have recourse. It shouldn't matter the reason; in fact, explainability can be an impediment to accountability. It's just another plausible barrier to a remedy that a bureaucracy can use to deny changing a decision.