I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
I agree that attribution is most useful for debugging and auditing. This is a prime usecase for us. We have a post with exciting results lined up to do this. Should be out in a week, we wanted to even just get the initial model out :)
Just to give you some answers for what we can do:
1) We can find the training data that is causing a model to output toxic/unwanted text and correct it. 2) We know what high level concepts the model is relying on for any group of tokens it generates, hence, reducing that generation is as simple as toggling the effect of the output on that concept.
Most of the AI safety techniques fall under finetuning. Our model allows your to do this without fine-tuning. You can toggle the presence of .
For example, wouldn't you like to know why a model is being sycophantic? Or Sandbagging? Is it a particular kind of training data that is causing this? Or is it some high level part of the model's representations? For any of this, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
I feel this would give insight into all that including the degree of true conceptualisation. I’m curious if this can demonstrate what else the model is aware of when answering, too.
We can also trace its behavior to the training data that led to it, so that can show us where some of these concepts are formed from.
SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.
It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the other that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers and the ways they compose can completely confuse SHAP.
It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.
I don't know the specifics of what this particular model's approach is.
But SHAP unfortunately does not work for LLMs at all.
Here is what this model does: it `rewrites` the model's activations (during pre-training) into supervised + unsupervised concepts that are then decoded into tokens. So at pre-training, we constrained the model with 33k supervised concepts (e.g., sports, toxicity, alignment, demographic variables), and then have more (101k) unsupervised concepts for the model to learn as well.
Overall, the architecture and loss functions of this model allow you to answer the following questions: 1) Which token in the context caused a chunk (group of tokens) to be generated? 2) which high level concept (supervised or unsupervised) caused the 3) perhaps more interestingly, in a single forward pass, we can tell you which training chunk led to the output of the model as well.
We do all of this for the single steerling model which is 8B parameters trained on 1.5T tokens. First time any model of this scale has achieved this level of interpretability by design.
would be happy to answer more questions.
Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.
The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?
If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.
Given the example I saw about CRISPR, what does this model give over a different, non explaining model in the output ? Does it really make me more confident in the output if I know the data came from Arxiv or Wikipedia ?
I find the LLM outputs are subtlety wrong not obviously wrong
In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to show paragraph or sentence level that influences the answer ?
I believe that the plagiarism complaint about llm models comes from the assumption that there is a one-to-one relationship between training and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.
It can tell you which specific text (chunk) in the training data that led to the output the model generated. We plan to show more concrete demos of this capability over the coming weeks.
It can tell you where in the model's representation it learned about science, art, religion etc. And you can trace all of these to either to input context, training data, or model's representations.
curious how the performance compares to a standard llama 8b on benchmarks - interpretability usually comes with a quality tax.
Here are a bunch of questions you can answer without any quality degradation with interpretable models: 1) what part of the input context led to the output chunk that the model generated? 2) what part of the training data led to the output chunk?
In this case, we go more invasive, and actually constrain the model to also use human understandable concepts in its representations. You might think this leads to quality trade-offs. However, if you allow for the model to discover its own concepts as well (as long as they are not duplicates of the concepts you provided it), you don't see huge degradation.
I agree with the other commenters that this now gives us a huge boost in debugging the model.
One side effect that we are excited about is that interpretable model training might make for a data efficient training process.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE'S activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
1) The is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised, and applied post hoc. Here, we supervise 33k of the concepts with concepts that we carefully curated. We then have an unsupervised component (similar to a topk SAE) that we constrain to be independent from the supervised concepts. We don't do any of this post hoc by the way; this is a key constraint. I"ll get back to this. We train that unsupervised layer along with the model during pre-training.
2) Are the concepts or features causally influential for the output? We directly use the combination of the concepts for the lm head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related, it is constrained to do so.
3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and what data, caused your model to learn a particular feature.
This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs. - If you train an SAE twice, on the same data + model, you'll get two different feature(s). - In fact, there is no reason, why the model should pick features that are causally influential for the output. - ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model. Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.
Happy to more concretely justify the above. Great observations!
We'll see.