OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper presents a regularizer for sparse autoencoders. It encourages orthogonality between the weights of the different latents by:
1) sampling random blocks of latents, and
2) computing and penalizing the cosine similarity between the weights of different latents within each block.
This method is cheap (a stated advantage over Matryoshka-SAEs).
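For concreteness, here is a minimal sketch of the penalty as I understand it; the tensor layout, the names `W_dec` and `chunk_size`, and the clamping of negative similarities are my own assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(W_dec: torch.Tensor, chunk_size: int = 8192) -> torch.Tensor:
    """Penalize high cosine similarity between latent weights within random blocks.

    W_dec: (n_latents, d_model) decoder weight matrix (assumed layout).
    Returns the mean, over latents, of each latent's maximum cosine similarity
    to any other latent in the same randomly sampled block.
    """
    n_latents = W_dec.shape[0]
    W = F.normalize(W_dec, dim=1)                       # rows become unit vectors, dots = cosines
    perm = torch.randperm(n_latents, device=W.device)   # random block assignment
    penalties = []
    for start in range(0, n_latents, chunk_size):
        block = W[perm[start:start + chunk_size]]       # (c, d_model)
        gram = block @ block.T                          # (c, c) pairwise cosines
        gram.fill_diagonal_(-1.0)                       # ignore self-similarity
        penalties.append(gram.max(dim=1).values)
    return torch.cat(penalties).clamp(min=0).mean()
```

Because each latent is only compared against the other latents in its block, the cost grows linearly with the number of latents (times the block size) rather than quadratically.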
In experiments, it seems effective at reducing feature absorption and composition, which is the stated motivation, and performs competitively in other regards.
The problem is significant and timely: sparse autoencoders are an active area of research, and this work attacks substantial, qualitative issues with common approaches.
I expect a method like this to be a standard baseline for other approaches, as it is simple, cheap, and appears effective according to various evaluations.
The paper is well written, with only a few minor issues.
(moderate): When I google "orthogonal autoencoder", I find several prior works, e.g. "Orthogonality-Enforced Latent Space in Autoencoders: An Approach to Learning Disentangled Representations", Cha & Thiyagalingam, ICML 2023. I am not sure how similar any of these works actually are, but the submission should cite this and any other such works in any case, e.g. to reassure the reader if they are in fact quite different. If they are substantively similar, I don't think that is a major problem for this work, but it needs to be noted.
(moderate): Figure 5 (and Appendix I) appear to be missing comparable decomposition results for the other methods; unless I misunderstand, there is no side-by-side comparison included at the moment.
(moderate): The technique seems like a blunt tool, and I am not sure how competitive such an approach will prove to be compared to alternatives that tackle the problems more head-on. The paper does not explain in any detail how encouraging orthogonality is expected to change what the SAE ends up learning; a more thorough analysis would be welcome. The submission states: "Feature absorption and composition lead to redundant representations where multiple latents capture overlapping concepts, which results in high cosine similarities between them. This suggests that enforcing orthogonality between SAE latents could provide a principled approach to mitigate these issues." The first sentence of this quotation is unsupported, and the second does not sound very principled: as described, the method treats a symptom of the problem rather than the problem itself. I believe the submission could strengthen the motivation here.
(minor): The statement ‘Sparse Autoencoders (SAEs) have gained significant traction for interpreting LLMs, addressing the challenge that LLMs often function as “black boxes”’ is vague and suggests a higher level of effectiveness in addressing the “black box” issue than I think is warranted.
(minor): I didn't find the descriptions of feature absorption very clear and had to consult the referenced paper to understand the concept. Relatedly, Section 3.2 felt repetitive with earlier content. I recommend explaining the issue more thoroughly there and editing to remove redundant content.
(nit): “penalize high cosine similarities between SAE latents”: it would make more sense to say that you are penalizing similarities between the weights.
(typo): Line 420: "Appx. D" should link to Appendix I.
Do you have any explanation of the qualitatively different results on different downstream benchmarks (Figure 6)?
Is the Cross-Model Feature Overlap Analysis sensitive to the threshold of 0.2?
In Figure 3d, the performance of the different methods differs considerably at high sparsity, with the ReLU-SAE performing the best and OrtSAE the worst. Do you have an explanation for this? Do you think this is a potential problem for the method?
Isn't the ultimate goal interpretability, with absorption and composition being relevant only inasmuch as they make the results less interpretable? Given that OrtSAE only achieves comparable levels of interpretability according to Figure 3d, I wonder whether it is actually achieving the aim of the work.
“The main contribution of OrtSAE is the introduction of a new orthogonalization penalty” -- Are there others?
Since the method is introduced as an approximation, it would be interesting to run experiments without the approximation to see whether it in fact yields better results.
Is it necessarily the case that "Feature absorption and composition" result in "high cosine similarities between [latents]"?
Fully human-written
---
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes Orthogonal Sparse Autoencoders, which add an orthogonality penalty on decoder atoms to mitigate absorption and composition while keeping a classic SAE training loop. The penalty acts on the maximum cosine similarity within random chunks of the dictionary, which makes the cost near-linear in the number of atoms. Experiments on Gemma 2 and Llama 3 layers compare OrtSAE to ReLU SAE, BatchTopK, and Matryoshka. Core metrics show slightly lower explained variance than BatchTopK but improved mean nearest-neighbor cosine, reduced composition, and reduced absorption, with SAEBench reporting a gain on spurious correlation removal and broadly similar performance elsewhere. Figures 1, 3 and 4 plus Appendix C document these effects and the chunking tradeoff.
The paper is well written (I would like to thank the authors for that), and I noticed some really positive points (P) that I will describe here:
P1. Clear objective that is easy to implement. Section 3.3 gives a concrete loss with a practical chunk strategy, and the design slots into existing BatchTopK code paths. The result is near-random-initialization levels of orthogonality on the mean nearest-neighbor decoder cosine in Figure 3c.
P2. Sensible empirical sweep. The paper reports reconstruction and KL scores, interpretability via Autointerp, atomicity via composition and absorption, and downstream SAEBench. Figures 3 and 4, plus layer and chunk ablations in Appendix B and C, help readers understand where the gains come from.
P3. Useful qualitative decompositions. Figures 5, 8, 9 and 10 illustrate how a composite or overly narrow BatchTopK feature can be expressed as a sparse mix of more atomic OrtSAE features.
Now, even though I liked the paper, in my opinion there are still some weaknesses, some major (M) and some minor (m), which I will detail here:
M1. Reconstruction versus sparsity does not guarantee the right concept basis. The paper still optimizes explained variance plus sparsity and adds orthogonality, but none of these objectives ensure recovery of the correct semantic axes. I think SAEBench goes in the right direction here, but I would also like to see simple synthetic data with planted factors, to test subspace recovery and concept identifiability.
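To make concrete what I have in mind, a minimal synthetic check could look like the sketch below; the data-generating choices (dimensions, sparsity level, the matching metric) are purely illustrative assumptions on my part.

```python
import torch
import torch.nn.functional as F

def make_planted_data(n: int, d: int = 64, m_true: int = 128, k: int = 4,
                      seed: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    """Synthetic activations: each sample is a sparse positive combination of k planted factors."""
    g = torch.Generator().manual_seed(seed)
    D_true = F.normalize(torch.randn(m_true, d, generator=g), dim=1)  # planted dictionary
    codes = torch.zeros(n, m_true)
    for i in range(n):
        idx = torch.randperm(m_true, generator=g)[:k]
        codes[i, idx] = torch.rand(k, generator=g) + 0.5
    return codes @ D_true, D_true

def recovery_score(W_dec: torch.Tensor, D_true: torch.Tensor) -> float:
    """Mean best-match cosine similarity between each planted factor and the learned latents."""
    W = F.normalize(W_dec, dim=1)
    D = F.normalize(D_true, dim=1)
    return (D @ W.T).max(dim=1).values.mean().item()
```

Training each SAE variant on such data and reporting the recovery score (and the number of planted factors matched above some threshold) would speak directly to identifiability.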
M2. The phenomenology of orthogonality is under-motivated. Section 3 argues that absorption and composition correlate with high decoder cosine and that orthogonality should therefore help, which is plausible, but there is a broader literature on representation geometry that could predict when orthogonality should or should not help. A short review and positioning would help justify the method theoretically, and it would also guide where to apply it in depth.
M3. Global orthogonality versus conditional orthogonality at inference. The current loss pushes all decoder atoms apart. For downstream use, what we actually need is low mutual coherence among the subset of atoms that activate together on a given input. Please consider a conditional version that penalizes the Gram matrix of only the active atoms per batch, or an evaluation that measures conditional mutual coherence at inference. This could preserve reconstruction while targeting the failure mode more precisely.
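Concretely, the kind of conditional penalty I have in mind might look like the following sketch; the names and the choice of penalizing squared off-diagonal Gram entries are my own assumptions, not something proposed in the submission.

```python
import torch
import torch.nn.functional as F

def conditional_orthogonality_penalty(W_dec: torch.Tensor,
                                      latent_acts: torch.Tensor) -> torch.Tensor:
    """Penalize coherence only among atoms that co-activate on a given input.

    W_dec:       (n_latents, d_model) decoder weights (assumed layout).
    latent_acts: (batch, n_latents) post-TopK activations; nonzero entries mark
                 the active atoms for each sample.
    """
    W = F.normalize(W_dec, dim=1)
    penalties = []
    for acts in latent_acts:
        active = acts.nonzero(as_tuple=True)[0]
        if active.numel() < 2:
            continue
        sub = W[active]                                    # active atoms only
        gram = sub @ sub.T                                 # (k, k) Gram matrix
        off_diag = gram - torch.eye(active.numel(), dtype=gram.dtype, device=gram.device)
        penalties.append((off_diag ** 2).mean())
    return torch.stack(penalties).mean() if penalties else W_dec.new_zeros(())
```

The same statistic, computed without gradients at evaluation time, would serve as the conditional mutual coherence measurement I am asking for.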
M4. To continue on this, you should use the Babel score [1]. Since the paper cares about sets of correlated atoms, Babel and cumulative coherence are natural complements to the mean nearest-neighbor cosine. Please report Babel on the whole dictionary and on active subsets during inference, and compare to BatchTopK and Matryoshka at matched sparsity. This will also clarify whether orthogonality is achieved where it matters most.
[1] Greed is good: Algorithmic results for sparse approximation. Tropp, 2004.
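For reference, the Babel function $\mu_1(k)$ from [1] can be computed directly from the normalized Gram matrix; a sketch of my own implementation (not from the submission):

```python
import torch
import torch.nn.functional as F

def babel_score(W_dec: torch.Tensor, k: int) -> float:
    """Babel function mu_1(k): worst-case cumulative coherence of any atom
    against its k most correlated other atoms (Tropp, 2004)."""
    W = F.normalize(W_dec, dim=1)
    gram = (W @ W.T).abs()
    gram.fill_diagonal_(0.0)                             # exclude self-correlation
    cumulative = gram.topk(k, dim=1).values.sum(dim=1)   # sum of top-k correlations per atom
    return cumulative.max().item()
```

Restricting the Gram matrix to the atoms active on a given input before taking the top-k would give the conditional variant discussed in M3.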
M5. Missing baseline on MP-SAE. Matching pursuit and orthogonal matching pursuit induce a conditional orthogonality effect at selection time, which is directly relevant to absorption and composition. As such, MP-SAE [2] is extremely relevant here. Without it the geometric contribution is hard to isolate.
[2] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. Costa et al.
M6. Connection to Grassmannian frames and the linear representation hypothesis. If the intended target is a dictionary that minimizes mutual coherence for a given size, this is close to Grassmannian frames [3] or even equiangular tight frame objectives [4]. The paper could formalize this perspective, relate the proposed loss to a relaxation of Grassmannian frame construction, and test whether the learned dictionaries move toward frame-like spectra. A useful justification would be: if the linear representation hypothesis holds at the chosen layer, then an approximately Grassmannian decoder is preferred, and hence the proposed loss is a practical surrogate. Empirically, the singular value spread of W_dec and the off-diagonal structure of the Gram matrix could be tracked across training.
[3] On the structures of Grassmannian frames. Haas et al.
[4] On the existence of equiangular tight frames. Sustik et al.
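As a concrete reference point for M6: for a dictionary of $m$ unit-norm atoms in $\mathbb{R}^d$ with $m > d$, the mutual coherence is lower-bounded by the Welch bound,
$$\mu \;=\; \max_{i \neq j} |\langle w_i, w_j \rangle| \;\geq\; \sqrt{\frac{m - d}{d\,(m - 1)}},$$
and Grassmannian frames are by definition the dictionaries that minimize $\mu$ (equiangular tight frames attain the bound when they exist). Comparing the learned decoder's coherence against this bound would show how much headroom the orthogonality penalty leaves.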
M7. Explain why reduced cosine should yield better concepts rather than just different ones. Figure 3c shows a much lower nearest-neighbor cosine and Figure 4 shows better atomicity metrics, but a causal story is missing. Please add targeted interventions where you hold reconstruction constant and vary the orthogonality weight; as it stands, the argument feels somewhat circular.
M8. Scope of downstream improvements is modest. SAEBench shows a clear gain on spurious correlation removal but other tasks are similar to baselines or favor Matryoshka. I would not expect large universal gains from pure geometry, but the paper could better delineate which tasks benefit and why, possibly linking to the conditional orthogonality hypothesis above. See Figure 6.
Now for the minor points:
m1. Consider reporting conditional Gram statistics. For each batch, compute the Gram matrix of the active atoms and report the distribution of its off-diagonal entries. This directly measures the property the method aims to improve.
m2. Clarify the chunk size choice. Section 3.3 sets $K = \lceil m / 8192 \rceil$. A short sensitivity table on chunk size versus reconstruction and atomicity could help, though I agree this would be a bonus.
m3. Add a short note on the slight explained variance drop. Figure 3a shows OrtSAE is a little below BatchTopK. A one-paragraph discussion of when that trade-off is acceptable for users focused on interpretability would be helpful.
Cf. major points M1-M8 and minor points m1-m3.
Fully human-written
---
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper tries to address the symptoms of feature absorption and feature composition by adding an orthogonality loss to SAE training. This loss term can be added to any SAE architecture, and penalizes the max cosine similarity of each latent with all other latents. The paper provides an optimized version of the algorithm that allows this to scale linearly with decoder size rather than quadratically. The paper shows that BatchTopK SAEs trained with this orthogonality loss achieve better scores on absorption and several other downstream metrics compared to standard SAEs, and are competitive with Matryoshka SAEs.
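For reference, my reading of the added term (in its chunked, optimized form), using my own notation with $w_i$ the decoder direction of latent $i$ and $C(i)$ the randomly sampled chunk containing it, is roughly
$$\mathcal{L}_{\text{ort}} \;\approx\; \frac{1}{m}\sum_{i=1}^{m} \max_{j \in C(i),\, j \neq i} \cos(w_i, w_j),$$
so the number of pairwise comparisons grows with the chunk size times the number of latents rather than quadratically.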
- The logic behind the technique makes sense, and the training optimization also makes sense
- I like the MetaSAEs-based compositionality score, that seems like a good contribution too.
- The technique is simple and can easily be tacked on to existing SAE architectures
- The metrics are impressive, and this seems like a good architecture to add to the SAE toolbox
- The LLM SAE training details all look very reasonable (enough tokens, reasonable width, reasonable dataset choice, reasonable learning rate, etc...)
- The paper does not explore the sensitivity of the SAEs to the parameter $\gamma$. In the paper, $\gamma$ is set to 0.25, but it's not clear how important this specific value is or how it was chosen.
- The paper makes an implicit assumption that "true features" should be very close to orthogonal, but there is some evidence this is not always the case. E.g. there has been work showing that days of the week are represented on a circle in a 2D plane [1], so the cosine similarity between these feature directions will naturally be high (see the short calculation after this list). Regardless, this assumption should be made explicit and its implications should be discussed.
- The plots don't have error bars or stdev. This isn't a huge deal for plots where there's an obvious trend, but for bar plots or plots that look noisy (e.g. sparse probing) this would be helpful to have.
- It's unclear when one would choose to use OrtSAEs over Matryoshka SAEs based on the results.
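To put a number on the circular-features point above: if the seven day-of-week directions are equally spaced on a circle within a 2D subspace, adjacent directions are separated by an angle of $2\pi/7 \approx 51.4^\circ$, so their cosine similarity is $\cos(2\pi/7) \approx 0.62$, which is far from orthogonal; a penalty on nearest-neighbor cosine would actively push against such a geometry.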
### References
[1] Engels, Joshua, et al. "Not All Language Model Features Are One-Dimensionally Linear." The Thirteenth International Conference on Learning Representations.
- Is there a "rule of thumb" for how to set $\gamma$ ? What happens if it's set too high, and what is a reasonable range for it? As a practitioner training OrtSAEs this is very important to know. I'm worried that setting this incorrectly can easily break the SAE.
- Figure 1 feels a bit misleading since it's not showing Matryoshka SAEs. Matryoshka SAEs are the obvious comparison to make. It feels like this was left out because it makes OrtSAEs look less impressive when compared with Matryoshka.
- What are the group sizes used for the Matryoshka SAEs in this paper? I didn't see them specified anywhere, including the appendix. I think the number of groups and the size of the groups will also impact the results for Matryoshka SAEs.
Fully human-written |