ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (100%) | 4.00 | 3.67 | 2335 |
| Total | 3 (100%) | 4.00 | 3.67 | 2335 |

Review 1
Title: Outrageously Large Context Windows via RACE Attention -- A Family of Non-Linear Attention that can be calculated in Strictly Linear-Time
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully human-written

Summary:
This paper introduces RACE Attention, a method to address the quadratic time and memory complexity of standard softmax attention. The authors propose replacing the exponential softmax kernel with a high-degree monomial of an angular (cosine) similarity kernel. This specific kernel choice allows them to leverage Locality Sensitive Hashing (LSH) and Repeated Arrays-of-Count Estimators (RACE) sketches to compute the attention output in linear time and space complexity.

Strengths:
1. The primary contribution and strength of this paper are the scaling results. Figure 5 shows that RACE on a CPU can outperform FlashAttention on a high-end GPU at massive sequence lengths, which is a compelling demonstration of the algorithm's advantage over hardware acceleration.
2. The paper is well-written and easy to follow.
3. The theoretical analysis also provides a nice bias-variance trade-off for their approach.

Weaknesses:
1. The paper seems to be lacking some important baselines. The authors compare their results to FlashAttention; however, FlashAttention 2 and 3 are also available, perform much faster, and are not included in the comparison. Moreover, the paper focuses on alternatives to softmax yet lacks, for example, a comparison to Sigmoid Attention, which also provides a simple kernel implementation.
2. The paper is a bit vague and ambiguous about its main algorithm. The authors argue that they use the cosine kernel to avoid the exponential of softmax and to be able to use the RACE sketch. However, it seems that Algorithm 1 is still trying to implement softmax. Am I misunderstanding this? Technically, it seems that the connection between the features $\phi$ and the angular attention is never clearly made.

Questions:
1. Can the authors elaborate on how to choose $\gamma$? Would it be through a hyperparameter search, or is there a principled way of approximating a good value for it?
2. One more question on $\gamma$: could the authors provide a sensitivity analysis of how the final result changes with respect to small changes in $\gamma$? Perhaps another useful figure would be to use the data from Figure 2 and plot the distribution of the attention distances between softmax and the angular attention to see how it varies as $\gamma$ is changed.
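On weakness 2 above, the missing link between the features $\phi$ and the angular kernel can at least be sketched from standard LSH facts. The following is a generic construction using signed random projections; the paper's exact parameterization (e.g., the role of $\gamma$) may differ.

$$\Pr_{w \sim \mathcal{N}(0, I_d)}\big[\operatorname{sign}(w^\top q) = \operatorname{sign}(w^\top k)\big] = 1 - \frac{\theta(q, k)}{\pi}, \qquad \theta(q, k) = \arccos\!\left(\frac{q^\top k}{\lVert q\rVert\,\lVert k\rVert}\right).$$

Concatenating $p$ such sign bits into a single hash $h$ gives collision probability $\kappa_p(q, k) = \big(1 - \theta(q, k)/\pi\big)^{p}$, a monomial of angular similarity. With $L$ independent hashes $h_1, \dots, h_L$ and the scaled one-hot bucket-indicator features $\phi(x) = \tfrac{1}{\sqrt{L}}\big[e_{h_1(x)}; \dots; e_{h_L(x)}\big]$, one gets $\mathbb{E}\big[\langle \phi(q), \phi(k)\rangle\big] = \kappa_p(q, k)$, so a sketch that accumulates $\sum_j \phi(k_j)\, v_j^\top$ can estimate the unnormalized attention output $\sum_j \kappa_p(q, k_j)\, v_j$ in time linear in the sequence length.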
Review 2
Title: Outrageously Large Context Windows via RACE Attention -- A Family of Non-Linear Attention that can be calculated in Strictly Linear-Time
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

Summary:
This paper introduces a novel linear-time attention mechanism. The approach replaces the exponential softmax kernel with a monomial kernel -- the cosine similarity raised to a power -- enabling approximation through randomized projections. By leveraging angular similarity and Locality-Sensitive Hashing, the authors propose an efficient algorithm that enables outrageously large context windows of up to 75 million tokens on CPUs and 12 million on GPUs.

Strengths:
1. This method enables linear-time and memory-efficient attention that scales to tens of millions of tokens on standard hardware, which is impressive.
2. The algorithm is simple, differentiable, and can serve as a drop-in replacement for softmax attention.

Weaknesses:
1. **This paper is very similar to YOSO [1] (for example, the similarity between equations (1) and (2) in the text, the use of LSH in estimating the similarity function, and the algorithm for estimating attention outputs via hash tables), but this paper does not discuss and contrast with [1].**
2. The experiments only show model accuracy on short sequence lengths (< 8K). What about longer sequences?
3. The efficiency results in Figure 3 are not very meaningful, as any linear attention can be made extremely efficient by tuning its hyperparameters. For example, for $\phi(Q) \phi(K)^T$-type attention, setting the output dimension of $\phi$ to 1 lets its efficiency beat any other method. To show efficiency, the runtime and memory results should be coupled with the corresponding accuracy results.
4. Figure 5 has the same issue: what about the accuracy?

**If the authors can address my concerns, I am willing to raise my score.**

[1] Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling. ICML 2021.

Questions:
See the weaknesses section.
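Weakness 3 above refers to the generic feature-map family of linear attention. A minimal sketch (hypothetical feature map, names, and dimensions, not the paper's method) shows why runtime alone is uninformative: cost scales with the feature dimension m, so shrinking m buys speed at the expense of approximation quality.

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    """Generic phi(Q) phi(K)^T attention computed without the n x n matrix.

    Q, K, V: (n, d) arrays; phi maps an (n, d) array to (n, m) features.
    Cost is O(n * m * d): phi(K)^T @ V is contracted first into an (m, d)
    block, then multiplied by phi(Q), so runtime scales with m, not n^2.
    """
    Qf, Kf = phi(Q), phi(K)                    # (n, m) feature maps
    KV = Kf.T @ V                              # (m, d), linear in n
    Z = Qf @ Kf.sum(axis=0)[:, None]           # (n, 1) normalizer
    return (Qf @ KV) / np.clip(Z, 1e-6, None)

# Hypothetical positive random-feature map; its output dimension m is the
# knob the review refers to (m = 1 is extremely fast but a useless estimate).
def random_features(X, W):
    return np.exp(X @ W)                       # (n, m)

rng = np.random.default_rng(0)
n, d, m = 4096, 64, 256
Q, K, V = (rng.standard_normal((n, d)) / d ** 0.5 for _ in range(3))
W = rng.standard_normal((d, m)) / d ** 0.5
out = linear_attention(Q, K, V, lambda X: random_features(X, W))
print(out.shape)                               # (4096, 64)
```

Any accuracy comparison therefore only makes sense at a fixed feature budget m (or, for sketch-based methods, a fixed number of hashes and buckets), which is the coupling of runtime and accuracy the review asks for.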
Review 3
Title: Outrageously Large Context Windows via RACE Attention -- A Family of Non-Linear Attention that can be calculated in Strictly Linear-Time
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

Summary:
This paper describes RACE attention as a linear-time alternative to softmax attention for very long contexts. The main idea is to replace softmax with powers of angular similarity, and then approximate this term using RACE sketches. To do this, the algorithm uses soft LSH so that it is differentiable. This achieves linear complexity, versus quadratic for standard attention, as is common in most methods for self-attention approximation. What is nice is that the experiments are broad and cover language modeling, masked LM, and classification. In this context, scaling experiments show processing of tens of millions of tokens on CPU and GPU for a single attention layer's forward-backward pass. This will be the main highlight of this work for most readers.

Strengths:
1. The scaling experiments are quite impressive. Regardless of my other comments below, this is a good practical contribution. Also, it is interesting that CPU-based RACE is viable and in some regimes can do better than FlashAttention. This point about algorithmic efficiency versus hardware acceleration could really be a main message of the paper (more on this below). In any case, reaching 50M/75M tokens is definitely a strength (but in the current version of the paper, this comes with some caveats).
2. The experimental breadth is very good. Both CPU and GPU kernels, with OpenMP, are mentioned. This is a strong engineering effort, and if code is provided, it can benefit many groups working in this area.
3. Experimental verification of how increasing the degree can mimic exponential behavior in this setting is useful. Some analysis of the bias-variance trade-off is included to guide the choices in the sketching component. This is all good.

Weaknesses:
1. I am a bit confused by the numerous instances of "stress test", and therefore it is unclear what the scaling experiments actually show. When stress testing one forward-backward pass of the multi-head attention layer, is this timing a single layer, not end-to-end model training? If so, is the 75M-token claim for one attention operation rather than for training the full model? Is this paper only benchmarking the primitive, or does any model work at these lengths? The reason for this question is the title "outrageously large context windows" -- is this only for the stress tests? The most reasonable reading of the title suggests full-model capability.
2. I am having trouble understanding the tables on page 8. Is angular attention expected to be better than RACE?
3. The paper https://proceedings.mlr.press/v139/zeng21a.html uses related ideas and also seems motivated by similar upstream papers. Another one is https://aclanthology.org/2022.iwslt-1.4.pdf. The positioning of this work on pages 4-5 should at least describe how it differs from these.

Questions:
1. Minor: Is adapting the analysis to causal masking relatively easy (but not yet worked out), or does one run into problems?
2. Check some of the references above; there may be others.
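To make the summary above concrete, here is a minimal sketch of a RACE-style estimator with a softened hash assignment. The hash family (signed random projections), the sigmoid relaxation, and all names are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def soft_srp_buckets(X, W, temperature=1.0):
    """Soft assignment of rows of X to 2**p buckets cut by p random hyperplanes.

    W: (d, p) Gaussian projections. The hard assignment would be the sign
    pattern of X @ W; each bit is relaxed to a sigmoid so the pipeline stays
    differentiable (a generic relaxation, not necessarily the paper's soft LSH).
    Returns an (n, 2**p) matrix whose rows sum to 1.
    """
    p = W.shape[1]
    bits = 1.0 / (1.0 + np.exp(-(X @ W) / temperature))         # (n, p) in (0, 1)
    codes = (np.arange(2 ** p)[:, None] >> np.arange(p)) & 1    # (2**p, p) bit patterns
    # bucket probability = product over bits of P(bit matches the pattern)
    return np.prod(np.where(codes[None, :, :] == 1,
                            bits[:, None, :], 1.0 - bits[:, None, :]), axis=-1)

def race_attention_numerator(Q, K, V, num_sketches=8, p=4, seed=0):
    """Estimate sum_j kappa(q, k_j) v_j with L independent count arrays.

    Each sketch stores buckets(K)^T @ V, a (2**p, d) array of value-weighted
    counts; a query reads it back through its own bucket probabilities.
    For fixed (num_sketches, p), the cost is linear in the sequence length.
    """
    rng = np.random.default_rng(seed)
    d = Q.shape[1]
    out = np.zeros((Q.shape[0], V.shape[1]))
    for _ in range(num_sketches):
        W = rng.standard_normal((d, p))
        counts = soft_srp_buckets(K, W).T @ V    # (2**p, d) count array
        out += soft_srp_buckets(Q, W) @ counts   # read-back for every query
    return out / num_sketches

n, d = 2048, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(race_attention_numerator(Q, K, V).shape)   # (2048, 64)
```

The softmax-style denominator would be estimated the same way with V replaced by a column of ones, and gradients flow because the relaxed bucket probabilities are smooth in Q and K.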