ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 6.00       | 3.00           | 2914               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (25%)  | 8.00       | 3.00           | 3401               |
| Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written  | 2 (50%)  | 5.00       | 2.50           | 2102               |
| Total                | 4 (100%) | 6.00       | 2.75           | 2630               |
Individual Reviews
Review 1
Title: Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes to extend RoPE by including the imaginary component when the computation of attention weights is interpreted as multiplication of complex vectors. The main claims are:
- "... the negative imaginary part of attention, shows that, compared with the real attention exhibiting stronger semantic locality, the imaginary heads attend more to long-context information as shown in Figure 1, promising gains on long-context tasks"
- RoPE++ outperforms RoPE in the ablations with models of sizes 776M and 376M.
- One of the variants of RoPE++ reduces the KV cache by half while delivering on-par performance.

Strengths:
1. The idea is novel. The authors make the good observation that RoPE only keeps the real part when attention is interpreted as complex multiplication.
2. Good theoretical motivation and explanations.
3. The evaluation uses a good set of benchmarks and looks convincing given the training horizon (50B tokens) and the model scale, though I am skeptical about whether the gains would hold if training continued for a more realistic number of tokens.

Weaknesses:
1. The experimental setup does not seem to have enough scale to convincingly show a gain in model pre-training. 50B tokens is unfortunately sometimes too small to draw confident conclusions about certain pre-training signals, though I understand that there is typically a budget issue in academia.
2. I am skeptical about whether the setup is bug-free and uses reasonable hyperparameters, since the training of ALiBi and NoPE apparently required compromises.

Questions:
1. In the experiment section, you mention "Since ALiBi and NoPE train unstably at 4k, as we have tried, we train them on 1k context length while keeping the batch size the same." This makes me very skeptical about whether the hyperparameters you fixed are reasonable and whether your experimental setup is correct. In my experience, with reasonable effort, all of these positional encodings can be trained without problems up to tens of billions of parameters for trillions of tokens. Have you carefully debugged your experiments? Have you tracked basic training-health metrics across layers? Have you considered redoing the runs and comparing against reproducible pre-training recipes such as OLMo/Pythia to rule out regressions due to bugs and mistakes? Have you swept hyperparameters for the architectures you chose?
2. Do you have an experiment to support "the real part having better locality and the imaginary part having better dependency capture", rather than the hand-wavy point-wise argument based only on magnitudes? This could become a very interesting data point, regardless of whether the experiment supports or rejects the claim.

EditLens Prediction: Fully human-written
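For readers unfamiliar with the complex-number view this review summarizes, the following is a minimal numpy sketch (an illustration under my own assumptions, not the paper's implementation; how RoPE++ actually arranges real and imaginary heads and scales the two scores may differ). The standard RoPE attention logit is the real part of a complex inner product between phase-rotated query/key pairs; the imaginary part of the same product is the quantity the paper proposes to reuse.

```python
import numpy as np

def rope_complex_scores(q, k, m, n, base=10000.0):
    """Treat a (head_dim,) query/key pair as complex vectors, apply the rotary
    phase, and return both parts of the complex inner product.

    The real part equals the ordinary RoPE attention logit; the imaginary part
    is the extra signal that the review describes RoPE++ as reusing (sketch only).
    """
    d = q.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)   # rotary frequencies theta_j
    qc = q[0::2] + 1j * q[1::2]                      # pair adjacent dims into complex numbers
    kc = k[0::2] + 1j * k[1::2]
    z = np.sum(qc * np.conj(kc) * np.exp(1j * (m - n) * theta))  # complex inner product
    return z.real, z.imag

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
real_score, imag_score = rope_complex_scores(q, k, m=100, n=37)
```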
Review 2
Title: Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes RoPE++, an extension of Rotary Position Embeddings (RoPE) designed to enhance the modeling of long-context dependencies in Large Language Models (LLMs) by utilizing the imaginary component of RoPE. It analyses how the method affects cache and parameter efficiency and how it impacts length extrapolation.

Strengths:
- Proposes a new positional-encoding method that shows promise, especially for long context.
- Performs extensive experimentation on different datasets.
- Includes several positional-encoding methods as baselines.

Weaknesses:
- It would be great to have a baseline RoPE variant that targets long context.
- The theoretical justification of the method could be improved.

Questions:
- Could you please clarify the changes (if any) in the number of parameters, activations, and FLOPs (estimate) for the RoPE++ methods? Am I understanding correctly that the Wq parameters are shared between imaginary and real heads? What about the size of Wo?
- Considering that RoPE++ performs well on long-context tasks, it would be great to have a baseline with one of the derivative RoPE methods that target long context (e.g., Linear Position Interpolation (PI), NTK Scaling, YaRN, LongRoPE, or p-RoPE).
- Does Section 5.1 analyse inference time only? Could you add some information on how RoPE++ affects training?
- Line 193: "long context decay". Could you please clarify what distribution of queries and keys you are considering? As shown in https://arxiv.org/abs/2410.06205v1, the decay is clear if RoPE is applied to constant queries and keys, but not so much if the queries and keys are Gaussian.
- Nit: line 256, "share the same parameter": should this be "same parameters"?

EditLens Prediction: Fully human-written
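As background for the parameter/FLOPs question above, here is a generic accounting helper for a standard multi-head attention layer (a hypothetical sketch with made-up example numbers; it does not encode RoPE++'s actual head layout, which is exactly what the review asks the authors to spell out):

```python
def attn_layer_accounting(d_model, n_q_heads, n_kv_heads, head_dim, kv_bytes=2):
    """Rough per-layer accounting for a vanilla attention block: QKV/O parameter
    count and KV-cache bytes per token (biases and norms ignored)."""
    q_params = d_model * n_q_heads * head_dim
    kv_params = 2 * d_model * n_kv_heads * head_dim
    o_params = n_q_heads * head_dim * d_model
    kv_cache_per_token = 2 * n_kv_heads * head_dim * kv_bytes  # K and V, e.g. in fp16
    return {"qkvo_params": q_params + kv_params + o_params,
            "kv_cache_bytes_per_token": kv_cache_per_token}

# Purely illustrative configuration, not the paper's actual 776M/376M settings.
print(attn_layer_accounting(d_model=1536, n_q_heads=12, n_kv_heads=12, head_dim=128))
```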
Review 3
Title: Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes RoPE++, an extension of Rotary Position Embeddings (RoPE) that reintroduces the imaginary component of complex-valued attention scores, which standard RoPE discards. The authors argue that this imaginary component contains valuable phase information for long-range dependencies. They propose two variants, and the method is evaluated on models of different sizes using short-context and long-context benchmarks.

Strengths:
1. Novel perspective: Identifying and addressing the discarded imaginary component in RoPE is creative and theoretically motivated. The observation that imaginary attention captures longer-range dependencies is interesting.
2. Theoretical justification: The paper provides mathematical grounding (Equations 2-5) showing that imaginary attention follows a sine-integral characteristic curve, complementing the cosine integral of real attention.
3. Generalization: The method generalizes to diffusion/bidirectional attention models.
4. Dual efficiency benefits: RoPE++_EH achieving comparable performance with half the KV cache is valuable for long-context scenarios.
5. Comprehensive evaluation: Testing across multiple model sizes, benchmarks, and ablations (attention-pattern analysis, noise-injection experiments) strengthens the empirical validation.

Weaknesses:
1. Limited scale: Experiments only go up to roughly 700M parameters, which is quite small by modern LLM standards. It is unclear whether the benefits hold at the 7B+ scale where most practical long-context work happens.
2. Modest improvements: Performance gains are often marginal. In Table 2, RoPE++_EC only outperforms RoPE by about 1-2 points on average.
3. No plug-and-play extrapolation: The authors acknowledge (Section 5.3, Limitations) that RoPE++ does not provide direct length extrapolation like other methods, limiting its practical applicability.
4. Incomplete comparisons: There is no comparison with recent long-context methods (e.g., YaRN, which is cited but not compared against after long-context training), and training ALiBi and NoPE at 1k context vs. 4k for the others makes the comparisons less fair.
5. Incomplete ablations: There is no ablation of the contribution of the imaginary component; for example, one could use 75% imaginary and 25% real heads.

Questions:
1. Have you tested, or do you plan to validate, RoPE++ at 7B or larger scales? Are there theoretical or practical reasons to expect different behavior?
2. Can RoPE++ be combined with other long-context techniques (YaRN, sparse attention, etc.)? Have you explored this?
3. Does RoPE++ converge at a similar rate to standard RoPE? Are there any training stability issues? Could you show pre-training loss curves?
4. What happens if you only apply the imaginary part of RoPE? What would be the effects of doing this?
5. For Figure 5's noise experiment, what are the results with noise added to both components simultaneously?
6. What happens if you use different rotation angles (not -π/2) for the imaginary component?

EditLens Prediction: Fully AI-generated
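To make the sine-integral/cosine-integral claim in strength 2 concrete, here is a brief sketch of the kind of derivation involved, under the simplifying assumption (mine, not necessarily the paper's) that the per-frequency products q_j·conj(k_j) are replaced by a constant so that only the positional phases remain. With rotary frequencies θ_j = b^{-2j/d} and base b (e.g., 10000),

$$
A^{\mathrm{re}}(\Delta t) \propto \frac{2}{d}\sum_{j}\cos(\Delta t\,\theta_j) \approx \int_0^1 \cos\!\big(\Delta t\, b^{-s}\big)\,ds = \frac{\operatorname{Ci}(\Delta t)-\operatorname{Ci}(\Delta t/b)}{\ln b},
\qquad
A^{\mathrm{im}}(\Delta t) \propto \frac{2}{d}\sum_{j}\sin(\Delta t\,\theta_j) \approx \int_0^1 \sin\!\big(\Delta t\, b^{-s}\big)\,ds = \frac{\operatorname{Si}(\Delta t)-\operatorname{Si}(\Delta t/b)}{\ln b},
$$

so the real score tracks a cosine-integral curve that is maximal at Δt = 0, while the imaginary score tracks a sine-integral curve that vanishes at Δt = 0 and places relatively more weight at larger relative distances.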
Review 4
Title: Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper identifies that standard Rotary Position Embeddings (RoPE) discard the imaginary component of complex-valued dot products when computing attention scores, a component that potentially contains valuable phase information relevant to long-range dependency modeling. To address this limitation, the authors propose **RoPE++**, an extension that re-incorporates the imaginary component to enable a full complex-valued representation of attention scores, forming dual-component (real + imaginary) attention.

Strengths:
1. The proposed **RoPE++$_{EH}$** configuration achieves remarkable efficiency: it maintains the same number of attention heads as vanilla RoPE while halving the KV cache and QKV parameters, which is highly valuable for long-context LLMs where memory constraints are critical.
2. Empirically, RoPE++ demonstrates consistent advantages over vanilla RoPE across both short-context tasks (e.g., WikiText perplexity, Open LLM Leaderboard benchmarks) and long-context tasks (e.g., RULER, BABILong up to 64k context), validating its generality and effectiveness.

Weaknesses:
1. RoPE++ requires full pre-training (or continued long-context pre-training) to realize its performance gains and cannot be applied as a "plug-and-play" module; this limits its flexibility in scenarios where re-training is computationally prohibitive or unavailable.
2. The **RoPE++$_{EC}$** configuration, while delivering superior long-context performance, doubles the number of attention heads (relative to vanilla RoPE) under a fixed KV cache size, introducing additional computational overhead that may offset its performance benefits in resource-constrained settings.

Questions:
1. In Section 3.2, the authors argue that the imaginary component of RoPE++ attends to more distant positions, and Figure 1 supports this by showing that the imaginary attention's magnitude gradually becomes more prominent for Δt > 25 (facilitating long-range dependency modeling). However, the curve exhibits unexpected behavior near Δt = 0 (a sharp transition) and for Δt < 25 (the magnitude first decreases and then increases). Could the authors elaborate on whether this short-range fluctuation affects RoPE++'s ability to preserve semantic aggregation? For example, does it disrupt local semantic coherence when modeling short-range token relationships?
2. For RoPE++$_{EC}$: a single query vector (q) undergoes two positional encodings (one for real attention, one for imaginary attention) before computing dot products with keys (k). Do the authors observe any information-representation conflicts between the two sets of dot products? If so, how does the model resolve such conflicts to avoid degrading attention quality? If not, what mechanisms (e.g., parameter sharing, attention-head specialization) ensure that the two encodings provide complementary rather than redundant information?
3. Existing RoPE extension methods (e.g., NTK-PI, YaRN, LongRoPE) modify RoPE's behavior via dimension-wise adjustments (e.g., scaling rotary bases, interpolating index ranges, partitioning feature dimensions) to enhance length extrapolation. Could the authors clarify whether these dimension-level modifications are compatible with RoPE++'s imaginary-component extension? For instance, do such methods introduce conflicts in how positional information is encoded across the real and imaginary components, or can they be combined to further improve long-context performance?

EditLens Prediction: Moderately AI-edited
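One identity that may be relevant to question 3 above (and to question 6 of the previous review); it is a mathematical fact about the construction as described, not a claim from the paper. The imaginary score is an ordinary real RoPE score with a constant -π/2 phase offset:

$$
\operatorname{Im} z = \operatorname{Re}\!\big(e^{-i\pi/2} z\big)
\;\;\Longrightarrow\;\;
A^{\mathrm{im}}_{m,n} = \operatorname{Re}\sum_j q_j \bar{k}_j\, e^{\,i\big((m-n)\theta_j - \pi/2\big)}.
$$

Dimension-wise rescalings of the frequencies θ_j (as in NTK scaling or YaRN) therefore act identically on the real and imaginary components and leave the constant offset untouched; whether they also preserve RoPE++'s empirical long-context behavior is what the question asks the authors to verify.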