ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 2 (50%) 2.00 4.00 2834
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 2 (50%) 4.00 4.00 3302
Total 4 (100%) 3.00 4.00 3068
Title Ratings Review Text EditLens Prediction
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: DP-SGD is the most widely used algorithm for training ML models with differential privacy. Prior literature (DP-ZeRO) has scaled DP-SGD training to very large models but fails to scale to longer sequence lengths. Through a series of optimizations, this paper scales the maximum sequence length from 4k (prior work) to 16k on the Llama-3 8B model. They use optimizations from the existing literature on LLM scaling, such as (1) context parallelism, (2) gradient sharding, and (3) activation checkpointing, and adapt these to work with DP-SGD.

Strengths:
- The paper achieves significant results in terms of scaling DP-SGD training to longer contexts.
- Experimental results are comprehensive (although a bit hard to compare between methods given the separate tables for each method).
- The paper provides important contributions in adapting popular scaling techniques such as context parallelism and activation checkpointing to DP-SGD training, thus enabling further research in private LLM training.
- I liked the insight on the limitations of ghost clipping for scaling to longer contexts.

Weaknesses:
- The contributions are mainly engineering-oriented rather than conceptual/algorithmic, since existing methods from the non-private literature are extended somewhat straightforwardly to the private case.
- The paper could be better self-contained. Several concepts are used without much explanation, such as ghost clipping/ghost overhead, FSDP, context-extension continued pre-training, and the overlap of communication and computation in the input-stationary pattern.

Questions:
- Will there be open-source code for this paper? That would be very important given that a lot of the contributions are engineering-oriented and would enable ongoing research in this area.

More minor comments:
- What is the meaning of "large fragmentation" in the current Opacus implementation?
- Another relevant work/baseline might be "Scaling Private Deep Learning with Opacus: Advances for Large Language Models", which discusses FSDP with the ghost clipping approach.

EditLens Prediction: Fully human-written
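As context for the ghost clipping / fast gradient clipping terminology these reviews lean on: whichever route is used to obtain per-sample gradients or their norms, DP-SGD's aggregation is the same clip-sum-noise step. Below is a minimal PyTorch sketch of that step; the function name and default hyperparameters are illustrative, not the paper's implementation.

```python
import torch

def dp_sgd_aggregate(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0):
    """Clip each per-sample gradient to clip_norm, sum, add Gaussian noise,
    and average over the batch -- the aggregation that direct per-sample,
    ghost, and fast-gradient-clipping variants all ultimately compute."""
    B = per_sample_grads.shape[0]
    norms = per_sample_grads.flatten(1).norm(dim=1)              # (B,)
    scale = (clip_norm / (norms + 1e-6)).clamp(max=1.0)          # (B,)
    clipped = per_sample_grads * scale.view(B, *([1] * (per_sample_grads.dim() - 1)))
    noise = noise_multiplier * clip_norm * torch.randn_like(clipped[0])
    return (clipped.sum(dim=0) + noise) / B

# Example: per-sample gradients of a single (p x d) weight, batch of 4.
grads = torch.randn(4, 512, 256)
update = dp_sgd_aggregate(grads)
print(update.shape)  # torch.Size([512, 256])
```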
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper presents LongShield, a distributed framework for differentially private (DP) training of large language models (LLMs) under long-context settings. The method integrates context parallelism (CP) with per-sample gradient computation to achieve long-sequence scalability while maintaining DP guarantees.

Strengths:
1. Addresses an underexplored but important problem: long-context DP training.
2. Demonstrates strong empirical results with practical improvements in scalability and throughput.
3. Provides a clear implementation pathway compatible with the Opacus and TorchTitan frameworks.

Weaknesses:
1. The core idea (computing per-sample gradients under CP) is a direct extension of standard CP, replacing the local (p, d) gradient computation with (B, p, d) per-sample gradients. There is no new parallelism algorithm or architectural innovation beyond this dimensional extension.
2. The "input-stationary vs. output-stationary" trade-off is not new; it mirrors prior analyses in FlashAttention-2 and Megatron-Ulysses. The overlap strategy is a routine engineering optimization rather than a conceptual advance.
3. The paper emphasizes that FSDP cannot shard per-sample gradients while CP can, but this follows trivially from the existing data-partitioning dimensions. It is a property of CP's tensor layout, not a new design.
4. The hook-management fix for Opacus checkpointing is more of an implementation detail than a contribution; it does not introduce a new checkpointing strategy or a fundamental compatibility solution.
5. There is no formal analysis of privacy-utility trade-offs or communication complexity, and most claims rely purely on empirical evidence.
6. The authors do not justify why long-context DP training is necessary or beneficial. While long-context modeling is relevant for general LLMs, it remains unclear whether the same motivation applies to private training. The experiments focus on throughput and context-length scaling but omit any analysis of model utility, privacy-utility trade-offs, or downstream performance (e.g., perplexity, zero-shot tasks).

Questions:
1. Could the authors clarify whether any communication primitives or kernel fusion were re-implemented, or are all collectives reused from Megatron CP/Ulysses?
2. How does LongShield differ in actual code structure from integrating DP-SGD directly into existing CP frameworks?
3. Would privacy accounting or gradient clipping strategies change under model-parallel settings, or is the approach orthogonal to the parallelism dimension?
4. Why is long-context DP training practically needed? Are there real-world datasets or use cases that specifically require both privacy and long context?

EditLens Prediction: Fully AI-generated
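To make weakness 1's point about extending the local (p, d) gradient to (B, p, d) per-sample gradients concrete: because a linear layer's weight gradient is a sum over the sequence dimension, a context-parallel split of that dimension yields per-rank partial per-sample gradients that sum to the full ones. The following is a single-process sketch with illustrative shapes and no real collectives; an actual run would all-reduce these partials over the CP group or keep the shards local, as the paper reportedly does.

```python
import torch

B, T, p, d, cp = 2, 1024, 64, 64, 4   # cp = context-parallel degree (illustrative)
A = torch.randn(B, T, p)              # input activations of one linear layer
G = torch.randn(B, T, d)              # gradients w.r.t. the layer's output

# Non-parallel reference: full per-sample gradients, shape (B, p, d).
full = torch.einsum("btp,btd->bpd", A, G)

# Context parallelism splits the sequence dimension; each "rank" sees only a
# T/cp slice and produces a partial (B, p, d) per-sample gradient.
partials = [
    torch.einsum("btp,btd->bpd",
                 A[:, r * T // cp:(r + 1) * T // cp],
                 G[:, r * T // cp:(r + 1) * T // cp])
    for r in range(cp)
]

# Summing the partials (an all-reduce over the CP group in a real run)
# recovers the full per-sample gradients -- the (p, d) -> (B, p, d)
# dimensional extension the review refers to.
assert torch.allclose(full, sum(partials), rtol=1e-3, atol=1e-3)
```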
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS

Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces LongShield, which integrates context parallelism into the standard DP (differentially private) training method, DP-SGD. The main contribution of the paper is to integrate context parallelism on top of FSDP (ZeRO-DP+) for training LLMs with long contexts using differentially private algorithms such as DP-SGD. The paper also enables DP-safe activation checkpointing to extend the context further. On Llama 3.1 8B with 4× NVIDIA H100 GPUs, LongShield scales sequence length from 4k to 16k compared to the state-of-the-art ZeRO-DP, achieves linear sequence-length scaling, shrinks the throughput gap from 67% to 8.9% while matching non-DP memory usage, and reaches a 64k context length with activation checkpointing.

Strengths:
1. The paper identifies that for large context lengths, ghost clipping introduces significant memory overhead, and it shifts to computing pure per-sample gradients to integrate context parallelism into ZeRO-DP+.
2. LongShield keeps per-sample gradient shards local to each GPU to avoid full materialization and overlaps per-sample gradient aggregation with backward computation to sustain throughput.
3. The paper enables DP-safe activation checkpointing to extend the context further.
4. The paper presents experimental results on Llama 3.2 1B, Llama 3.2 3B, and Llama 3.1 8B over 4× H100 GPUs and shows that LongShield scales sequence length from 4k to 16k compared to the state-of-the-art ZeRO-DP.

Weaknesses:
1. The main weakness of the paper is that LongShield computes per-sample gradients, which is notorious for storing the per-sample gradient over the entire model, instead of using Fast Gradient Clipping (FGC) to avoid ghost clipping's $T^2$ overhead. This choice has not been justified: is there indeed some benefit (in terms of throughput) in computing the per-sample gradient of the entire model instead of FGC with two backward passes?
2. The paper was hard to follow. Please add pseudo-code similar to Algorithm 1 for LongShield, ZeRO-DP, and ZeRO-DP+, and please explain the technical details of CP and how it integrates with FSDP for DP-SGD more clearly.
3. No codebase is provided.

Questions:
1. "We adopt the pure gradient-sample (GS) approach to avoid ghost overhead." -- This is unclear to me. I understand that ghost clipping introduces memory overhead proportional to $T^2$, but this can be eradicated by shifting to Fast Gradient Clipping. Why do we need to store the entire per-sample gradient instead of using Fast Gradient Clipping + two backward passes if memory proportional to $T^2$ was indeed the bottleneck?
2. ZeRO-DP works with per-layer gradient clipping, which doesn't yield good utility in terms of test accuracy. For the ZeRO-DP+ results, are the experiments conducted with flat clipping instead of per-layer clipping?
3. In my opinion, integrating CP and FSDP into DP-SGD with Fast Gradient Clipping is trivial. If this is not the case, then can the authors explain why?
4. "For example, mixed ghost norm choose ghost for Llama 3.1 8B final linear layer up to T = 16k. However, the ghost norm is 4× more FLOPs than directly evaluating the per-sample gradient, and the final dot product between two large intermediate tensors ($O(BT^2)$) causes a similar time, according to our profiling, due to the reduction nature." -- This statement is not clear to me. Mixed GC will pick GC or FGC based on a minimum-overhead condition, and GC has more FLOPs than FGC because 1) FGC includes one matmul of BxTxp with BxTxd, whereas 2) GC has two matmuls (BxTxp with BxpxT, BxTxd with BxdxT) and one dot product ($B \times T^2$ with $B \times T^2$). The statement says that GC takes a similar time according to the profiling but causes an 8x slowdown; this part is not clear.
5. Table 4 shows the results for ZeRO-DP+ when it OOMs. How does LongShield perform in terms of throughput and peak memory when ZeRO-DP+ doesn't OOM? Can we utilize LongShield to improve throughput even when ZeRO-DP+ doesn't OOM?

Suggestions:
1. In Figure 1, mention the per-device batch size and the global batch size. For example, in this particular figure, FSDP supports a GBS of 2*MBS and a per-device batch size of MBS, but with CP we can support a larger context length at the cost of the per-device batch size being reduced to 1/2 and a GBS of 1. This is not clear in the figure or in the paragraph below it.
2. Figure 4 and the text below it are not consistent. The text says "we can all-to-all (A2A) exchange the activation tensor (shape changing from (MBS, T/2, p) to (MBS, T, p/2)) and then all-gather (AG) the activation gradient tensor (shape transferred from (MBS, T/2, d) to (MBS, T, d))", but the figure shows (MBS, T, p) and (MBS, T, d/2). Please correct this.

EditLens Prediction: Fully human-written
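The ghost-norm vs. direct per-sample-gradient comparison in questions 1 and 4 can be checked on a single linear layer: the direct (GS) route materializes (B, p, d) per-sample gradients in one contraction, while the ghost-norm route builds two $O(BT^2)$ Gram matrices and reduces them. A small PyTorch sketch with illustrative shapes (not the paper's configuration):

```python
import torch

# Illustrative (small) shapes; the point is the asymptotics, not these sizes.
B, T, p, d = 2, 256, 64, 64
A = torch.randn(B, T, p)   # inputs to a linear layer (per-sample activations)
G = torch.randn(B, T, d)   # gradients w.r.t. the layer's outputs

# Direct per-sample gradients (the "pure GS" route): one (B,T,p) x (B,T,d)
# contraction giving (B, p, d) materialized per-sample gradients.
per_sample = torch.einsum("btp,btd->bpd", A, G)
norms_direct = per_sample.flatten(1).norm(dim=1)

# Ghost-norm route: two Gram matrices plus an elementwise dot product,
# with O(B*T^2) intermediates that dominate at long sequence lengths.
gram_A = torch.einsum("btp,bsp->bts", A, A)   # (B, T, T)
gram_G = torch.einsum("btd,bsd->bts", G, G)   # (B, T, T)
norms_ghost = (gram_A * gram_G).sum(dim=(1, 2)).sqrt()

# Both routes yield the same per-sample gradient norms.
assert torch.allclose(norms_direct, norms_ghost, rtol=1e-3)
```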
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper proposes LongShield, a system approach that combines context parallelism (CP) with existing differentially private (DP) training frameworks (e.g., Opacus + ZeRO-DP) to enable long-context DP training for LLMs. The key contributions include per-sample gradient sharding, communication overlap, and DP-compatible activation checkpointing. Experiments on Llama-3 models (1B–8B) show improved throughput and reduced memory over ZeRO-DP, scaling context length up to 64k tokens on 4× H100 GPUs.

Strengths:
1. Clear and reproducible system engineering.
2. Demonstrates that DP-SGD can scale to longer contexts using modest hardware.
3. Addresses practical compatibility issues (e.g., checkpointing with DP hooks).
4. Experimental evaluation on real Llama-3 models is thorough for throughput and memory.

Weaknesses:
1. Weak novelty. The method essentially combines context parallelism with DP-SGD under existing frameworks. It does not introduce new algorithms, optimizers, or privacy accounting techniques.
2. Lack of multi-dimensional parallelism. The system is confined to single-node CP and does not explore or support other critical dimensions of LLM parallelism, such as Tensor Parallel (TP), Pipeline Parallel (PP), Expert Parallel (EP), or ZeRO data sharding. The paper even notes communication challenges would "become more critical" beyond one node, but never verifies cross-node scalability. Consequently, the proposed method's scalability under realistic distributed setups remains unclear.
3. No integration with established frameworks. CP is already well implemented in Megatron, which also provides integrated TP/PP/EP interfaces. The authors should compare their CP-DP implementation against a baseline that simply adds DP-SGD into Megatron. Moreover, they should explain why they did not directly integrate into Megatron; doing so would have allowed evaluation under real multi-dimensional parallelism (ZeRO + TP + PP + CP + EP) and would strengthen the engineering contribution.
4. Unconvincing motivation. The paper claims that DP is critical for long-context LLMs because "long sequences may contain sensitive data," but provides no empirical evidence (e.g., no memorization or membership-inference study) to support that assumption.
5. No privacy evaluation. The paper lacks ε/δ reporting, attack-resistance analysis, or privacy-utility trade-off curves. As such, it demonstrates feasibility, not utility.
6. Limited scientific insight. The improvements (gradient sharding, hook fix, communication overlap) are incremental engineering optimizations that could be implemented in existing systems with minor effort.

Questions:
1. What privacy budgets (ε, δ) were achieved in your experiments?
2. How does DP noise affect model accuracy or perplexity?
3. Have you compared your CP implementation with Megatron's sequence-parallel engine?
4. Could LongShield be integrated with Megatron to combine TP/PP/CP/DP?
5. How does performance scale across nodes with slower interconnects (e.g., RDMA instead of NVLink)?

EditLens Prediction: Fully AI-generated
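On question 1 (reporting ε and δ): for a DP-SGD run, the budget is determined offline by the noise multiplier, sampling rate, and step count. The sketch below assumes Opacus's RDP accountant API; the hyperparameters and δ are placeholders for illustration, not values from the paper.

```python
# Hedged sketch: computing a DP-SGD privacy budget offline with Opacus's
# RDP accountant. Noise multiplier, sampling rate, step count, and delta
# are illustrative placeholders, not numbers reported by the paper.
from opacus.accountants import RDPAccountant

accountant = RDPAccountant()
noise_multiplier, sample_rate, steps = 1.0, 256 / 100_000, 10_000

for _ in range(steps):
    accountant.step(noise_multiplier=noise_multiplier, sample_rate=sample_rate)

epsilon = accountant.get_epsilon(delta=1e-5)
print(f"(eps, delta) = ({epsilon:.2f}, 1e-5)")
```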