ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 3.00 | 3.50 | 4356 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 4.00 | 4.00 | 2660 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (25%) | 4.00 | 4.00 | 3965 |
| Total | 4 (100%) | 3.50 | 3.75 | 3834 |
Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper presents DEnSNEL, the first fully end-to-end and text-free model for frame-level spoken Named Entity Localization (NEL). Instead of cascading ASR and NER or augmenting CTC with entity tokens, DEnSNEL reformulates NEL as a dense binary classification task over 20 ms audio frames. A lightweight encoder–classifier processes mel-spectrogram frames in parallel with a learnable complex Gabor filterbank that operates directly on the raw waveform to capture phonetic transients (a front-end sketch follows this review); the two streams are concatenated and fed to a 3-layer MLP. Training is driven by the proposed Temporal Boundary Alignment Loss (TBAL), which combines frame-wise BCE with a 1-D IoU term to sharpen entity boundaries. On the SLUE Phase-2 benchmark, DEnSNEL achieves 78.4 frame-F1 with a Whisper-medium encoder, outperforming the previous best result by +5.8 F1 while using fewer parameters.

The paper's key strength lies in its elegant reformulation of spoken named-entity localization as a fully end-to-end, frame-level binary prediction task, eliminating any reliance on intermediate text or CTC decoding. By equipping a lightweight encoder with a learnable complex Gabor filterbank, the model captures millisecond-scale phonetic transients that mel-spectrograms smooth away, yielding a +2.3 F1 boost with only 8k extra parameters. Coupled with the proposed Temporal Boundary Alignment Loss, this design achieves a new state-of-the-art 78.4 frame-F1 on SLUE Phase-2 while running in ≈0.1 s on a single GPU, demonstrating that high accuracy and on-device efficiency can be obtained simultaneously.

A notable architectural weakness is that the phonetic Gabor features and the contextual encoder outputs are simply concatenated without any channel-wise normalization or learnable projection; the raw Gabor magnitudes can be one to two orders of magnitude larger than the log-mel/logit features, so the subsequent layer-norm only rescales per-frame vectors and does not eliminate the risk of gradient dominance or sensitivity to the number of filters. This scale mismatch, omitted in Figure 2, could explain the saturation (and even drop) in F1 when more than 512 filters are used and should be addressed by separate layer-norms or bottleneck fusion blocks to ensure stable and interpretable feature blending (see the fusion sketch after this review).

1. The phonetic Gabor features and the contextual encoder outputs are simply concatenated without any channel-wise normalization or learnable projection; the raw Gabor magnitudes can be one to two orders of magnitude larger than the log-mel/logit features.
2. No statistical significance tests are reported: the +5.8 F1 gain is quoted from a single run, and variance across random seeds or bootstrap resampling is unknown.
3. Hyper-parameter sensitivity (TBAL weight β, Gabor filter count K, threshold τ) is only explored in isolation; joint tuning and confidence intervals are absent.
4. All experiments are conducted on English data; cross-lingual or cross-domain generalization (e.g., French, clinical speech) is not examined.
5. The promised code and pre-trained models are not yet available, impeding reproducibility and fair comparison by future work.

EditLens Prediction: Fully AI-generated
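For context on the front-end discussed in the review above, here is a minimal PyTorch sketch of a learnable complex Gabor filterbank applied to the raw waveform: center frequencies and Q-factors are trainable, and the magnitude of the complex response is emitted at a 20 ms hop. This illustrates the general technique only; it is not the authors' implementation, and all names, initializations, and dimensions are assumptions.

```python
# Sketch of a learnable complex Gabor front-end (illustrative, not the paper's code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGaborFrontEnd(nn.Module):
    def __init__(self, n_filters=128, kernel_size=401, sample_rate=16000,
                 hop_length=320):  # 320 samples = 20 ms hop at 16 kHz
        super().__init__()
        self.kernel_size = kernel_size
        self.hop_length = hop_length
        freqs = torch.linspace(60.0, sample_rate / 2 - 100.0, n_filters)
        self.center_freq = nn.Parameter(freqs / sample_rate)          # normalized cycles/sample
        self.q_factor = nn.Parameter(torch.full((n_filters,), 3.0))   # bandwidth control

    def forward(self, waveform):  # waveform: (batch, samples)
        t = torch.arange(self.kernel_size, dtype=torch.float32,
                         device=waveform.device) - self.kernel_size // 2
        f = self.center_freq.clamp(1e-4, 0.5).unsqueeze(1)             # (n_filters, 1)
        # Gaussian envelope width from Q: sigma_t ≈ Q / (2*pi*f)
        sigma = self.q_factor.clamp(min=0.5).unsqueeze(1) / (2 * math.pi * f)
        envelope = torch.exp(-0.5 * (t / sigma) ** 2)
        real = envelope * torch.cos(2 * math.pi * f * t)               # real Gabor kernels
        imag = envelope * torch.sin(2 * math.pi * f * t)               # imaginary kernels
        kernels = torch.cat([real, imag], dim=0).unsqueeze(1)          # (2*n_filters, 1, K)
        y = F.conv1d(waveform.unsqueeze(1), kernels,
                     stride=self.hop_length, padding=self.kernel_size // 2)
        re, im = y.chunk(2, dim=1)
        return torch.sqrt(re ** 2 + im ** 2 + 1e-8)                    # (batch, n_filters, frames)
```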
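And a hedged sketch of the fix the review suggests: normalize each stream separately and pass the Gabor branch through a small learnable bottleneck before concatenation, so the raw Gabor magnitudes cannot dominate the encoder features. Dimensions and module names are illustrative, not taken from the paper.

```python
# Per-stream LayerNorm + bottleneck fusion (illustrative sketch of the suggested fix).
import torch
import torch.nn as nn

class NormalizedFusion(nn.Module):
    def __init__(self, enc_dim=1024, gabor_dim=128, bottleneck_dim=128):
        super().__init__()
        self.enc_norm = nn.LayerNorm(enc_dim)        # separate normalization per stream
        self.gabor_norm = nn.LayerNorm(gabor_dim)
        self.gabor_proj = nn.Sequential(             # learnable projection for the Gabor branch
            nn.Linear(gabor_dim, bottleneck_dim),
            nn.GELU(),
        )

    def forward(self, enc_feats, gabor_feats):
        # enc_feats: (batch, frames, enc_dim); gabor_feats: (batch, frames, gabor_dim)
        h_enc = self.enc_norm(enc_feats)
        h_gab = self.gabor_proj(self.gabor_norm(gabor_feats))
        return torch.cat([h_enc, h_gab], dim=-1)     # fused features for the MLP classifier
```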
Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The proposed DEnSNEL reformulates spoken NEL as a binary classification task, determining "entity/non-entity" over the frame sequence and outputting results directly at the audio-frame level. This approach fuses a pre-trained audio encoder with a learnable filter to enhance phoneme boundary cues. It also proposes the TBAL loss, a combination of the tIoU loss and the BCE loss, to simultaneously optimize semantic correctness and temporal accuracy in classification. It achieves state-of-the-art results on NEL in SLUE Phase-2.

The proposed TBAL loss, a combination of tIoU loss and BCE loss, simultaneously considers the semantic correctness of NEL and the temporal precision of boundary prediction. The model achieves state-of-the-art performance on the SLUE Phase-2 NEL benchmark while using fewer parameters than previous approaches.

* For the second contribution, "Phonetic-enhanced feature integration for precise entity boundary detection", the paper lacks a complete and detailed ablation study. The paper only compares the performance score with previous methods, providing coarse-grained evidence of the model's effectiveness, but lacks fine-grained analysis to validate the contribution of the proposed filter. It is recommended to include ablation experiments that remove the filter and to conduct deeper validation across different loss functions, such as: (1) Whisper-encoder + IoU-loss, (2) Whisper-encoder + BCE-loss, (3) Whisper-encoder + TBAL-loss.
* Lacking in-depth analysis. It is recommended to further analyze the proposed method, such as how it compares to previous methods across different audio lengths and how sensitive it is to environmental noise. Alternatively, the effect of adding the filter to previous methods for this task could be examined.
* Many important details are unclear. For example, the ablation experiments on the TBAL loss demonstrate that neither the tIoU loss nor the BCE loss alone is optimal. However, the paper does not provide a detailed explanation of the $\beta$ hyperparameter in the TBAL loss. Specifically, it is unclear which loss should be favored during actual training, or whether there are any trends when the two losses are mixed at different ratios. We recommend providing more comprehensive hyperparameter settings, along with reasonable experiments and explanations. The paper also does not provide a detailed rationale for fine-tuning only the deeper 1/6 of the audio-encoder layers during training. Is there a better choice for this ratio, or is there a trend? We recommend providing additional examples of other ratios, as well as the two boundary conditions: full fine-tuning and full freezing.

The following questions mainly follow from the weaknesses above. I hope the authors will address them in detail during the rebuttal stage.

* Q1: For the second contribution of the paper, the authors emphasize the importance of "Phonetic-enhanced feature integration," but no in-depth ablation analysis is conducted in the experiments. What is the performance of the model after removing the filter component?
* Q2: In Section 3.4, the authors propose the TBAL loss, which combines the tIoU loss and the BCE loss, and conduct an ablation experiment on the loss in the experimental section. It can be seen that the two losses have different effects. However, the authors do not explain the $\beta$ parameter in the TBAL loss. In actual training, are the two treated equally, or is one weighted more heavily than the other? Also, is there any trend across different $\beta$ values? (See the loss sketch after this review.)
* Q3: In Section 4.2, the authors mention that they "maintain an approximately constant ratio of 1:6 between fine-tuned and total encoder layers across all variants." How was this 1:6 figure obtained? Is there a more optimal option? Are there any trends across different ratios? What are the effects of full freezing and full fine-tuning? (See the freezing sketch after this review.)
* Q4: How does the model perform under different audio lengths? How does it perform on noisy audio?

EditLens Prediction: Fully human-written
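To make Q2 concrete, here is a hedged sketch of a TBAL-style objective: frame-wise BCE plus a soft 1-D temporal IoU term mixed by a weight $\beta$. The exact formulation in the paper may differ; this only illustrates how $\beta$ trades off the two terms.

```python
# Illustrative TBAL-style loss: beta mixes a soft temporal-IoU term with frame-wise BCE.
import torch
import torch.nn.functional as F

def tbal_loss(logits, targets, beta=0.5, eps=1e-6):
    """logits, targets: (batch, frames); targets are float 0/1 entity masks."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=1)
    union = (probs + targets - probs * targets).sum(dim=1)
    soft_iou = (intersection + eps) / (union + eps)
    iou_loss = 1.0 - soft_iou.mean()
    # beta = 1 -> pure tIoU objective; beta = 0 -> pure BCE; values in between mix them
    return beta * iou_loss + (1.0 - beta) * bce
```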
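And for Q3, a short sketch of the kind of sweep that would answer it: freeze the Whisper encoder and unfreeze only the top fraction of transformer blocks, varying the ratio including the two boundary cases. Attribute names follow the Hugging Face `WhisperModel`; only the transformer blocks are toggled here (the conv front-end stays frozen), so adapt as needed.

```python
# Layer-freezing sweep over the Whisper encoder (assumes Hugging Face transformers).
from transformers import WhisperModel

def set_trainable_ratio(encoder, ratio):
    layers = encoder.layers                      # ModuleList of encoder transformer blocks
    n_trainable = round(len(layers) * ratio)
    for p in encoder.parameters():               # start fully frozen
        p.requires_grad = False
    for layer in layers[len(layers) - n_trainable:]:
        for p in layer.parameters():             # unfreeze only the deepest blocks
            p.requires_grad = True
    return n_trainable

encoder = WhisperModel.from_pretrained("openai/whisper-medium").get_encoder()
for ratio in (0.0, 1/6, 1/3, 1/2, 1.0):          # includes full freezing and full fine-tuning
    print(f"ratio={ratio:.2f}: {set_trainable_ratio(encoder, ratio)} layers trainable")
```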
Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The manuscript proposes DEnSNEL, an end-to-end system for spoken named entity localization that predicts entity presence directly at the audio frame level, bypassing intermediate text representations common to cascaded ASR–NER pipelines. The method reframes NEL as dense binary classification ("entity" vs. "non-entity"), using a lightweight encoder–classifier augmented with a learnable complex filter bank to capture phonetic cues crucial for accurate span boundaries. Training is guided by a temporal boundary alignment loss (TBAL) that combines binary cross-entropy with a temporal IoU term to explicitly optimize boundary precision. On the SLUE Phase 2 benchmark, DEnSNEL reportedly surpasses state-of-the-art frame-level localization performance while using substantially fewer parameters.

- lightweight architecture
- learnable complex filter bank + a boundary-focused loss for span precision
- evaluation is limited
- missing practitioner baselines

## Practitioner baselines

> DEnSNEL achieves state-of-the-art performance on the SLUE Phase 2 benchmark (Shon et al., 2023) for spoken NEL.

> BLAB (Ahia et al., 2025) introduces a challenging long-form audio benchmark, reporting results using both open-source and proprietary audio LMs, including Gemini 2.0 Pro and GPT-4o. This differs from our motivation, which focuses on lightweight, on-device named entity detection and masking prior to sending audio to those large MLLMs.

The manuscript argues for a different goal: lightweight, on-device NER masking prior to sending audio to large MLLMs. That is a plausible deployment case, but it does not remove the need to benchmark against what practitioners and real-world products actually run today.

1. There is no head-to-head comparison with a Whisper/WhisperX (or NeMo) alignment + strong text NER pipeline on **the same SLUE Phase-2 split**. This is the de-facto production baseline for timestamped NER; without it, it is unclear whether DEnSNEL's frame-level approach is competitive in accuracy and efficiency.
2. Promptable audio-LLM baseline. There is no comparison with prompted audio LLMs (e.g., GPT-4o, Gemini, Qwen-Audio) instructed to return entity spans with timestamps. Even if the objectives differ, practitioners weigh trade-offs across these options. The BLAB discussion's claim that MLLMs are "not comparable" is insufficient for a 2025 audience; the field needs calibrated, side-by-side results.
3. Trade-off vs. audio-LLMs: For GPT-4o/Gemini/Qwen-Audio prompted for timestamped NER on SLUE, what is the **quality/latency/cost trade-off**, and in which regimes (device class, latency budget, privacy precision) does DEnSNEL win or lose?

## Limited Evaluation

> Our evaluation does not correspond to BLAB-MINI, the short-audio subset of BLAB with audio segments under 30 seconds, which is not given in detail. Instead, we process long recordings in 30-second chunks, resulting in a hybrid evaluation setting.

> The entity categories in BLAB (Event, Location, NORP, Organization, Person, TV Show, Temporal, and Work of Art) do not fully correspond to those used in our approach. We therefore restrict our analysis to the 49 recordings in the "All Entities" category, which contains all nine entity types, while the remaining files consist of single-category recordings.

4. The BLAB comparison might not be apples-to-apples. The "zero-shot BLAB" analysis might be methodologically misaligned on at least two fronts:

   4.1 Segmentation protocol. BLAB-MINI is short-audio (<30s), whereas the paper processes long recordings in 30-second chunks. This creates a hybrid evaluation regime that may alter boundary distributions, pause statistics, and error modes. Any reported performance should be prefaced with an explicit caveat and, ideally, replicated under BLAB's native segmentation.

   4.2 BLAB's categories might not fully align with the paper's entity set. Restricting analysis to the "All Entities" subset partially mitigates this but also changes the underlying class priors and may inflate or deflate difficulty. Without a principled mapping or macro-averaged reporting across matched classes, the comparison does not constitute a fair head-to-head. The authors should provide a deterministic mapping between BLAB and their entity set (one-to-one, one-to-many, or "dropped" classes) and report macro- and micro-averaged metrics on the intersection, plus per-class breakdowns to show where performance shifts (see the evaluation sketch after this review).

5. The empirical evidence is concentrated almost entirely on SLUE Phase 2. With only one primary benchmark, it is difficult to disentangle overfitting to dataset artifacts from genuine generalization. External validity across accents, domains, recording conditions, and entity taxonomies is therefore not established.

## Potential Leakage

> We use the Whisper encoder (Radford et al., 2023) as the audio encoder in our DEnSNEL architecture.

6. Were the pretrained encoders (e.g., the Whisper encoder) exposed to SLUE/BLAB audio or transcripts during pretraining? Is there evidence of deduplication or near-duplicate removal?

> We evaluate our approach on the SLUE-VoxPopuli NEL test set from the SLUE Phase 2 benchmark (Shon et al., 2023). As no official training data is available, we construct our own training dataset by augmenting the SLUE-VoxPopuli NER training set (Shon et al., 2022) with word-level timestamps.

7. How do you avoid data leakage when constructing your own training dataset? If there is no standard training/eval/test split, how can you ensure that the comparison is fair?

EditLens Prediction: Fully AI-generated
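To illustrate the evaluation protocol asked for in point 4.2, here is a small sketch of a deterministic category mapping plus macro- and micro-averaged frame-level F1 computed only on the intersection of classes. The mapping and all names below are hypothetical, taken from neither BLAB nor the paper.

```python
# Sketch: deterministic category mapping + macro/micro frame-F1 on the class intersection.
import numpy as np

BLAB_TO_PAPER = {            # hypothetical mapping; "dropped" classes map to None
    "Person": "PERSON",
    "Location": "PLACE",
    "Organization": "ORG",
    "Temporal": "DATE",
    "TV Show": None,
    "Work of Art": None,
}

def frame_f1(pred, gold, eps=1e-9):
    """pred, gold: boolean frame masks."""
    tp = np.logical_and(pred, gold).sum()
    fp = np.logical_and(pred, ~gold).sum()
    fn = np.logical_and(~pred, gold).sum()
    p, r = tp / (tp + fp + eps), tp / (tp + fn + eps)
    return 2 * p * r / (p + r + eps)

def report(per_class_pred, per_class_gold):
    """per_class_*: dict mapping a retained class name -> boolean frame mask."""
    kept = list(per_class_gold)                                   # intersection classes only
    macro = np.mean([frame_f1(per_class_pred[c], per_class_gold[c]) for c in kept])
    micro = frame_f1(np.concatenate([per_class_pred[c] for c in kept]),
                     np.concatenate([per_class_gold[c] for c in kept]))
    return {"macro_f1": float(macro), "micro_f1": float(micro)}
```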
Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper introduces DEnSNEL (Dense End-to-end Spoken Named Entity Localizer), a novel approach for frame-level named entity localization in speech audio. The method reformulates the task as dense binary classification (entity vs. non-entity) rather than text-based sequence labeling, employing a lightweight encoder with a learnable complex filter bank for phonetic features and a boundary-focused loss function. On the SLUE Phase 2 benchmark, DEnSNEL achieves state-of-the-art results (78.4 F1).

1. Novel Problem Formulation: The reformulation of spoken NEL as frame-wise dense prediction without intermediate text representations is innovative and well-motivated for privacy applications. This direct audio-to-mask approach is more suitable for on-device processing.
2. Technical Contributions:
   - The learnable complex filter bank with jointly learned center frequencies and Q-factors is a solid extension of prior work (SincNet, Gabor filters).
   - The temporal boundary alignment loss (TBAL) combining BCE and temporal IoU is intuitive and effective.
   - The integration of phonetic and contextual features is well-designed.
3. Strong Empirical Results: State-of-the-art performance on SLUE Phase 2 (+5.8 F1 over previous best).

1. Limited Scope: The privacy motivation is compelling but insufficiently validated. Real PII masking requires entity-type-aware classification (e.g., masking PERSON but not ORG), which this binary approach cannot achieve. This limitation should be discussed and addressed in greater depth in the paper; I believe it limits the contribution and applicability of this work.
2. Generalization Concerns: It remains unclear whether a model trained to detect a fixed set of entity types can generalize to unseen categories, such as works of art, not encountered during training. This is a realistic and important scenario for practical deployment.
3. Single-Source Evaluation: The reliance on a single dataset for evaluation is a fundamental weakness. The BLAB test set is relatively small, limiting generalizability. The paper would be stronger if the authors included evaluation on an independently annotated dataset or another NER corpus (e.g., SLURP) with alignments generated via MFA (see the labeling sketch after this review). Incorporating some form of zero-shot evaluation would significantly strengthen the paper's claims about model applicability.
4. Incomplete Ablation Analysis: The phonetic filter ablation (Section 4.4, Figure 6) only varies the number of filters. A crucial missing ablation is performance without any phonetic features. Including this comparison would help clarify the true contribution of phonetic information.

Please check the weaknesses.

EditLens Prediction: Moderately AI-edited
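As a companion to weakness 3, here is a small sketch (not the authors' pipeline) of how word-level alignments such as MFA output, together with entity word indices, could be turned into the 20 ms frame-level binary mask that the dense formulation trains on. The function names and the 20 ms frame rate are assumptions based on the review text.

```python
# Sketch: word-level alignments -> 20 ms frame-level binary entity mask.
import numpy as np

FRAME_SEC = 0.02  # one label per 20 ms frame (assumed)

def words_to_frame_mask(word_alignments, entity_word_indices, audio_dur_sec):
    """word_alignments: list of (start_sec, end_sec) per word;
    entity_word_indices: indices of words inside any named-entity span."""
    n_frames = int(np.ceil(audio_dur_sec / FRAME_SEC))
    mask = np.zeros(n_frames, dtype=np.int64)
    for idx in entity_word_indices:
        start, end = word_alignments[idx]
        lo = int(np.floor(start / FRAME_SEC))
        hi = min(n_frames, int(np.ceil(end / FRAME_SEC)))
        mask[lo:hi] = 1          # mark every frame overlapped by an entity word
    return mask

# e.g., "Barack Obama visited Paris": words 0-1 and 3 are entity words
mask = words_to_frame_mask([(0.10, 0.45), (0.45, 0.90), (0.95, 1.40), (1.45, 1.90)],
                           entity_word_indices=[0, 1, 3], audio_dur_sec=2.0)
```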