ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 6.00 | 3.00 | 3953 |
| Fully human-written | 2 (67%) | 5.00 | 3.00 | 4644 |
| Total | 3 (100%) | 5.33 | 3.00 | 4414 |
**Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization**

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:**

This paper presents AGG-RL (Audio Geometry Grid Representation Learning), a family of algorithms that estimates direction-of-arrival in a way that is invariant to the geometry of the microphone array and adaptable to different grids. To achieve this, AGG-RL relies on two innovations: i) a DFT whose frequency-bin widths are learnt during training (allowing higher resolution on the frequencies most relevant for direction-of-arrival estimation), and ii) relative position encodings to describe the microphone geometry. To train AGG-RL, one network (AuGeonet) encodes the sound field (conditioned on the microphone array geometry), while another network (Gridnet) encodes a grid of query points where the sources in the sound field may be localised. The final output logits are obtained by combining the outputs of AuGeonet and Gridnet (essentially an inner product), with each logit representing the probability of a source being present at the corresponding query point (see the sketch after this review). Binary cross-entropy is used to compute the loss. Evaluations on the LOCATA dataset show AGG-RL achieving ~11° mean absolute error on direction-of-arrival estimation with a microphone array and dataset not seen during training.

**Strengths:**

* The paper is well written and easy to follow (though see nitpicks below for possible improvements).
* The contributions (learnable DFT widths, relative positions for microphone array embeddings, and flexible grids) are elegant, inspired by physics, and well described in the manuscript.
* The paper contains a useful introduction to direction-of-arrival estimation, allowing non-acoustics experts to follow the paper.
* The evaluation ablates all the components of AGG-RL and presents very strong results.
* Device-independent direction-of-arrival estimation is an important problem for acoustics: many consumer devices nowadays need to perform direction-of-arrival estimation, but each tends to have its own configuration. A single network that generalises over all devices is genuinely useful.

**Weaknesses:**

* **W1**: Some details of the training and evaluation procedures are missing from the manuscript (see questions below).
* **W2**: More ablations are needed to fully characterise the behaviour of AGG-RL (e.g. SNR, number of sources in the sound field; see questions below for specifics).
* **W3**: Evaluating against other datasets with different microphone array geometries would help definitively establish generalisation to any microphone array.

_Overall_, this is a strong submission held back by a relatively short evaluation that leaves many questions unanswered. It is my hope that many of these issues are resolved during rebuttal.

**Questions:**

* **Q1**: In eq. 14, are the grid points $\theta_d$ and $\phi_d$ obtained by linearly spacing the azimuth and elevation ranges by $\frac{G}{2}$ points respectively?
* **Q2**: How many synthetic samples were used during training? Can the performance of AGG-RL be further improved by adding more training samples?
* **Q3**: Appendix A.9 contains really important information that is not hinted at in the main text (acoustic characteristics of the simulations, utterance and noise datasets, SNRs, duration of the audio segments) and that should be summarised in the main text.
* **Q4**: What does it mean exactly that the NAO dataset has training exposure? Was the dataset used during training?
* **Q5**: In table 2, could you report the spread of the direction-of-arrival errors, in addition to the mean absolute error?
* **Q6**: How do you train AuGeonet to accept more/fewer microphones in an array? Is there a "not present" encoding?
* **Q7**: Could you break down the results in table 2 by the number of speakers present in the sound field?
* **Q8**: Could you break down the results in table 2 by SNR and T60?
* **Q9**: Can AGG-RL model different directional sensitivities in microphones? If not, could you hypothesise whether this would be possible with an extension? Relatedly, would it be possible to input ideal ambisonics to AGG-RL?
* **Q10**: What microphone array geometry information is provided to AuGeonet? The main manuscript seems to indicate it is just the relative encodings (the relative angles), but Appendix A.3 indicates the geometry is given in Cartesian and polar coordinates.
* **Q11**: Could you evaluate AGG-RL against other direction-of-arrival estimation datasets in the literature (e.g. TUT Sound Events 2018 [1], Spatial LibriSpeech [2], STARSS23 [3])?
* **Q12**: In table 11, why does the performance decay as the number of points in the grid increases?

-----

### Nitpicks (do not affect rating; no need to follow up on these during rebuttal)

* **N1**: I found the notation for $\mathbf{v}$ very confusing, and for a few minutes I could not understand what $\mathbf{v}_M$ and $\mathbf{v}_G$ were. These would be easier to parse if they were expressed as a function $v(Q) = \frac{4}{Q}[0, 1, \dots, \frac{Q}{4} - 1]$.
* **N2**: Similarly, I find it would be easier if you indicated the function parameters in equations 12, 14, 15 and 16.
* **N3**: I would suggest explicitly mentioning in equation 1 that $c, k, l$ refer to the channel, frequency, and time indices respectively.

-------

[1] Adavanne et al. (2019). "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks." IEEE Journal of Selected Topics in Signal Processing.
[2] Sarabia et al. (2023). "Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning." Interspeech.
[3] Shimada et al. (2023). "STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events." NeurIPS Track on Datasets and Benchmarks.

**EditLens Prediction:** Fully human-written
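For concreteness, here is a minimal sketch of the AuGeonet/Gridnet scoring step described in the summary above. All names, shapes, and the random embeddings are illustrative assumptions, not the authors' code:

```python
import torch

# Illustrative sizes: B = batch, D = shared latent dim, G = grid query points.
B, D, G = 8, 256, 1000

audio_geom_emb = torch.randn(B, D)   # stands in for AuGeonet's output (audio + geometry)
grid_emb = torch.randn(G, D)         # stands in for Gridnet's output, one per DOA query

# Logits via an inner product in the shared latent space:
# one score per candidate grid point, read (after sigmoid) as the
# probability of a source at that direction.
logits = audio_geom_emb @ grid_emb.T          # shape (B, G)

# Multi-label targets: 1 wherever a source is present on the grid.
targets = torch.zeros(B, G)
targets[:, 0] = 1.0                           # e.g. one source at grid point 0

loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
```

Because the grid embeddings are computed independently of the audio, the same trained model can in principle be scored against a denser or coarser grid at test time, which is the grid flexibility the review refers to.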
**Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization**

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:**

The paper introduces the Audio-Geometry-Grid Representation Learning (AGG-RL) framework, designed to achieve universal sound source localization (SSL) that is flexible with respect to Direction-of-Arrival (DOA) grids and invariant to microphone array (MA) geometries. This is achieved by learning shared latent representations for audio-geometry information and DOA grid candidates, allowing the model to predict a probabilistic spatial spectrum without requiring retraining for new grid resolutions or MA configurations. The primary technical contributions are two physics-informed components integrated into the framework: a Learnable Non-uniform Discrete Fourier Transform (LNuDFT), which adaptively allocates frequency bins to emphasize informative phase cues, and a relative Microphone Positional Encoding (rMPE), which encodes microphone coordinates relative to a reference channel, aligning with the nature of inter-channel time differences (TDOA). Experiments on synthetic and real datasets demonstrate that AGG-RL achieves better performance and stronger generalization to unseen microphone geometries and environments where traditional methods struggle.

**Strengths:**

- The paper integrates two novel, physics-informed components to enhance generalizability and interpretability. The LNuDFT optimizes frequency-bin allocation to emphasize physically informative phase cues (IPDs) crucial for SSL. The rMPE encodes microphone coordinates relatively and improves generalization to unseen MA configurations.
- AGG-RL demonstrates better performance and robustness across experiments, showing strong generalization to unseen MA geometries, unseen numbers of channels (Dynamic-U), and real-world recordings.
- I like the visualization of the interpretable probabilistic spatial spectrum over candidate DOAs. It confirms that the proposed method yields sharper, more stable, and more distinct peaks closely aligned with ground-truth DOAs compared to baselines, even under unseen geometries.
- The paper is well written, and I really enjoyed reading it.

**Weaknesses / Questions:**

- The concept of the NuDFT itself is not new, having been explored in the signal processing literature since the 1990s. The novelty lies in making the frequency-bin locations $\nu_k$ learnable parameters within a deep neural network. While effective and physics-informed, this is an adaptation and optimization of an existing signal processing technique to maximize the extraction of IPDs.
- While the logit-based initialization strategy for LNuDFT proved effective in practice and benefited generalization to unseen conditions, the choice of mapping function and hyperparameters was made empirically. Could the authors elaborate on the empirical tuning process that led to the specific values chosen ($\epsilon_{start}=0.15$, $\epsilon_{end}=0.95$, $\epsilon_{min}=0.01$, $\epsilon_{max}=100$), and how sensitive the final performance is to small variations in these constraints? (One possible reading of this parametrisation is sketched after this review.)
- While AGG-RL achieved good performance in general, the results still indicate a performance gap between seen and unseen conditions. What specific architecture or training configurations do the authors believe offer the most potential to further minimize this performance gap under diverse, unseen MA geometries?
- In the ablation study, experiment (vi) (Fixed Grid) achieved the best performance on the Dynamic-S dataset (seen training conditions) and, surprisingly, slightly outperformed the proposed AGG-RL method on the Dynamic-U dataset (unseen channels) in terms of ACC10. Why did the AGG-RL framework exhibit a degradation compared to the fixed-grid approach in this specific unseen synthetic channel configuration?

**EditLens Prediction:** Fully human-written
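To make the initialization question above concrete, the following is one plausible reading of a logit-parametrised LNuDFT. This is my own reconstruction, not the authors' implementation: in particular, the role assigned here to $\epsilon_{start}$ and $\epsilon_{end}$ (the initial spread of the bins) is a guess, and $\epsilon_{min}$/$\epsilon_{max}$ are not modelled:

```python
import math
import torch

class LNuDFTSketch(torch.nn.Module):
    """Hypothetical reconstruction of a learnable non-uniform DFT: the K
    normalized bin frequencies nu_k are trainable. A logit parametrisation
    keeps each nu_k inside (0, 0.5) regardless of gradient updates; the
    eps_start/eps_end constants from the review are assumed here to set
    the initial spread of the bins."""

    def __init__(self, n_fft=512, n_bins=128, eps_start=0.15, eps_end=0.95):
        super().__init__()
        # Place initial bins between eps_start and eps_end of the (0, 1)
        # sigmoid range, stored as logits so updates stay bounded.
        init = torch.linspace(eps_start, eps_end, n_bins)
        self.nu_logits = torch.nn.Parameter(torch.logit(init))
        self.n_fft = n_fft

    def forward(self, x):                                  # x: (..., n_fft) frame
        nu = 0.5 * torch.sigmoid(self.nu_logits)           # nu_k in (0, 0.5)
        n = torch.arange(self.n_fft, dtype=x.dtype)
        angles = -2 * math.pi * nu[:, None] * n[None, :]   # (n_bins, n_fft)
        # Correlate each frame with complex exponentials at the learnt bins.
        real = x @ torch.cos(angles).T                     # (..., n_bins)
        imag = x @ torch.sin(angles).T
        return torch.complex(real, imag)
```

Under this reading, the sensitivity question becomes how far the initial bin placement sits from the informative IPD band that gradient descent must then reach.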
**Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization**

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:**

This paper introduces AGG-RL, a novel framework for sound source localization (SSL). It aims to achieve SSL with geometric invariance and grid flexibility by jointly learning audio-geometric representations and grid representations in a shared latent space. To this end, the approach proposes a learnable non-uniform discrete Fourier transform (LNuDFT) that assigns frequency bins based on physical informativeness, and a relative microphone position encoding (rMPE) aligned with TDOA. Experiments were conducted on synthetic and real datasets, demonstrating improved performance over baselines.

**Strengths:**

- **Originality**: The integration of physics-informed design (LNuDFT, rMPE) with a unified latent space for audio, geometry, and grid is original and well motivated by known limitations of geometry-specific and fixed-grid methods in SSL.
- **Quality**: The experimental design is thorough: ablation studies, varied datasets (including real recordings), and consistent baselines provide credible evidence for the claims of generalization and robustness. The efficiency analysis (parameter count, FLOPs) is also useful.
- **Clarity**: The paper is mostly well written, with helpful architectural diagrams and clear mathematical formulations. The method and its motivation are explained coherently, and the appendices include implementation details and sample code links.
- **Significance**: The experimental results demonstrate that this framework provides substantial improvements to SSL in real-world environments where array and grid conditions may vary.

**Weaknesses:**

1. **(Lack of justification for the claim of being physics-informed)** The abstract presents LNuDFT and rMPE as "physics-informed components", but the reason why they are "physics-informed" is not explained.
   - Although Appendix A.1 supports to some extent that LNuDFT's trainable bins emphasize informative phase regions/cues, the claim that "it emphasizes informative regions" rests only on common knowledge about speech signals. (For example, would the frequency response of the LNuDFT parameters be different for other signals, e.g., bird chirps or ambient wind noise?)
   - Similarly, for rMPE, it was not demonstrated that this method actually captures inter-channel time differences more effectively. Performing the DFT non-uniformly and encoding information relatively can be done independently of physics; connecting them to physical phenomena seems to require more detailed justification (the far-field TDOA identity sketched after this review states the invariance at issue).
2. **(Difficulty in result analysis)** While it is commendable that comparisons were made on both real and synthetic datasets, the fact that all microphone arrays in the synthetic data were chosen as dynamic microphone arrays complicates the analysis.
   - In particular, the results in Table 3 reveal a noticeable trend of performance differences, but it is unclear whether this stems from differences between real and synthetic data or from differences in the dynamic microphone arrays. The authors also did not draw a clear conclusion on this point.

**Questions:**

1. Experiment (v) still outperforms the proposed method on the Dynamic-U data and performs on par with it on the NAO robot, but significantly degrades on the Eigenmike. Could the authors expand on this discrepancy?
2. In L70, "Both components facilitate the extraction of spatial representations with physics-based inductive biases": why are LNuDFT and rMPE considered as imposing (physics-based) inductive biases? To me, they read as introducing learnable parameters that make feature extraction more flexible, which I would see as relaxing the inductive bias rather than imposing one.
3. Can the LNuDFT initialization or update scheme get stuck in poor local minima, e.g., if the initial frequency allocation is far from optimal? Have the authors tried more physically motivated or data-driven initializations beyond the logit-based mapping?

**EditLens Prediction:** Lightly AI-edited
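On the rMPE point raised in both the weaknesses and the questions, the far-field identity that a relative encoding appeals to can be stated in a few lines. This is a sketch under assumed names, not the authors' code:

```python
import torch

SPEED_OF_SOUND = 343.0  # m/s

def relative_positions(mic_xyz, ref=0):
    """Relative microphone positional encoding as I read it: each
    microphone's coordinates minus those of a reference channel
    (hypothetical helper, not the authors' implementation)."""
    return mic_xyz - mic_xyz[ref]

def farfield_tdoa(mic_xyz, doa_unit, ref=0):
    """Far-field TDOA of each mic w.r.t. the reference:
    tau_m = (r_m - r_ref) . u / c.
    Only relative positions enter, which is the physical link rMPE appeals to."""
    return relative_positions(mic_xyz, ref) @ doa_unit / SPEED_OF_SOUND

mics = torch.tensor([[0.00, 0.00, 0.00],
                     [0.04, 0.00, 0.00],
                     [0.00, 0.04, 0.00]])
u = torch.tensor([1.0, 0.0, 0.0])     # source direction (unit vector)
print(farfield_tdoa(mics, u))         # identical output if the array is translated:
print(farfield_tdoa(mics + 5.0, u))
```

Translating the whole array leaves the TDOAs unchanged, so an encoding relative to a reference channel discards exactly the information (absolute position) that the physics says is irrelevant; whether the learnt rMPE actually exploits this invariance is what the review asks the authors to demonstrate.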