SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a new self-supervised distillation technique to unify speech and audio SSL models.
Using MVQ representations from two well-known SSL models (WavLM Large for speech and Dasheng 1.2B for audio), the authors train a Zipformer on filterbanks to produce representations H from which the quantized representations of WavLM and Dasheng can be effectively reconstructed.
The SPEAR model, obtained by distilling SSL models in an SSL manner, is therefore an SSL model itself.
Beyond merely preserving the capabilities of both models, SPEAR outperforms them on most tasks of the SUPERB and HEAR benchmarks, for speech and audio respectively.
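For concreteness, my understanding of the MVQ targets can be sketched as follows (hypothetical shapes and names; nearest-centroid assignment is used purely for illustration, since I am glossing over how the codebooks themselves are trained, and this is not the authors' code):

```python
import torch

# Hypothetical shapes and names, not the authors' code: each of N codebooks
# assigns every teacher frame to one of K centroids, giving the student
# N discrete index streams per frame to predict.
T, D = 100, 1024                    # teacher frames and feature dimension
N, K = 8, 256                       # number of codebooks and entries per codebook

teacher_feats = torch.randn(T, D)   # e.g. frame-level features from a WavLM layer
codebooks = torch.randn(N, K, D)    # N codebooks (their training is omitted here)

# Nearest-centroid assignment per codebook yields the discrete targets.
mvq_targets = torch.stack([
    torch.cdist(teacher_feats, cb).argmin(dim=-1)   # (T,) indices for one codebook
    for cb in codebooks
])                                                  # (N, T) targets in [0, K)
```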
The article is very clear and shows an interesting way of merging speech and audio representations.
The structure is linear, and the experiments compare the model to existing ones in great detail on well-known benchmarks.
Still, a few weak points can be identified:
This article combines multiple factors of improvement: the combination of speech and audio teacher models, the use of MVQ representations to jointly represent them, the specific choice of WavLM and Dasheng as teachers, and an architecture different from the wav2vec 2.0 framework (namely filterbanks + Zipformer).
First: one or more ablation studies would have been interesting to better judge which aspects of the pipeline have the greatest impact.
Second: one of the main points of the article is to propose a new SSL model that outperforms WavLM-Large (which has been the top model on the SUPERB benchmark for some time now). However, it should be made clearer that this model is actually distilling WavLM-Large during its SSL training.
About the ablation studies previously mentioned:
Did you try replacing your model with a wav2vec 2.0 architecture?
Or did you try using the same L1 loss as in USAD, removing the MVQ while keeping everything else similar?
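To make the second question concrete, here is a minimal sketch of the two objectives I have in mind (hypothetical names and shapes, not taken from either paper):

```python
import torch
import torch.nn.functional as F

T, D = 100, 1024                    # frames and feature dimension
N, K = 8, 256                       # codebooks and entries per codebook
student_feats = torch.randn(T, D, requires_grad=True)

# (a) USAD-style continuous regression directly against teacher features.
teacher_feats = torch.randn(T, D)
l1_loss = F.l1_loss(student_feats, teacher_feats)

# (b) MVQ-style prediction of discrete codebook indices: one classification
#     head per codebook, cross-entropy averaged over the N index streams.
mvq_targets = torch.randint(0, K, (N, T))       # as produced by the teacher-side quantiser
heads = torch.nn.ModuleList(torch.nn.Linear(D, K) for _ in range(N))
ce_loss = sum(F.cross_entropy(head(student_feats), mvq_targets[n])
              for n, head in enumerate(heads)) / N
```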
Fully human-written

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes SPEAR, a self-supervised model to unify speech and audio representations. This is achieved by using a masked token prediction loss between the student model and domain-specific teacher models. The paper achieves strong performance across speech and audio benchmarks such as SUPERB and HEAR.
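For clarity, the masked token prediction setup mentioned above can be sketched roughly as follows (a single-codebook sketch with hypothetical names, not the authors' code):

```python
import torch
import torch.nn.functional as F

T, K = 100, 256                                  # frames and codebook size
logits = torch.randn(T, K, requires_grad=True)   # student predictions per frame
targets = torch.randint(0, K, (T,))              # teacher-derived discrete targets
masked = torch.rand(T) < 0.5                     # frames whose input was masked

# Cross-entropy is computed per frame; the masked frames carry the loss
# (the paper may additionally weight unmasked frames).
per_frame = F.cross_entropy(logits, targets, reduction="none")
loss = per_frame[masked].mean()
```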
The paper is well written, and the experimental analysis is comprehensive and detailed. The student model achieves strong performance on many tasks, even outperforming a teacher model such as WavLM on some of them.
The paper has limited novelty: multi-codebook vector quantization has already been proposed for knowledge distillation, and distilling from modality-specific teachers to build a unified model has been done before, e.g., in USAD.
Line 15: “the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data.” What makes it the first SSL framework when other methods, such as USAD (one of the main baselines of the paper), have proposed unified models to combine the audio and speech processing capabilities into a single model?
On the HEAR benchmark (Table 6), the Dasheng 0.6B model outperforms the SPEAR XLarge model in average performance across environment, speech, and music tasks. On environment and music tasks the Dasheng teacher model is better, while on speech SPEAR performs better. On the SUPERB benchmark (Table 5), the SPEAR-Large model matches the performance of the WavLM teacher.
Only after scaling to 600M parameters does SPEAR outperform the WavLM teacher. For audio tasks, even with scaling, the model does not seem to outperform the Dasheng teacher. Why does this discrepancy between the audio and speech domains exist?
In Table 5, why is the Dasheng model not included? It would be interesting to see how the representations extracted from Dasheng perform on the SUPERB benchmark.
Given the strong performance of the Dasheng teacher model on the HEAR benchmark, do we really need to distill from domain-specific teacher models to build unified models, or is directly training a model with an SSL objective using both modalities sufficient?
Fully human-written

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper proposes SPEAR, a unified self-supervised learning (SSL) framework for speech and general audio. The core idea is to replace coarse token targets (k-means) with MVQ tokens extracted from strong teacher models (WavLM-Large for speech and Dasheng-1.2B for general audio) and to train a Zipformer encoder by masked token prediction. A dual-domain asymmetrical loss encourages a joint representation space while weighting audio vs. speech targets during mixed training. Models are trained at 94M, 327M, and 600M scales on mixtures of up to 197k hours; results are reported on LibriSpeech ASR, AudioSet audio tagging, SUPERB, and HEAR, with consistent gains vs. WavLM and USAD baselines.
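For reference, one plausible reading of the dual-domain asymmetrical weighting is sketched below (hypothetical names; an illustrative sketch, not the paper's exact formulation):

```python
import torch.nn.functional as F

def dual_domain_loss(speech_logits, audio_logits, speech_tgts, audio_tgts,
                     is_speech_batch, lam=0.1):
    """Illustrative sketch only, not the paper's code.

    speech_logits, audio_logits: (N, T, K) predictions from the two target heads.
    speech_tgts, audio_tgts:     (N, T)    MVQ indices from WavLM / Dasheng.
    """
    loss_speech = F.cross_entropy(speech_logits.flatten(0, 1), speech_tgts.flatten())
    loss_audio = F.cross_entropy(audio_logits.flatten(0, 1), audio_tgts.flatten())
    if is_speech_batch:
        # In-domain targets dominate; the cross-domain term enters with a small weight.
        return loss_speech + lam * loss_audio
    return loss_audio + lam * loss_speech
```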
1. **Originality.** Uses MVQ tokens from teacher models to supply fine-grained discrete supervision for both speech and audio, contrasting with k-means/RPQ in prior speech SSL, and introduces an asymmetrical dual-domain loss.
2. **Quality.** Thorough empirical study: LibriSpeech ASR (RNN-T and CTC), AudioSet tagging, SUPERB, HEAR; dual-domain gains and scaling trends; targeted ablations (teacher, codebooks, dual-domain strategy).
3. **Clarity.** MVQ formulation and encoding/decoding are clearly described. Zipformer specs and training settings are documented.
4. **Significance.** Strong performance across domains: AS-2M mAP 50. Consistent improvements vs. WavLM at matched scales on SUPERB.
1. **Compute transparency.** The paper lists steps and batch sizes but provides no hardware or GPU hours accounting for pre-training and fine-tuning. For a 600M dual-domain model on up to 197k hours, the compute cost should be reported.
2. **Attribution of gains.** While Appendix G ablates several factors, the main text does not clearly isolate the source of gains (teacher strength vs. MVQ vs. Zipformer vs. data scale). A controlled comparison using the same encoder and data but different quantization targets (MVQ vs. k-means/RPQ) under the same training loss would better attribute the sources of improvement. The current k-means comparison is qualitative rather than a trained baseline.
3. **Quantization comparisons.** MVQ vs. k-means is visualized in Appendix G.4, but there is no empirical head-to-head comparison against SPEAR trained on k-means (or RPQ) targets. Including such baselines would strengthen the claim that fine-grained MVQ targets are the primary driver.
4. **Teacher dependence and fairness.** The audio teacher (Dasheng-1.2B, 272k hours) is very large. Although USAD is compared, a more systematic study of teacher combinations/sizes (e.g., smaller audio and speech teachers) would clarify how much of the improvement is inherited from teachers vs. MVQ.
5. **Main paper balance.** Many key analyses (teacher choice, codebooks, loss balancing, dual-domain strategies) are relegated to the appendix, making it difficult for readers to assess mechanisms from the main narrative.
1. **Quantization approaches.** Please add direct baselines where SPEAR uses k-means or RPQ tokens (same teacher layers, same training loss, same data/steps) to quantify the specific benefit of MVQ. The paper already positions k-means/RPQ as common in speech SSL and visualizes MVQ vs. k-means subspaces. A training baseline would make the argument causal.
2. **Ablations to attribute gains.** Please surface in the main paper a compact set of ablations that answer: (a) teacher choice (HuBERT vs. WavLM for speech; ATST-Frame vs. Dasheng for audio), (b) codebook count $N$ (e.g., 4/8/16) and fixed $K=256$, and (c) dual-domain strategy (joint/disjoint/asymmetrical). These are in Appendix G. Summarizing them centrally would clarify the source of performance.
3. **Teacher combinations and scale.** USAD distills from smaller teachers; could you include experiments with smaller audio teachers to test whether MVQ still confers advantages when teacher capacity/data are reduced?
4. **Compute and resources.** Please report hardware (GPU/TPU type and count) and total pre-training steps $\times$ batch seconds, split by model scale (94M/327M/600M). This is essential given the scale of Mix-197k and the 600M model. The current paper lists steps and batch sizing, but not the computation.
5. **Objective sensitivity.** You tune $\alpha$ (masked vs. unmasked loss weight) and adopt $\lambda=0.1$ for the dual-domain term. Could you show sensitivity curves for $\lambda$ (and possibly per-domain $\alpha$) to establish robustness and to guide other researchers?
6. **Encoder architecture.** Zipformer is a design choice. A brief ablation (even at Base scale) comparing Zipformer vs. a Conformer/Transformer under SPEAR would help separate architectural benefits from MVQ/teacher effects.
7. **Quantizer training details.** You state $K=256$ for storage efficiency and vary $N$; please also report the codebook training compute and whether using earlier/later teacher layers for MVQ training changes downstream results (you use the 21st WavLM layer for speech and the last layer for Dasheng).
Fully AI-generated

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
## An attempt to unify speech and audio SSL representations
* This paper proposes SPEAR, essentially a model for audio and speech representation learning trained through knowledge distillation from MVQ-quantised targets, generated from existing speech (WavLM) and audio (Dasheng) SSL models, via a joint masked prediction objective.
* Three distinct configurations of SPEAR are evaluated (audio-only, speech-only, and joint audio-speech) across the HEAR and SUPERB benchmark suites.
* The paper is an excellent read: it's easy to follow and understand, and experiments across the main benchmark suites spanning speech and audio domains are commendable.
* The neural architecture employed is straightforward, and as someone who finds great strength in simplified architectures, I am glad to see a straightforward approach pan out relatively well.
However, the paper has several weaknesses which are grounds for the given score.
## Point 1
The paper is not sufficiently well-motivated, in my opinion. Why do we need a unified model for speech and audio, and is masked discretised token prediction actually the right way to go about it? Maybe the community has not tried to learn a unified SSL approach through prediction of discrete tokens because discretising fine-grained acoustic content (audio) and semantic/phonetic content (speech) in a single representation is tremendously difficult and always presents a performance tradeoff between the two aspects, as demonstrated by the large swath of audio tokenisation papers trying to crack this problem.
## Point 2
* The approach lacks novelty. It is essentially a model trained by masked prediction of quantised targets obtained from existing models (in this case, SSL approaches). EncodecMAE [1] learns an audio-only SSL representation in an extremely similar fashion, using Encodec [2] to generate the quantised targets instead, and it outperforms BEATs too. But there is no mention of EncodecMAE in the submitted paper.
* Even if we let the above omission slide, the proposed approach is simply not novel enough for an ICLR paper. The novelty is entirely in the application. MVQ, the two teacher models, distillation on quantised targets, and model architecture: they have all been done before. The dual objective framework is also novel only in application to audio and speech SSL unification.
## Point 3
Speaking of BEATs, the submitted paper states that BEATs has substantial training complexity. That is true: BEATs has a 3-stage training process and indeed has high training costs. However, so does the proposed approach, albeit indirectly. The prerequisite for SPEAR is not one but two frontier pretrained SSL models, and SPEAR then needs to be trained on top of it all on very large-scale datasets. BEATs, on the other hand, does not require an existing large-scale SSL model; it requires iterative refinement because the tokenizer is, well, iteratively refined from cold start. The commentary on simplicity and training complexity within the context of this paper is therefore moot.
## Point 4
* Results for audio tasks (HEAR, Table 6) are not very encouraging.
* The paper states that "Finally, our dual-domain models consistently outperform their single-domain counterparts, highlighting the benefits of our unified pre-training framework". However, for both Base and Large models, the majority of the performance improvement of SPEAR-(a+s) over SPEAR-(a) (an improvement that is relatively small in proportion to the increase in pretraining dataset size) stems from improved speech performance, which is not surprising.
* Further, despite being distilled from Dasheng 1.2B and having more parameters, the Base and Large SPEAR-(a) and SPEAR-(a+s) models trail behind Dasheng-Base, the smallest Dasheng model, by quite a bit. The only case where SPEAR does better than the Dasheng models is when using feature concatenation for the XLarge configuration. The presented analysis shows no benefit to training a complex SPEAR Audio+Speech model for audio SSL.
* These observations tie in with Point 1. If the unification of the audio and speech fronts were a significant novelty and the model were an SSL approach pretrained from scratch, one could argue that future research would close the gap. But given the worse audio representation learning performance, the merits of the SPEAR approach over the existing, publicly available pretrained Dasheng models are not clear.
## Point 5
* All the models presented in Table 6 are trained on different amounts of data, which only obfuscates the results. BEATs and ATST-Frame are both trained on only 5k hours of audio data. The SPEAR models performing worse than the Dasheng models could also be explained by the larger amount of pretraining data used by Dasheng, but we simply cannot tell from the presented analysis.
## Overall
My recommendation to reject is a reflection of the above-mentioned shortcomings.
[1] https://www.isca-archive.org/interspeech_2025/pepino25_interspeech.html
[2] https://openreview.net/forum?id=ivCd8z8zR2
No direct questions; kindly address the points raised in the weaknesses section.
Fully human-written |