Myna: Masking-Based Contrastive Learning of Musical Representations
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes Myna, an efficient, masking-based contrastive learning framework for music representation learning. It replaces traditional music augmentations with high-rate (90%) token masking on mel-spectrograms. This heavy masking drastically reduces the number of input tokens, allowing large batch sizes (4096) on a single GPU and yielding a substantial efficiency gain over prior contrastive methods. The authors also introduce a hybrid patching scheme (combining vertical and square patches) to capture complementary features (general-purpose vs. pitch structure). The model is pretrained on the public AudioSet music subset. Myna achieves performance competitive with larger models trained on private data and establishes a new SOTA among models trained on public data.
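For concreteness, here is a minimal sketch of how I understand the masking-based contrastive setup (PyTorch-style; the patch size, keep ratio, stand-in encoder, positive-pair construction, and one-directional InfoNCE loss are my own assumptions for illustration, not details taken from the paper; positional embeddings and other components are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(mel, patch=16):
    """Split a (batch, freq, time) mel-spectrogram into flattened patch tokens."""
    B, Fq, T = mel.shape
    mel = mel.unfold(1, patch, patch).unfold(2, patch, patch)       # (B, Fq/p, T/p, p, p)
    return mel.reshape(B, -1, patch * patch)                        # (B, num_tokens, p*p)

def random_keep(tokens, keep_ratio=0.1):
    """Keep a random 10% of tokens per example, i.e., 90% masking."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]               # random token subset per example
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

# Stand-in for the ViT backbone: linear patch embedding + a small transformer encoder.
embed = nn.Linear(16 * 16, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)

def encode(view):
    return encoder(embed(view)).mean(dim=1)                         # mean-pool to a clip embedding

mel = torch.randn(8, 128, 256)                                      # toy batch of mel-spectrograms
tokens = patchify(mel)
z1, z2 = encode(random_keep(tokens)), encode(random_keep(tokens))   # two independently masked views
logits = F.normalize(z1, dim=-1) @ F.normalize(z2, dim=-1).T / 0.07 # similarity matrix over the batch
loss = F.cross_entropy(logits, torch.arange(len(logits)))           # one-directional InfoNCE for brevity
```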
1. The mask-only approach is simple and allows single-GPU, large-batch training (batch size 4096), which translates to an 85x efficiency gain over traditional contrastive methods such as CLMR. The model achieves average scores (68.6 for Myna-Hybrid) competitive with MERT-95M and surpasses public-data baselines such as MERT-95M-public and MULE.
2. The hybrid patch design integrates frequency-sensitive vertical patches and improves key detection (achieving SOTA among self-supervised methods). By avoiding traditional data augmentations (e.g., pitch shifts), the method also retains pitch sensitivity, which benefits key-related tasks.
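To make the second point concrete, here is a toy sketch of what I understand the hybrid patching to look like (the patch dimensions and the choice to concatenate both token types into a single sequence are my own guesses, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class HybridPatchEmbed(nn.Module):
    """Embed a mel-spectrogram with two patch shapes: square patches (general-purpose)
    and full-height vertical patches (frequency/pitch-sensitive)."""
    def __init__(self, n_mels=128, square=16, vert_width=2, dim=256):
        super().__init__()
        self.square_proj = nn.Conv2d(1, dim, kernel_size=square, stride=square)
        self.vert_proj = nn.Conv2d(1, dim, kernel_size=(n_mels, vert_width), stride=(n_mels, vert_width))

    def forward(self, mel):                                       # mel: (batch, 1, n_mels, time)
        sq = self.square_proj(mel).flatten(2).transpose(1, 2)     # (B, n_square_tokens, dim)
        vt = self.vert_proj(mel).flatten(2).transpose(1, 2)       # (B, n_vertical_tokens, dim)
        return torch.cat([sq, vt], dim=1)                         # one shared token sequence

tokens = HybridPatchEmbed()(torch.randn(4, 1, 128, 256))
print(tokens.shape)        # torch.Size([4, 256, 256]): 128 square + 128 vertical tokens
```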
1. Table 1 mixes baselines trained on public and private data (e.g., MERT-330M) without clearly stating each model's training data and resource budget.
2. The claim that "90% masking performs best" is not strongly supported by Figure 4, for two reasons: (a) the performance differences across high masking ratios look marginal and are not tested for statistical significance; (b) the "average across all four benchmarks" curve is mathematically questionable, since it averages different metrics from different tasks.
3. The model's poor performance on EmoMusic is attributed to the short input clip length, but this hypothesis is not verified empirically.
1. It would be helpful if Table 1 were explicitly partitioned to clearly distinguish models trained on public data from those trained on private or internal corpora.
2. Could you provide supplementary figures showing the performance curves across different masking ratios for each of the four downstream tasks (MTT, GiantSteps, GTZAN, and EmoMusic)?
Lightly AI-edited
Myna: Masking-Based Contrastive Learning of Musical Representations
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
In this paper, the authors propose a new training method for representation learning of music audio. The proposed method uses aggressive input masking, which seems to allow dropping pitch shifting as an augmentation step and thereby keeps the model aware of pitch and key information. The experiments show that, even without fine-tuning on the downstream tasks' training sets and despite its smaller size and training data, Myna outperforms many other methods.
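As a rough back-of-the-envelope illustration of why such aggressive masking makes large-batch, single-GPU training feasible (the spectrogram and patch sizes below are my own assumptions, not the paper's exact numbers):

```python
# Toy token-count arithmetic for 90% masking.
n_mels, n_frames, patch = 128, 1000, 16                  # assumed input size and square patch
tokens_full = (n_mels // patch) * (n_frames // patch)    # 8 * 62 = 496 tokens without masking
tokens_kept = int(tokens_full * 0.1)                     # 49 tokens after 90% masking
attn_ratio = (tokens_kept / tokens_full) ** 2            # self-attention cost scales quadratically
print(tokens_full, tokens_kept, round(attn_ratio, 3))    # 496 49 0.01 -> roughly 100x cheaper attention
```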
- Good performance
- Parameter-efficient
- Trained on a public dataset only
- The proposed method is simple
- Limited novelty: some core components, such as the ViT backbone and masked-autoencoder-style input masking, have already been proposed in other, similar work, including in the audio domain.
- Although the performance is strong, the margin over prior work is modest rather than outstanding.
- I don't think the audio front end should be called a "tokenizer", no matter how overused the word is in the community. It does not tokenize (i.e., turn the input into a discrete representation) at all, and the naming is especially confusing because some architectures do actually discretize the input audio.
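To make the terminology point concrete (toy code of my own, not from the paper): a patch embedding outputs continuous vectors, whereas tokenization in the usual sense maps each patch to a discrete codebook index, VQ-style.

```python
import torch
import torch.nn as nn

patches = torch.randn(4, 49, 256)                  # (batch, num_patches, flattened patch)

# What the paper calls "tokenization": a continuous linear patch embedding, nothing discrete.
patch_embed = nn.Linear(256, 256)
continuous = patch_embed(patches)                  # (4, 49, 256) real-valued vectors

# What tokenization usually implies: a lookup into a discrete codebook (VQ-style).
codebook = torch.randn(1024, 256)                  # 1024 discrete codes
dists = torch.cdist(patches, codebook.unsqueeze(0).expand(len(patches), -1, -1))
token_ids = dists.argmin(dim=-1)                   # (4, 49) integer ids: a genuinely discrete representation
```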
Fully human-written
Myna: Masking-Based Contrastive Learning of Musical Representations
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper focuses on music representation learning. It follows a contrastive learning framework, with its main contributions being the use of a Vision Transformer (ViT) as the backbone model and the application of token masking. Furthermore, considering the characteristics of music analysis, the authors extend the approach into a hybrid model that incorporates vertical patches to better capture the frequency-related features of spectrograms. Through this relatively simple training strategy, the proposed model achieves competitive performance on several downstream tasks compared to models that require more than 5× the training time and parameters. Overall, the paper is well written and presents a solid contribution to efficient representation learning for music.
The proposed use of a ViT with token masking is promising for music representation learning.
The paper is easy to read, and the presentation of the proposed method, experimental design, and results is clear.
The proposed method seems applicable only to clip-level MIR tasks. I would like to hear the authors' opinion (or see a discussion) on how the proposed architecture could be applied to frame-level tasks as well.
I also wonder about the effect of patch size variations on performance. For example, how would diverse patch sizes such as 4x4, 96x2, 128x3, 32x32, or hybrids of them affect the results?
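For reference on the compute side of this question, here are the token counts implied by some of these patch sizes on an assumed 128-mel x 1000-frame spectrogram (my own illustrative numbers):

```python
# Token counts for different (freq x time) patch sizes on a 128 x 1000 mel-spectrogram.
n_mels, n_frames = 128, 1000
for fh, fw in [(4, 4), (96, 2), (128, 3), (32, 32), (16, 16)]:
    print(f"{fh}x{fw}: {(n_mels // fh) * (n_frames // fw)} tokens")
# 4x4: 8000, 96x2: 500, 128x3: 333, 32x32: 124, 16x16: 496
```

So beyond accuracy, very small patches such as 4x4 would also change the compute budget considerably.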
Fully human-written |