FiGuRO - Intrinsic Dimension Estimation for Multi-Modal Data
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces FiGuRO (Fidelity-Guided Rank Optimization), a framework for intrinsic dimension (ID) estimation in unimodal and multimodal data. FiGuRO combines rate-distortion theory with low-rank projections that adaptively adjust latent dimensionality based on reconstruction fidelity. It separates shared and private subspaces across modalities, providing interpretable and efficient ID estimation. Experiments on synthetic and real data demonstrate accurate and stable results outperforming existing baselines.
1. FiGuRO combines rate-distortion theory with adaptive low-rank learning, providing a principled and theoretically sound approach for intrinsic dimension estimation in multimodal data.
2. The method is evaluated on various synthetic and real datasets, covering scalar, image, and temporal modalities, and shows improvements over baselines in both stability and accuracy.
3. The insight that different modalities may have different intrinsic dimensions is realistic and novel, and significant for real-world applications.
4. The framework can be implemented within standard autoencoders. This simplicity makes FiGuRO highly practical and broadly applicable across different data types and model settings, enhancing its potential for adoption in real-world multimodal learning pipelines.
1. Some implementation details (e.g., how the distortion budget $\gamma$ or threshold $\lambda$ is set) are not fully discussed, which may affect reproducibility. Moreover, sensitivity to these hyperparameters is not evaluated.
Is the method sensitive to these hyperparameters, and is there any experimental validation of this aspect?
Fully AI-generated
---
FiGuRO - Intrinsic Dimension Estimation for Multi-Modal Data |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper focuses on the problem of learning latent representations for multi-modal data, for which determining the intrinsic dimension is crucial. The authors propose a method to adaptively adjust the intrinsic dimension while learning multi-modal autoencoders. Specifically, they multiply the latent representation by a weight matrix $W$ and apply SVD to $W$ every several epochs. The criterion for choosing the rank is reconstruction fidelity: if fidelity remains high, the rank is reduced; otherwise, it is increased. This rank-adaptation process is performed after pre-training of the AE, until the learned rank converges. The authors conduct experiments on several simulated datasets and two real datasets to show that the proposed model can learn the intrinsic dimensions.
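To make my reading of the procedure concrete, here is a minimal sketch of one rank-adaptation step. The function names, the squared-error fidelity measure, and the one-step rank updates are my own assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def adapt_rank(W, Z, X, decode, k, budget):
    """One rank-adaptation step as I understand Algorithm 1 (a
    hypothetical paraphrase). Z holds encoder outputs for a batch X,
    decode maps projected latents back to data space, k is the
    current rank, and budget is the tolerated reconstruction error."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)

    def distortion(r):
        # Reconstruction error after truncating W to rank r.
        W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
        return np.mean((X - decode(Z @ W_r)) ** 2)

    if k > 1 and distortion(k - 1) <= budget:
        return k - 1  # fidelity preserved: shrink the rank
    if distortion(k) > budget:
        return k + 1  # current rank too lossy: grow the rank
    return k
```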
The method is simple to use, applies to many AE architectures, and could become an essential component in a range of ML models.
1. Lack of strong baselines. The authors compare with several baselines in simulation, but not for the real-world data analysis. For the results in Table 2, the authors list the range of estimates under varying hyperparameters; the proposed method has the narrowest range, closest to the expected value. But could this be due to inappropriate hyperparameter choices for the other methods?
2. In the experiments, the authors mainly show results for estimating the rank, but it is unclear what the practical advantage is, e.g., in reconstruction performance or interpretability.
3. Although the authors claim that the proposed method targets multi-modal data, I do not see designs specific to the multi-modal setting: the rank-selection procedure is simply applied to each modality sequentially. Moreover, applying the method in multi-modal settings may require choosing different fidelity measures for different data domains.
1. As noted in the paper, ARR is closely related work. ARR applies SVD to the latent representations, whereas this work applies SVD to a projection matrix. What is the motivation for this difference?
2. Did the authors consider adding sparse regularization during either pre-training or rank optimization? Would it help the rank learning?
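For concreteness, one hypothetical variant of what I have in mind is a nuclear-norm penalty on the projection matrix $W$, added to the reconstruction loss (a sketch under my own assumptions, not something proposed in the paper):

```python
import torch

def loss_with_rank_penalty(recon_loss, W, weight=1e-3):
    # The nuclear norm (sum of singular values) is a convex surrogate
    # for rank; penalizing it drives small singular values of W toward
    # zero, which could sharpen the later SVD-based rank selection.
    return recon_loss + weight * torch.linalg.svdvals(W).sum()
```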
Fully human-written
---
FiGuRO - Intrinsic Dimension Estimation for Multi-Modal Data |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors propose a method to estimate the intrinsic dimension of multimodal data, split into shared and source-specific components. To achieve this, they rely on paired autoencoders and iteratively adapt the latent dimension by learning a low-dimensional projection. Experiments on simulated and real data show the utility of the method.
1. **Novel methodology.** The proposed method of greedily adapting the latent dimension rank (rate) to achieve the desired error (distortion) is novel and appears justified.
2. **Clear motivation and writing.** The paper is mostly clear and well-explained. The problem is well-motivated.
1. **Lack of comparisons with multi-view data decomposition methods.** On line 288, the authors claim a lack of "multi-modal ID estimation techniques," which is not, strictly speaking, true. There is a vast literature on multi-view data decomposition, with canonical methods such as CCA [1], JIVE [2], AJIVE [3], DIVAS [4], and PPD [5]. While these methods assume a specific linear model of joint and individual mixing, I believe they provide a strong baseline that any non-linear method must be compared with. Moreover, unlike the current method, which only estimates the joint and individual ranks, the methods [1-5] also produce estimates of the joint and individual subspaces (the spanning sets). I believe this is a critical omission in the current work.
2. **Motivating examples.** The authors list the following plausible reasons why ID estimation is useful:
> (i) performance of deep neural networks has been shown to depend on the intrinsic rather than ambient dimension
> (ii) need interpretable models
> (iii) want to know whether expensive or difficult-to-obtain modalities are relevant
None of these is sufficiently illustrated in the experiments section: the authors apply their method, obtain the ID of each data source, and then merely check whether it agrees with their intuition.
3. **Corner (degenerate) cases.** I believe the method crucially assumes that $k_s > 0$. In practice, practitioners may want to apply the method precisely in the setting where the shared signal between the modalities is uncertain. However, from examining the algorithm steps, I believe the method would fail to converge in such a case.
The exact setup I have in mind: suppose Assumptions 1-4 hold (oracle autoencoders and the correct data-generating mechanism), but $k_s = 0$, $k_1 = 10$, $k_2 = 10$, i.e., there is no joint signal. Going through the steps of Algorithm 1, because the updates to the individual/joint ranks are combined from Eqn. 4 and lines 13-14 of the algorithm, the problem becomes unidentifiable. I can see two distinct fixed points between which the algorithm would oscillate: estimating the joint to be of rank $20$, or two individual subspaces of rank $10$ each. I do not see any mechanism by which the algorithm could correctly identify which option to prefer.
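To illustrate the ambiguity numerically, here is a self-contained toy example under my own linear data-generating assumptions (not the paper's exact setup). Both candidate fixed points reconstruct the data essentially perfectly, so a fidelity-only criterion cannot choose between them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2, d = 500, 10, 10, 50

# Two modalities with independent latents: k_s = 0 by construction.
Z1, Z2 = rng.standard_normal((n, k1)), rng.standard_normal((n, k2))
X1 = Z1 @ rng.standard_normal((k1, d))
X2 = Z2 @ rng.standard_normal((k2, d))

def distortion(X, Z):
    # Relative error of the best linear reconstruction of X from Z.
    B, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return np.linalg.norm(X - Z @ B) / np.linalg.norm(X)

Z_joint = np.hstack([Z1, Z2])  # option A: one joint block of rank 20
print("joint rank 20:      ", distortion(X1, Z_joint), distortion(X2, Z_joint))
print("individual 10 + 10: ", distortion(X1, Z1), distortion(X2, Z2))
# Both print values near machine precision: fidelity alone cannot
# identify whether the signal is joint or individual.
```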
---
[1] Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
[2] Lock, E. F., et al. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1), 523.
[3] Feng, Q., et al. (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis, 166, 241–265.
[4] Prothero, J., et al. (2024). Data integration via analysis of subspaces (DIVAS). TEST, 33(3), 633–674.
[5] Sergazinov, R., Taeb, A., & Gaynanova, I. (2024). A spectral method for multi-view subspace learning using the product of projections. arXiv preprint arXiv:2410.19125.
1. What is the purpose of $k_{\min}/k_{\max}$? I do not see them used anywhere in the algorithm.
2. Line 17 of the algorithm says "set ranks to their previous values". What are the initial values? If the algorithm has just started, the ranks presumably must already be initialised?
3. There are a few typos: (i) line 188, "perform do"; (ii) line 469, "Alltogether". The paper needs to be proofread.
Fully human-written
---
FiGuRO - Intrinsic Dimension Estimation for Multi-Modal Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Previous dimension-estimation methods usually use a single, static dimension for the representation of the whole dataset, or estimate it contrastively against other datasets. This is ill-suited to multi-modality data. With this motivation, the paper proposes FiGuRO, which estimates the shared dimension and a private dimension for each modality.
1. Highlights that a single unified dimension is ill-suited to multi-modality data.
2. Proposes FiGuRO, which uses SVD to estimate the dimension under a rate-distortion constraint.
1. Some notation is unclear. In equations (3) and (4), SVD is used to estimate the intrinsic dimension, but the paper does not describe how the shared dimension $k_s$ and private dimension $k_1$ are estimated; instead, the low-rank weight matrices are used directly in equation (4).
2. Experiments are mostly conducted in simulation. On real-world data, a good dimension should result in better performance on downstream tasks, so more evaluation on downstream tasks would be helpful.
3. SVD and rate-distortion theory are both commonly used for intrinsic dimension estimation; the method essentially replicates SVD-based estimation for multi-modality data.
1. Is only $R(D)$ used for the loss function?
2. If the distortion threshold $\lambda$ has a special meaning, why is it the only hyperparameter analyzed in Figure 2?
3. In Figure 2, the evaluation metrics (accuracy and goodness of $R$) are clear, but how should the rank in Figure 2A be read? Does proximity to the ground truth (dashed lines) indicate better performance?
Fully human-written |