ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (25%)  | 4.00       | 4.00           | 5457
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 1 (25%)  | 4.00       | 4.00           | 6587
Lightly AI-edited    | 1 (25%)  | 6.00       | 4.00           | 4703
Fully human-written  | 1 (25%)  | 2.00       | 5.00           | 3021
Total                | 4 (100%) | 4.00       | 4.25           | 4942
Reviews (each entry lists the title, ratings, review text, and EditLens prediction)
Review 1: From Editor to Dense Geometry Estimator

Soundness: 3: good. Presentation: 3: good. Contribution: 3: good.
Rating: 4: marginally below the acceptance threshold.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper argues that image editing models are better suited than generative models for geometry estimation tasks such as depth. It introduces FE2E, a framework that adapts an editor model for this task using three key techniques: a new "Consistent Velocity" loss, a "Logarithmic Quantization" method to fix precision issues, and a "Cost-Free Joint Estimation" strategy. Despite using roughly 100x less data (71K vs. 62.6M), FE2E outperforms models like DepthAnything on zero-shot benchmarks.

Strengths:
1. The core argument is novel and well-supported. The central idea of the paper, that image editors (which must understand input structure) are better suited to I2I prediction tasks than T2I generators, is intuitive and insightful. The feature visualizations and the ablation studies provide strong evidence for this hypothesis.
2. The technical contributions are targeted and effective. The paper does not just fine-tune a model; it astutely identifies the key mismatches between the editing model and the estimation task (deterministic vs. stochastic; BF16 vs. high precision) and proposes "Consistent Velocity" and "Logarithmic Quantization" as two specific and efficient solutions. The analysis of logarithmic quantization is particularly excellent.
3. The experimental results are impressive. The model achieves SOTA performance on multiple benchmarks (notably a 35% AbsRel improvement on ETH3D). Surpassing data-driven models like DepthAnything with only 0.2% of the data is a powerful demonstration of the data efficiency gained from leveraging the correct inductive bias.

Weaknesses:
1. Prohibitively high computational cost. This is the method's most significant drawback. The model requires 28.9T MACs and 1.78 s for inference, an order of magnitude more than diffusion-based competitors such as Lotus-D (2.65T MACs, 212 ms). While the authors acknowledge this trade-off, the latency and cost make the model impractical for most real-world applications (e.g., robotics, AR/VR) and situate it purely as an academic "best-performer" at any cost.
2. Limited generalizability of the core premise. The paper's foundational claim that "editors are better than generators" is supported only by a direct comparison between Step1X-Edit and FLUX. While these share a DiT architecture, this is a single data point (n=1 pair). The conclusion is an overclaim, as it is unclear whether the finding would generalize to other model pairs, particularly the widely used U-Net architectures (e.g., InstructPix2Pix vs. Stable Diffusion). The claim is only truly supported for this specific DiT-based model family.
3. Unexplored data scaling properties. The paper's primary narrative is built on data efficiency, showing SOTA results with only 71K images, yet it explicitly avoids exploring how FE2E itself performs when scaled to larger datasets. This is a critical omission. Does the model's performance saturate quickly (making it truly data-efficient), or would it also benefit massively from the 62.6M images used by competitors? This unanswered question makes it difficult to fully assess the model's true data-efficiency profile.
4. Conflated ablation study. The ablation study does not cleanly disentangle the impact of the three main contributions. For instance, "Logarithmic Quantization" appears to provide the single largest performance boost, yet its effect is only measured after "Consistent Velocity" (CV) and "Fixed Start" (FS) are already applied. There is no baseline showing the effect of DirectAdapt + LogQuant alone. This makes it difficult to ascertain the true individual importance of each component, and it suggests that the quantization strategy might be the dominant factor in the performance gain, more so than the novel flow matching loss.
5. Weak rationale and marginal gain for joint estimation. The "Cost-Free Joint Estimation" is presented as a key contribution, but its justification is weak and its impact is minimal. The paper does not provide a strong intuition for why supervising the output p_l (which corresponds to the conditional latent $z^x$) should work. Furthermore, the empirical gain from this component is marginal: adding joint estimation only improves the ETH3D AbsRel from 3.9 to 3.8. This suggests it is a minor refinement rather than a significant contribution.

Questions:
1. Given the model's massive computational overhead, have the authors explored any model compression techniques, such as distillation or pruning? Is it possible to distill FE2E into a more lightweight U-Net architecture (similar to Lotus-D) to achieve practical inference speeds while retaining most of the accuracy benefits?
2. Can the authors comment on why data scaling experiments were not performed? What is your hypothesis on how FE2E would perform if it were also trained on the 62.6M images used by DepthAnything?
3. To better understand the contributions, could you provide results for an ablation that applies Logarithmic Quantization directly to the DirectAdapt baseline? This would help clarify whether the quantization strategy or the consistent velocity objective provides the majority of the performance gain.
4. Could the authors elaborate on the intuition behind the "Cost-Free Joint Estimation"? Why does supervising the output side p_l, which corresponds to the conditioning latent $z^x$, help the model learn? Did you test other, more traditional joint estimation methods (e.g., adding two separate prediction heads after the DiT) as a baseline? (A sketch of the token-level mechanism in question follows this review.)

EditLens Prediction: Fully AI-generated
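Illustration for the joint-estimation question above (question 4): a minimal sketch of the token-concatenation mechanism as the reviews describe it, where the DiT produces outputs at both the condition-token positions and the target-token positions and each stream is supervised with a different modality. The class name, shapes, and the assignment of modalities to positions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class JointGeometryHead(nn.Module):
    """Hypothetical wrapper around an (unspecified) editing DiT.

    Assumes `dit` maps a token sequence of shape (B, L, C) to an output of the
    same shape; everything here is illustrative, not FE2E's actual code.
    """

    def __init__(self, dit: nn.Module):
        super().__init__()
        self.dit = dit

    def forward(self, image_latent: torch.Tensor):
        B, N, C = image_latent.shape
        zero_target = torch.zeros_like(image_latent)            # fixed starting point z0 = 0
        tokens = torch.cat([image_latent, zero_target], dim=1)  # one sequence: global attention mixes both halves
        out = self.dit(tokens)                                   # single forward pass
        # Both halves of the output are supervised, one modality each, so the second
        # prediction reuses activations that would otherwise be discarded ("cost-free").
        normal_pred = out[:, :N]  # read one modality off the condition-token positions (p_l)
        depth_pred = out[:, N:]   # and the other off the target-token positions
        return depth_pred, normal_pred
```

Under this reading, the reviewer's question is why supervision applied at the condition-token positions (p_l) helps rather than interfering with the conditioning signal itself.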
Review 2: From Editor to Dense Geometry Estimator

Soundness: 3: good. Presentation: 3: good. Contribution: 3: good.
Rating: 4: marginally below the acceptance threshold.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This submission introduces FE2E, a novel framework that pioneers adapting DiT-based image editing models (e.g., Step1X-Edit) to zero-shot dense geometry prediction (depth/normal estimation), addressing limitations of existing approaches built on T2I generative models. Motivated by the misalignment between T2I models and I2I dense prediction, the authors systematically show that editing models possess inherent structural priors: their fine-tuning "refines" rather than reshapes features, enabling more stable convergence and higher performance than generative counterparts (e.g., FLUX). To adapt editors to deterministic geometry tasks, FE2E reformulates the flow matching loss into a fixed "consistent velocity" objective to eliminate trajectory errors, uses logarithmic quantization to resolve BF16 precision conflicts, and leverages the DiT's global attention for cost-free joint depth/normal estimation. Trained on only 71K synthetic samples (Hypersim + Virtual KITTI), FE2E achieves SOTA results: a 35% AbsRel reduction on ETH3D, 10% on KITTI, and it outperforms data-hungry models like DepthAnything (62.6M data) across 5 depth / 4 normal benchmarks. Ablations confirm each component's value; e.g., consistent velocity cuts ETH3D error by 10%, and joint estimation enhances cross-task supervision.

Strengths:
1. The work conducts rigorous comparative empirical experiments to substantiate that image editing models are more suitable priors for dense geometry prediction than traditional image generation models, an intuition-aligned conclusion. By selecting Step1X-Edit (editor) and FLUX (generator, sharing the DiT architecture for fairness) as baselines, the authors analyze feature evolution, training dynamics, and final performance: early-stage editor features already align with the input geometric structures (vs. unstructured generative features), editor fine-tuning "refines" rather than reshapes features (avoiding the generative models' unstable convergence and loss bottlenecks), and editing-based models consistently outperform generative counterparts (e.g., Step1X-Edit-based models reduce ETH3D AbsRel to 3.8 vs. 4.5 for FLUX-based ones). These experiments not only validate the intuitive advantage of editing models for I2I dense tasks but also provide clear, reproducible evidence to support the paradigm shift from generators to editors.
2. To resolve the precision conflict between image editors' native BF16 format (sufficient for RGB but inadequate for high-precision depth/normal estimation), the authors propose a logarithmic quantization scheme, an innovative solution tailored to the needs of geometric prediction. Unlike uniform quantization (which incurs a roughly constant ~0.16 m error, i.e., 1.6 AbsRel at 0.1 m) or inverse quantization (which fails at long range, e.g., 39 m and 78 m become indistinguishable), logarithmic quantization applies a logarithmic mapping with percentile-based normalization, maintaining a balanced, low relative error (AbsRel ≈ 0.013) across near (0.1 m) and far (80 m) ranges. This design preserves BF16's computational efficiency (avoiding FP32's cost) while eliminating precision-related artifacts, directly contributing to FE2E's SOTA performance (e.g., a 19% KITTI AbsRel reduction vs. uniform quantization). This is also useful for other research that needs to use FP16 models as priors. (A sketch of this error analysis follows this review.)

Weaknesses:
1. The paper's formulation of the "consistent velocity" training objective, while framed as an innovation, lacks novelty and ultimately reduces to an operation already mentioned in prior work (e.g., Lotus). The authors reformulate the editor's original flow matching loss by fixing the starting point $t_0 = 0$ and enforcing velocity independence from time $t$, leading to the inference simplification given in Sec. 3.3. This derivation effectively equates the final geometric latent to the DiT's direct output conditioned on the input image's latent, a result functionally equivalent to directly predicting the target (e.g., depth/normal) from the input rather than learning a "flow" from a noise origin. Critically, this "fixed starting point + direct output" logic was already explored in Lotus (He et al., 2024), where the model leverages conditional inputs to predict geometric targets without relying on stochastic noise trajectories. The paper fails to acknowledge this overlap or to demonstrate how its "consistent velocity" differs from such prior deterministic prediction paradigms, weakening the novelty of this component.
2. The "consistent velocity" framework also suffers from inconsistent input handling for joint depth and normal estimation, as visualized in Fig. 2 and contradicted by the paper's own formulas. Per the formulation, the flow should start from a fixed origin for all geometric targets (depth and normal). However, Fig. 2 shows depth being derived from the RGB image tokens while normal is derived from the fixed zero input, a discrepancy that violates the "consistent" premise of the velocity objective. This inconsistency raises critical questions: if the framework claims to unify geometric prediction via a fixed starting point, why are depth and normal treated differently in input sourcing? Further, the paper provides no experimental validation clarifying how the input choices affect performance, an oversight that undermines the credibility of the "consistent velocity" design. For instance:
- There is no test of swapping inputs (e.g., depth from zero, normal from RGB tokens) to verify whether the current design is optimal, or whether the input-source difference drives the performance gap between depth and normal results.
- The paper does not explore uniform input conditioning (e.g., using RGB for both depth and normal, with positional encoding (PE) to distinguish tasks) to test whether consistency in inputs improves joint-estimation synergy.
- There is no ablation on learnable input tokens (e.g., analogous to LRM (Liu et al., 2023)) to compare fixed zero inputs against adaptive tokens, which could reveal whether the fixed origin is truly superior or merely a suboptimal simplification.
Without these experiments, the paper cannot substantiate that its "consistent velocity" design is robust, or that its input discrepancy does not bias results, leaving the validity of the core objective in question. Based on the above, I think the authors should run the following experiments in the rebuttal: 1. swapping inputs; 2. uniform input conditioning; 3. learnable input tokens as input.

Questions: Same as the weaknesses. Since I doubt whether the "consistent velocity" design is robust, I give this paper a 4. The paper as a whole is good and provides several useful conclusions, so if the authors can resolve the problems I raised, I will raise my score and lean toward accepting the paper.

EditLens Prediction: Moderately AI-edited
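The error figures this review cites (a roughly constant ~0.16 m error for uniform quantization, hence AbsRel 1.6 at 0.1 m, versus AbsRel around 0.013 for the logarithmic mapping) can be reproduced from BF16's 7-bit mantissa alone. Below is a minimal sketch assuming a [0.1 m, 80 m] metric range mapped onto [-1, 1] and taking the BF16 step size at the stored code as the attainable depth resolution; the constants and function names are illustrative, and the percentile-based normalization is omitted.

```python
import math

D_MIN, D_MAX = 0.1, 80.0  # assumed metric depth range in metres

def bf16_step(x: float) -> float:
    """Gap between adjacent BF16 values just below |x| (7 explicit mantissa bits)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    if abs(x) == 2.0 ** e:  # exact power of two: use the (smaller) gap on the lower side
        e -= 1
    return 2.0 ** (e - 7)

def uniform_code(d: float) -> float:
    """Linear map of metric depth onto [-1, 1]."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0

def log_code(d: float) -> float:
    """Logarithmic map: equal code steps correspond to equal relative depth steps."""
    return 2.0 * math.log(d / D_MIN) / math.log(D_MAX / D_MIN) - 1.0

for d in (0.1, 1.0, 10.0, 80.0):
    # Convert the BF16 step of the stored code back to metres via |d(depth)/d(code)|.
    res_uni = (D_MAX - D_MIN) / 2.0 * bf16_step(uniform_code(d))
    res_log = d * math.log(D_MAX / D_MIN) / 2.0 * bf16_step(log_code(d))
    print(f"depth {d:5.1f} m | uniform: {res_uni:.3f} m (AbsRel {res_uni / d:.3f})"
          f" | log: {res_log:.4f} m (AbsRel {res_log / d:.4f})")
```

Under these assumptions the uniform code resolves depth no finer than ~0.16 m everywhere (AbsRel 1.6 at 0.1 m), while the logarithmic code keeps AbsRel in roughly the 0.007 to 0.013 band from 0.1 m to 80 m, consistent with the numbers quoted above.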
Review 3: From Editor to Dense Geometry Estimator

Soundness: 4: excellent. Presentation: 3: good. Contribution: 3: good.
Rating: 6: marginally above the acceptance threshold.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces FE2E (From Editor to Dense Geometry Estimator), a novel framework for monocular depth and normal estimation. The core hypothesis is that pretrained image-to-image (I2I) *editing* models serve as a more suitable foundation for this I2I task than common text-to-image (T2I) *generative* models. The authors support this by comparing an editor to its generative base, showing that the editor possesses better-aligned geometric features. To adapt the DiT-based editor, the paper proposes three key technical contributions: (1) a "Consistent Velocity" flow matching objective with a fixed origin, which simplifies the deterministic prediction task; (2) a "Logarithmic Annotation Quantization" scheme to resolve the precision conflict between the model's native BFloat16 format and the high-precision demands of geometry estimation; and (3) a "Cost-Free Joint Estimation" method that leverages the DiT's architecture to predict depth and normals simultaneously without extra computation. Experiments show that FE2E achieves state-of-the-art zero-shot performance.

Strengths: The paper's primary strength lies in its novel and well-motivated premise of using image editing priors for dense estimation. This I2I-for-I2I paradigm shift is a novel conceptual contribution, and the claim is well supported by the systematic comparative analysis in Section 3.1, which convincingly illustrates that the editor model's features are inherently better structured for this task and lead to a more stable finetuning process. The technical contributions designed to bridge the gap between editing and estimation are solid and effective. The analysis of the BF16 precision conflict (Sec. 3.4) is particularly thorough, and the proposed logarithmic quantization is a clever and necessary solution. The reformulation of the flow matching loss for a deterministic task (Sec. 3.3) and the computationally elegant joint-estimation strategy (Sec. 3.5) are also strong contributions. These elements combine to produce impressive state-of-the-art zero-shot results across multiple benchmarks, validating the data-efficient "From Editor to Estimator" approach.

Weaknesses:
W1: Line 84: "However, these generative models are initially designed for T2I generation and lack the ability to capture the geometric cues from the absent image inputs." This motivation seems overstated. Recent T2I-based approaches (e.g., Marigold) have demonstrated strong robustness in depth estimation, suggesting that T2I priors can, in practice, encode meaningful geometric cues. We recommend softening the claim and positioning the contribution as complementary.
W2: In Sec. 3.1, we suggest that the authors briefly introduce the background and overall task setting of the experiments before describing the experimental setup in detail. Providing this high-level context would help readers better understand the paper.
W3: The "Auxiliary Dispersion Loss" ($\mathcal{L}_{disp}$) is only introduced in Appendix A.1. Table 5 demonstrates its non-trivial performance contribution, and it is part of the final training objective. Omitting it from the main methodology section (Sec. 3) makes the description of the final, best-performing model (ID8) incomplete and potentially misattributes the source of its full performance.
W4: The analysis of the "cost-free joint estimation" mechanism (Sec. 3.5) is somewhat shallow. The paper states that the DiT's global attention "naturally allows" for information exchange and that this enables "mutual enhancement", but it does not provide a deeper analysis (e.g., attention visualization) to show *how* or *where* this information is exchanged and leveraged. The empirical gain from this component (ID6 vs. ID8 in Table 4) appears positive but modest, making it hard to assess the true impact of this synergy.

Questions:
Q1: How dependent are the proposed contributions, particularly the "Consistent Velocity" objective (which simplifies inference to a single step) and the "Cost-Free Joint Estimation" (which relies on a specific DiT input concatenation), on the DiT architecture? Could these ideas be generalized to other editor architectures, such as those based on U-Nets?
Q2: I am a bit confused by the "consistent velocity" reformulation in Section 3.3. The original flow matching (Eq. 2) predicts the velocity $f_{\theta}(z^{x}, z_{t}^{y}, t)$, which is time-dependent. The new formulation removes the time dependency (Eq. 4) and the $z_{0}^{y}$ dependency (Eq. 5), ultimately simplifying inference to $z_{1}^{y} = f_{\theta}(z^{x})$. This seems to imply that the model is no longer a flow-matching model but a simple regressor. Could the authors clarify whether this interpretation is correct and why the method is still framed as "flow matching"? (A sketch of this simplification follows the review.)

EditLens Prediction: Lightly AI-edited
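A brief reconstruction of the simplification Q2 asks about, assuming the standard rectified-flow interpolation $z_t^y = (1-t)\,z_0^y + t\,z_1^y$; the paper's exact Eqs. 4-5 are not reproduced here, so the intermediate steps are an assumption rather than a quotation.

```latex
\[
\begin{aligned}
  &\text{flow matching (Eq.~2):} && f_\theta(z^x, z_t^y, t) \approx v = z_1^y - z_0^y,
     \qquad z_t^y = (1-t)\,z_0^y + t\,z_1^y,\\
  &\text{fix the origin } z_0^y = 0: && z_t^y = t\,z_1^y,\qquad v = z_1^y
     \quad\text{(target velocity independent of } t \text{ and } z_0^y),\\
  &\text{one-step inference:} && z_1^y = z_0^y + \int_0^1 f_\theta\,\mathrm{d}t
     \;\approx\; f_\theta(z^x).
\end{aligned}
\]
```

With a fixed origin and a constant target velocity the "flow" collapses to a single conditional regression from $z^x$ to $z_1^y$, which is the reading behind Q2 (and behind the Lotus-related novelty concern raised in Review 2).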
Review 4: From Editor to Dense Geometry Estimator

Soundness: 1: poor. Presentation: 3: good. Contribution: 2: fair.
Rating: 2: reject.
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: This paper introduces FE2E, a novel framework for zero-shot monocular dense geometry prediction built by adapting a Diffusion Transformer-based image editing model. The central premise of this work is that image editing models are more suitable for dense geometry prediction tasks than noise-to-image generative models. The authors reformulate the training objective with a Consistent Velocity loss and introduce Logarithmic Quantization to convert the editor into a deterministic estimator for depth and surface normals. FE2E achieves state-of-the-art performance on various zero-shot benchmarks.

Strengths:
1. The paper innovatively adapts a pre-trained image editing model for dense geometry prediction, rather than a noise-to-image generative model as in other diffusion-based methods.
2. The paper achieves superior quantitative results across multiple benchmarks.

Weaknesses: Although the final performance is promising, the paper still lacks solid justification for its key designs. The statement that "editing models possess inherent structural priors" is empirical and insufficient to support acceptance.
1. The core claim that "editing models are fundamentally better for dense geometry prediction" is intuitive, but the evidence provided is weak.
- Step1X-Edit is fine-tuned from FLUX; is the performance gain due to Step1X-Edit's additional fine-tuning and data resources?
- In Fig. 3, what is the specific value of the time-step $t$ used for the FLUX-based method? If a very small $t$ (e.g., $t = 0$) is used, the DiT's input in the FLUX model would be close to the clean image, which makes the FLUX-based method also operate like an image-to-image translation in your setting. In this case, does the FLUX-based method still exhibit "abstract and unstructured" features?
2. Why can the time-step $t$ be removed in Eq. 4? Is there any theoretical evidence? Furthermore, given that the base Step1X-Edit model relies on $t$, how do you handle it during finetuning? Also, why does $z_{0}^{y}=0$ enable greater consistency?
3. Does Fig. 5 consider the input image $z^x$ (the condition of the editing model)? If so, why is there only a single data point shown on the left of Fig. 5(c)?
4. Sec. 3.4 requires a clearer explanation to justify the logarithmic method. Why is the $\pm[0.5, 1.0]$ interval the worst case for BF16, and why does RGB (which is also normalized to $[-1, 1]$) succeed while the depth map fails? The analysis of the proposed logarithmic method also needs a more detailed theoretical explanation of why it is inherently superior to both the uniform and inverse methods. (A sketch of the BF16 spacing argument follows this review.)
5. The cost-free joint estimation is not novel. For example, GeoWizard also connects depth and normal estimation via cross-domain self-attention, and DICEPTION also exploits the DiT's strengths for multi-task prediction.
6. Aside from utilizing the pre-trained editing model, how does your method differ from other single-step generation-based approaches?

Questions: See the weaknesses. If the authors can address my concerns, I am willing to change my score.

EditLens Prediction: Fully human-written
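On weakness 4: one way to see why $\pm[0.5, 1.0]$ is the coarsest region of a $[-1, 1]$ encoding under BF16, and why an 8-bit RGB image is largely unaffected while a metric depth map is not, is the short sketch below. The RGB comparison is my reading of the argument, not a statement from the paper.

```python
import torch

# BF16 keeps 8 exponent bits but only 7 explicit mantissa bits, so within the binade
# [0.5, 1.0) representable values are spaced 2**-8 apart -- the widest spacing anywhere
# strictly inside [-1, 1], hence the worst case for a signal normalised to that range.
for x in (0.502, 0.752, 0.998):
    err = abs(torch.tensor(x, dtype=torch.bfloat16).to(torch.float32).item() - x)
    print(f"{x:.3f} stored as BF16 -> absolute error {err:.5f}")

# An 8-bit RGB channel mapped to [-1, 1] already moves in steps of 2/255 (about 0.0078),
# no finer than the BF16 grid, so BF16 storage costs the image essentially nothing.
# A depth map normalised over an 80 m range, however, needs a code step of roughly 2e-3
# just to resolve ~8 cm, which is finer than the 2**-8 BF16 grid near |code| = 1.
```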