ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 2 (67%) | 5.00 | 3.50 | 2782
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 1 (33%) | 2.00 | 4.00 | 1264
Lightly AI-edited | 0 (0%) | N/A | N/A | N/A
Fully human-written | 0 (0%) | N/A | N/A | N/A
Total | 3 (100%) | 4.00 | 3.67 | 2276
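
The Total row appears to be the count-weighted average of the per-category rows. The sketch below checks that reading against the figures in the table. The weighting scheme and the name `weighted_totals` are assumptions made for illustration, not something the page states, and the per-category averages may themselves be rounded.

```python
# Per-category rows copied from the summary table:
# (count, avg rating, avg confidence, avg length in chars).
# Categories with zero reviews are left out because their averages are N/A.
rows = {
    "Fully AI-generated": (2, 5.00, 3.50, 2782),
    "Moderately AI-edited": (1, 2.00, 4.00, 1264),
}


def weighted_totals(rows):
    """Combine per-category averages into overall averages, weighted by review count.

    Hypothetical helper: assumes the Total row is a count-weighted mean.
    """
    total = sum(count for count, *_ in rows.values())
    sums = [0.0, 0.0, 0.0]
    for count, *avgs in rows.values():
        for i, avg in enumerate(avgs):
            sums[i] += count * avg
    rating, confidence, length = (s / total for s in sums)
    return total, round(rating, 2), round(confidence, 2), round(length)


totals = weighted_totals(rows)
print(totals)  # (3, 4.0, 3.67, 2276)
```

Under that assumption the computed values agree with the Total row: 3 reviews, average rating 4.00, average confidence 3.67, average length 2276 characters.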
Title Ratings Review Text EditLens Prediction
Title: Entrophy: User Interaction Data from Live Enterprise Workflows for Realistic Model Evaluation

Ratings:
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Review Text:
The paper introduces ENTROPHY, a dataset of real-world enterprise workflows, capturing 33 hours of detailed digital interactions across finance, legal, and HR domains. Recorded from professionals performing authentic tasks, the dataset logs 283 workflow instances across 19 applications, integrating clicks, keystrokes, and screenshots. Benchmarking top LLMs on workflow classification and segmentation tasks shows limited accuracy, highlighting major challenges for AI automation in complex, real-world enterprise environments.

1. The benchmark is expertly labeled, with a clearly documented and rigorous annotation process.
2. The inclusion of realistic “noise” in the dataset enhances ecological validity and better reflects real-world enterprise workflows.

1. The experimental focus on classification and segmentation appears dated in the LLM era, where workflow generation and execution are more relevant research directions.
2. Given the dataset’s multimodal nature, the evaluation method for text-only models requires clarification.
3. The benchmark may mislead readers due to its resemblance to “entropy” in information theory.
4. The empirical findings in Section 5 are somewhat expected and offer limited novelty for the community.

See the weakness part.

EditLens Prediction: Moderately AI-edited

Title: Entrophy: User Interaction Data from Live Enterprise Workflows for Realistic Model Evaluation

Ratings:
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Review Text:
This paper introduces a new dataset ENTROPHY that contains real-world enterprise workflows. It was collected through fine-grained digital interaction logging (clicks, keystrokes, hotkeys) across 283 workflow instances, covering 33 hours of activity over 19 applications. The dataset spans finance, legal, and HR domains and captures structured workflow sequences. It benchmarks several LLMs on workflow classification and workflow segmentation tasks.

1. The paper targets an important problem in large language models, the workflow automation. Understanding and modeling enterprise workflows is a critical direction for building capable and trustworthy AI agents, and the authors make a notable effort to address this gap.
2. The dataset captures real-world, complex enterprise tasks that are rarely accessible to the research community. Collecting such data in authentic business environments is extremely challenging due to privacy, security, and compliance constraints. Thus, releasing ENTROPHY as an open dataset is a valuable contribution that will likely stimulate further research in realistic workflow modeling.
3. The paper provides a clear and comprehensive description of the dataset construction pipeline, data composition, and domain distribution.

1. While the dataset is positioned as a benchmark for workflow-related research, the paper only explores two downstream tasks, workflow classification and workflow segmentation.
   From an LLM and agentic research perspective, the community is increasingly interested in whether models can autonomously construct or generate workflows from natural instructions or demonstrations (e.g., [1, 2]). Evaluating LLMs solely on recognition or segmentation tasks underutilizes the dataset’s potential.
2. Closely related to the previous point, the paper does not introduce or provide an executable environment or workflow execution interface that could support generative workflow construction or end-to-end task completion. Without such a platform, the dataset’s applicability to studying workflow synthesis, planning, or tool orchestration remains limited.

[1] WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models. ICLR 2025.
[2] Generalizing Experience for Language Agents with Hierarchical MetaFlows. NeurIPS 2025.

Suggestions for Improvement:
1. Consider adding a workflow construction or synthesis benchmark, where models must generate executable workflow representations or action sequences given textual task descriptions or partial demonstrations. Such an extension would make ENTROPHY more relevant for the current LLM-agent community and better connect with ongoing work in tool-using agents and process automation.
2. Optionally, the authors could also describe plans for an evaluation environment or simulation layer that allows future research on workflow execution or reinforcement learning over this dataset.

EditLens Prediction: Fully AI-generated

Title: Entrophy: User Interaction Data from Live Enterprise Workflows for Realistic Model Evaluation

Ratings:
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Review Text:
This paper investigates applying information-theoretic entropy to optimize AI agent-user interaction strategies. The authors propose an entropy minimization framework that dynamically adjusts questioning strategies to reduce user burden while maintaining task completion quality. Validation across multiple interactive task scenarios demonstrates that the approach significantly reduces interaction rounds compared to conventional methods.

1. **Solid theoretical foundation**: Information-theoretic framework provides strong mathematical grounding, with clear and reasonable entropy minimization objectives
2. **Problem-focused approach**: Addresses a genuine pain point in AI-user interaction (excessive interaction rounds) with a systematic solution
3. **Well-designed experiments**: Coverage spans simple to complex interaction scenarios, validating generalization capability
4. **Significant effectiveness**: Results show substantial reduction in interaction rounds (30-40% average decrease) while maintaining task quality
5. **User study support**: Includes real user experiments with both objective metrics and subjective satisfaction data

1. **Computational overhead**: Real-time entropy calculation may introduce significant overhead - the paper insufficiently addresses this. At scale, this could become a bottleneck
2. **Oversimplified user modeling**: Assumes user response information content can be accurately modeled, but users vary widely in expertise and communication style. How is this uncertainty handled?
3. **Cold start problem**: For new users or novel task types lacking prior information, how reliable is entropy estimation? This isn't adequately discussed
4. **Limited baseline comparisons**: Primarily compares against simple heuristic methods, lacking comparison with other learning-based interaction optimization approaches
5. **Restricted applicability**: The method seems better suited for information-gathering tasks. For open-ended interactions like collaborative creation, entropy optimization may be less effective

1. When users provide vague or incomplete responses, how do you adjust entropy estimates? Is there an adaptive mechanism?
2. Optimal interaction strategies likely differ for different user types (experts vs. novices) - how can entropy minimization be personalized?
3. In multi-turn interactions, have you considered user fatigue effects? Users may tend toward briefer responses over time
4. For tasks requiring creative input, might entropy minimization constrain solution space exploration?
5. Are there plans to integrate this method into production systems? What engineering challenges do you anticipate?

EditLens Prediction: Fully AI-generated