|
All in One: Unified Pretraining of GUI Agents via Masked Trajectory Prediction |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a new training paradigm that uses Masked Trajectory Prediction (MTP) to unify training across all the open-source datasets. It also proposes role-aware adapters to address the challenge of data heterogeneity.
- proposes masked trajectory prediction for a unified training paradigm
- gathers open-source datasets and processes them into the UniTraj dataset for MTP training
- uses role-aware adapters to handle data heterogeneity
- The experimental results do not show a strong improvement from MTP training. In Table 1, the performance gain of MTP over direct mixture training is less than 1%. In Table 2, the improvement over the Qwen2.5-VL base model is very likely due to the substantial training data, and even then it is only around 1.4%.
- The experimental results do not demonstrate a strong contribution or application of "pretraining" as defined in the paper. In Tables 2 and 3, even with the substantial training data, Qwen2-VL+MTP still falls behind the MiMo-VL and Qwen2.5-VL base models. Besides, to my knowledge, UI-Genie trained with only 90K open-source trajectories achieves 74.2 on AndroidControl-High and 94.3 on AndroidControl-Low, which is even higher than Qwen2.5-VL+MTP.
- The study lacks experiments on dynamic evaluation systems such as AndroidWorld, AndroidLab, or OSWorld. The latest research generally evaluates agents on such systems, which provide a more realistic evaluation, since static benchmarks such as AndroidControl or Mind2Web have significant drawbacks.
- More agents should be compared or at least listed, e.g., UI-TARS-1.5, GUI-OWL, UI-Venus, and UI-Genie.
- In Table 1, why are two "step acc" columns listed with different numbers? This might be a typo. |
Fully human-written |
|
All in One: Unified Pretraining of GUI Agents via Masked Trajectory Prediction |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of inconsistent training objectives and data heterogeneity in GUI pretraining. It proposes a unified training framework named masked trajectory prediction (MTP). Extensive experiments on four benchmarks verify its effectiveness.
1. This paper tackles a novel problem in GUI pretraining. It is interesting to me.
2. The contribution of MTP is novel.
1. The performance gains over the four benchmarks are relatively limited. Also, it is unclear how the model performs when simply supervised fine-tuned on the UniTraj dataset. Additionally, how does MTP compare with RL-based algorithms, and can MTP be combined with RL, given that supervised fine-tuning-style methods have obvious performance ceilings?
2. The evaluated benchmarks are not that challenging. How does MTP perform in online settings (e.g., AndroidWorld, OSWorld)?
3. Is MTP sensitive to the quality of the training data? Could MTP benefit from further data scaling?
See above |
Fully human-written |
|
All in One: Unified Pretraining of GUI Agents via Masked Trajectory Prediction |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes MTP, a framework for the unified modeling of GUI data. By masking different components of GUI interaction trajectories and incorporating a role-aware adapter learning module, MTP unifies various training objectives and achieves state-of-the-art results on several benchmarks.
1. **Novel Unified Framework:** The paper introduces an innovative framework to uniformly model GUI Agent data. Through masked trajectory prediction and a role-aware adapter learning module, the model learns effectively from diverse types of training data.
2. **Clever Adapter Design:** The role-aware adapter learning module ingeniously trains multiple LoRA adapters to create a routing mechanism, which helps to further mitigate the negative impacts of data heterogeneity.
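To make the routing mechanism in point 2 concrete, below is a minimal sketch of how I read it: each token's hidden state is passed through the LoRA adapter matching its role, on top of the base projection. All names, role ids, and dimensions here are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of role-aware LoRA routing (not the authors' code).
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """One low-rank adapter: x -> up(down(x)) * (alpha / rank)."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale

class RoleAwareLoRA(nn.Module):
    """Wraps a linear layer and routes each token to the adapter matching its
    role (e.g., 0 = instruction, 1 = observation, 2 = reasoning, 3 = action)."""
    def __init__(self, base: nn.Linear, num_roles: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        self.adapters = nn.ModuleList(
            [LoRA(base.in_features, base.out_features, rank) for _ in range(num_roles)]
        )

    def forward(self, x: torch.Tensor, role_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); role_ids: (batch, seq) integer role per token
        out = self.base(x)
        for r, adapter in enumerate(self.adapters):
            mask = (role_ids == r).unsqueeze(-1).to(x.dtype)
            out = out + adapter(x) * mask  # only that role's tokens get this adapter
        return out

layer = RoleAwareLoRA(nn.Linear(768, 768))
hidden = torch.randn(2, 16, 768)
roles = torch.randint(0, 4, (2, 16))
print(layer(hidden, roles).shape)  # torch.Size([2, 16, 768])
```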
1. **Unified Modeling May Not Address Core Data Heterogeneity:** The idea of unified trajectory modeling is interesting but may not solve the fundamental problem of data heterogeneity. The authors classify GUI data into STP, ACP, and MP. While this classification is reasonable from a data format perspective, it doesn't delve into the essential capabilities of a GUI agent. A truly capable GUI agent requires domain knowledge, environmental awareness, and GUI reasoning. A training approach that starts from these core capabilities would be more meaningful than one that merely unifies data formats. A brute-force unification cannot resolve the skewed data distribution for each required capability, especially given that the distribution of open-source GUI agent data is severely imbalanced (e.g., abundant grounding data at 50-100M samples vs. scarce trajectory data around 100K). This introduces deeper issues related to data balancing and quality. For related perspectives, see the UI-Tars series[1,2] and OpenCUA[3].
2. **Misclassification of OS-Genesis Data:** Having thoroughly studied the OS-Genesis paper and its data, I am certain that its synthetic trajectories contain high-level instructions. According to this paper's own definitions, these should be classified as "semantically coherent trajectories." However, they are classified as "logically coherent trajectories," which is clearly incorrect.
3. **Lack of Online Benchmark Evaluation:** The experiments are missing an evaluation on online benchmarks, such as AndroidWorld. A GUI agent's true utility can only be proven by its performance in real-world, dynamic environments.
4. **Typo in Figure 2:** There is a drawing error in Figure 2. In the text portion, the components in $a_2m_3[Mask]$ are not separated by commas.
[1] Qin Y, Ye Y, Fang J, et al. UI-TARS: Pioneering automated GUI interaction with native agents[J]. arXiv preprint arXiv:2501.12326, 2025.
[2] Wang H, Zou H, Song H, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning[J]. arXiv preprint arXiv:2509.02544, 2025.
[3] Wang X, Wang B, Lu D, et al. OpenCUA: Open foundations for computer-use agents[J]. arXiv preprint arXiv:2508.09123, 2025.
1. **Cross-Platform Action Space:** I am puzzled that the UniTraj dataset is said to span "five major operating systems," yet the action space is meticulously designed for Android. I understand some actions can be generalized, but how can the model cover the desktop action space without fundamental actions like `click`, `rightClick`, and `scroll`?
2. **Benchmark Coverage:** Following the previous question, if the data covers five operating systems, the evaluation should also include benchmarks from these different platforms. The current benchmarks are almost all for Android (AndroidControl, GUI-Odyssey, AITZ), with only one for the web (Mind2Web). There is a complete lack of desktop benchmarks. Perhaps the authors could evaluate on an offline desktop benchmark like AGENTNETBENCH[1]? This would be more convenient than online benchmarks like OSWorld or WAA, although an online desktop evaluation would be a more powerful response.
3. **Gains from Role-Aware Adapter Learning:** My understanding is that the role-aware adapter learning module is intended to route heterogeneous data on top of the unified model to further mitigate performance degradation from data diversity. The idea is good, but the experimental results do not show a significant gain. What could be the potential reasons for this?
4. **Suggestions for Figure 2:** I suggest improving Figure 2 by explicitly illustrating the different trajectory types. The formulas can be removed from the figure as they are already present in the text. Additionally, making the visual representations for inst, m, r, and a more distinct would improve clarity.
5. **Clarification on "m":** Perhaps I missed it, but I could not find a clear definition for the component labeled 'm'. Is it intended to represent the screenshot?
6. **Clarification on Lines 264-266:** I do not fully understand the statement in lines 264-266: "trajectories can contain up to six components but often include only a subset." What do these six components specifically represent? Neither the text nor Figure 3 provides a clear explanation.
[1] Wang X, Wang B, Lu D, et al. OpenCUA: Open foundations for computer-use agents[J]. arXiv preprint arXiv:2508.09123, 2025. |
Lightly AI-edited |
|
All in One: Unified Pretraining of GUI Agents via Masked Trajectory Prediction |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a unified framework named Masked Trajectory Prediction (MTP). It consolidates diverse pretraining strategies into a consistent training objective through a masking-based formulation. A role-aware adapter learning module is designed to dynamically route each token to an appropriate optimization path. Experimental results on four benchmarks show the effectiveness of the proposed method.
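To fix my own understanding of the masking-based objective summarized above, here is a minimal sketch: a trajectory is treated as a sequence of per-step components, and different pretraining tasks correspond to masking (and then predicting) different components. All names and the component set are hypothetical, not the authors' implementation.

```python
# Hypothetical illustration of a masking-based trajectory objective (not the authors' code).
MASK = "[MASK]"

def build_mtp_example(trajectory, mask_roles):
    """trajectory: list of per-step dicts with components such as
    'obs', 'reasoning', 'action'. Components named in mask_roles are
    hidden in the input and become the prediction targets."""
    inputs, targets = [], []
    for step in trajectory:
        for role, value in step.items():
            if role in mask_roles:
                inputs.append((role, MASK))
                targets.append((role, value))  # supervision only on masked components
            else:
                inputs.append((role, value))
    return inputs, targets

traj = [
    {"obs": "<screenshot_1>", "reasoning": "open the settings app", "action": "CLICK(120, 340)"},
    {"obs": "<screenshot_2>", "reasoning": "toggle Wi-Fi on", "action": "CLICK(300, 88)"},
]

# Masking only actions recovers an action-prediction objective;
# masking reasoning instead recovers a planning-style objective.
inputs, targets = build_mtp_example(traj, mask_roles={"action"})
```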
1. The proposed MTP establishes a consistent training objective for GUI agents. The experimental results on standard benchmarks show the effectiveness and generalization of MTP.
2. The proposed MTP is simple and effective.
1. While the proposed MTP outperforms direct unified mixture training, the performance gains are marginal. As detailed in Table 1, the improvements range from 0.07% to 0.82%. These minimal gains raise questions about the practical significance of the method.
2. Given the ablation study presented in Table 1, the performance gains observed in Table 2 are likely attributable to the training data rather than MTP itself.
The major concern with this paper is the limited performance gains. Please see the Weaknesses. |
Lightly AI-edited |