|
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies unified 2D GUI and 3D embodied control with a single MLLM, showing naïve joint training hurts because optimization is synergistic in shallow layers but conflicting in deeper ones. It introduces a layer-heterogeneous MoE that shares shallow blocks but splits deep experts with task-type routing to reconcile domains. It yields consistent gains compared with baselines.
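For readers unfamiliar with the design, a minimal sketch of how I understand the layer-heterogeneous split (module names and sizes are illustrative, not the authors' implementation):

```python
import torch.nn as nn

class LayerHeterogeneousBackbone(nn.Module):
    """Illustrative sketch: shared shallow blocks, task-routed deep experts.
    Names and sizes are hypothetical, not the authors' implementation."""
    def __init__(self, n_layers=24, n_shared=12, d_model=1024):
        super().__init__()
        # Shallow layers: a single set of weights updated by both domains.
        self.shared = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_shared)
        ])
        # Deep layers: one expert per domain, selected by the sample's task type.
        self.deep = nn.ModuleList([
            nn.ModuleDict({
                "gui": nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
                "embodied": nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            })
            for _ in range(n_layers - n_shared)
        ])

    def forward(self, x, task):  # task in {"gui", "embodied"}
        for layer in self.shared:
            x = layer(x)
        for experts in self.deep:
            x = experts[task](x)  # hard routing on the domain label
        return x
```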
* The paper is well-motivated, and the proposed method makes sense.
* It shows improvements over both training separate single-domain agents and naïvely mixing all data into one model.
* The gains on benchmarks are modest and may fall within variance (e.g., 77.1 vs. 74.5 on GUI tasks).
* In Section 4.3, the authors propose a metric called "parameter update similarity". Although they show that the GUI and embodied groups differ, an in-group baseline is needed to support the claim: if two batches from the same domain show the same drop in similarity at deep layers, the conflict would not be specific to cross-domain training and the claim may be incorrect. (A concrete way to run this check is sketched after this list.)
* Table 2 should include a MoE variant without explicit expert separation to test whether improvements stem from MoE capacity rather than task-specific partitioning.
* The appendix should provide fuller experimental details (prompts, representative training samples, and evaluation samples) to support reproducibility.
* The scope is limited to offline GUI control and a single simulated robot task. Including online GUI evaluation and real-robot experiments (e.g., as in Magma) would broaden the contribution.
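For concreteness, the in-group check I am asking for in the second point could be run as follows (hypothetical helper, not from the paper): compute the same per-parameter update similarity for two batches drawn from the same domain and compare it with the cross-domain curve.

```python
import torch
import torch.nn.functional as F

def update_similarity(model, loss_a, loss_b):
    """Cosine similarity between the gradients induced by two batches,
    one value per parameter tensor. Hypothetical helper, not from the paper."""
    grads = []
    for loss in (loss_a, loss_b):
        model.zero_grad()
        loss.backward(retain_graph=True)
        grads.append([p.grad.detach().flatten().clone()
                      for p in model.parameters() if p.grad is not None])
    model.zero_grad()
    return [F.cosine_similarity(ga, gb, dim=0).item()
            for ga, gb in zip(*grads)]

# Expected pattern if the paper's claim holds: GUI-vs-GUI (in-group) similarity
# stays high at all depths, while GUI-vs-embodied similarity drops only in deep layers.
```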
Please refer to the weaknesses section. |
Lightly AI-edited |
|
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a unified multimodal agent that can perform both 2D GUI and 3D embodied tasks. The authors find that naive joint training causes conflicts between the two domains, so they propose a Layer-Heterogeneity Mixture-of-Experts architecture that shares shallow layers and separates deep layers. Trained on large-scale standardized datasets, OmniActor achieves state-of-the-art or competitive performance across both GUI and embodied benchmarks, demonstrating effective generalization between virtual and physical environments.
1. This paper presents a clear insight into the conflict–synergy relationship between GUI and embodied data, supported by empirical analysis.
2. The proposed Layer-Heterogeneity MoE is biologically inspired, effectively balancing parameter sharing and specialization.
3. The authors demonstrate strong experimental results, outperforming both domain-specific and prior generalist models across multiple benchmarks.
1. The text on the left side of Figure 3 is too small to read clearly.
2. Please verify whether LIBERO-90 should be the training dataset. Typically, LIBERO-10 is used as the test set.
3. Table 1 could be improved by including some recent works on LIBERO published in 2025.
4. I suggest that the authors include real-robot experiments to demonstrate the algorithm’s generalization capability in the 3D physical world.
5. It would be valuable to add backbone ablation studies, such as experiments using Qwen-3.
1. Why did the authors choose to fuse 2D and 3D scenes specifically using robot manipulation and GUI grounding tasks? Does this choice lead to excessive data heterogeneity?
2. Could this design cause a cross-domain data imbalance issue, and if so, how is it addressed?
3. In the dataset, are the GUI and robot data quantities balanced? Should more diverse robot datasets be incorporated? |
Lightly AI-edited |
|
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper explores the synergy from training generalist agents on both UI and embodied action data. OmniActor is a generalist agent across embodied and UI tasks that uses a mixture-of-experts architecture in the deep layers to separate UI and embodied knowledge. Empirical results show that OmniActor outperforms both training on a single data source and prior UI-specific or embodiment-specific baselines.
1. The paper addresses the important problem of generalist models between UI and embodiment data. Both domains already have limited amounts of interactive data. Therefore, it's important to leverage the synergies from both to improve performance.
1. The results in Table 1 show that OmniActor outperforms baselines in UI control tasks and closely matches specialized embodiment-specific baselines in embodied tasks.
1. Through the proposed parameter update similarity metric, the paper illustrates in Figure 4 how parameter updates diverge in the deeper layers between the two data modalities. This motivates the intuitive solution of an MoE for the deeper layers.
1. The paper does not sufficiently explore ways of disentangling UI and embodiment knowledge, examining only one MoE approach. For example, is it possible to share only the FFN or only the attention modules, or is it necessary to separate both? This is a key contribution of the work and thus should be analyzed more thoroughly.
1. The paper does not analyze how to select the "deep layer threshold" K value (L468), nor how robust training is to the choice of K. This value is a key hyperparameter that could have a large impact on performance, since it addresses the core problem of conflicting optimization directions. It is not clear how to select K from the visualization in Figure 4 alone: in Figures 4a and 4b, the parameter update similarity for the deeper layers also looks similar, so the cutoff cannot simply be read off the curves. (A simple selection heuristic is sketched after this list.)
1. The paper does not compare to other VLA models in embodied tasks, such as OpenVLA or pi0. These VLA models are more directly comparable to OmniActor since they are also based on a VLM, are more recent, and perform better on the Libero benchmark.
1. For embodied tasks, the paper only evaluates on the Libero benchmark. Likewise, only data from Libero is used for training. Only evaluating on a single embodied benchmark is contrary to the claim that OmniActor is a "generalist agent". A generalist agent should be evaluated on multiple embodiment benchmarks (for example, CALVIN or SimplerEnv).
1. OmniActor's strong performance relative to baselines comes largely from the underlying OmniActor framework rather than from strong cross-domain transfer. OmniActor trained only on single-domain data (GUI or embodied) already outperforms most baselines in Table 1.
1. The paper does not analyze which capabilities transfer between the GUI and embodied domains, only that broad transfer happens on the benchmarks. On which Libero tasks and which GUI tasks does the embodiment data improve performance? This is important for understanding why the cross-domain data is helpful. Likewise, it is important to study which individual subtasks or behaviors the cross-domain data hurts under the default (naively mixed) training, to better understand where OmniActor helps.
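As a concrete illustration of the K-selection concern (second point above), one simple heuristic would be to read K off the cross-domain similarity curve. This is illustrative only, and a sweep over K with validation accuracy would still be needed:

```python
def pick_deep_layer_threshold(cross_domain_sim, tau=0.5):
    """Return the first layer index whose GUI-vs-embodied update similarity
    drops below tau; layers before it stay shared, layers from it onward are split.
    Illustrative heuristic only -- the paper does not state how K is chosen."""
    for k, sim in enumerate(cross_domain_sim):
        if sim < tau:
            return k
    return len(cross_domain_sim)  # no conflict detected: share every layer

# Example with made-up per-layer similarities (not values from Figure 4):
# pick_deep_layer_threshold([0.9, 0.8, 0.7, 0.4, 0.2]) -> 3
```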
1. How does performance vary with different choices of K?
1. How does OmniActor handle varying resolutions between GUI and embodiment data? Other GUI benchmarks, like AndroidWorld, have high resolution images (2400x1080) which are necessary to perceive and interact with all the GUI elements. On the other hand, Libero images are 256x256. L676 says all the images are resized to 448x448, but how would the OmniActor framework train with high-resolution GUI images as well?
1. Why does OmniActor trained with only the GUI data achieve so much higher performance than prior generalist and GUI agents?
1. Why is the transfer from GUI to embodied tasks higher than vice versa? Is it because there is more GUI data?
1. Why use distinct prediction heads for GUI and embodied actions, given that the vocabularies of these action spaces are already disjoint? If the action tokens do not overlap, the token embeddings do not overlap either, so splitting the heads should make no difference compared with keeping a single shared one. |
Fully human-written |
|
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces OmniActor: a unified vision-language-action agent for GUI and Robotic applications.
There are two main contributions in this paper:
1- Solving the common data conflict problem for generalist agents: the authors observe that the conflict produces gradient updates that are consistent in shallow layers but divergent in deeper layers. The suggested solution is to share the parameters in the shallow layers while specializing the experts (weights) in the deeper layers closer to the action decision.
2- Unifying the data and action formats across both GUI and Robotics domains.
The paper includes multiple evaluations on various GUI and robotic benchmarks and with multiple base models, showing a significant uplift and the ability to generalize.
- The paper makes a very insightful observation about the data conflict problem, which is an important challenge for generalist agents.
- The change suggested to solve the data conflict problem is rather simple and easy to implement, and the results show significant improvement.
- The paper is very well written and illustrated.
- The ablation studies and the evaluations with different base models are very well done, which makes me trust the results more.
- The suggested separation method involves a few seemingly arbitrary design decisions (e.g., how to properly choose the T parameter for shared layers, and the granularity of separation/specialization: why separate only GUI and Robotic, but not Mobile and Computer Use).
- I have doubts about how this separation/specialization method would scale to many domains rather than just two.
- The paper does not use any kind of RL to improve reasoning, as other similar systems do. It would be interesting to see how RL could be combined with this finding to produce even better numbers.
- The separation/specialization used here is manual and depends heavily on the similarity statistics: routing checks the instance type (GUI/robotics) and dispatches to the corresponding set of weights. Do you think this could be automated within the neural architecture itself (i.e., made part of the learning process)? |
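To make the last question concrete, a learned router over the deep-layer experts could replace the manual task-type check; a rough sketch of what that might look like (purely illustrative, not a claim that it would work as well as the paper's design):

```python
import torch
import torch.nn as nn

class LearnedDeepRouter(nn.Module):
    """Soft, learned routing over deep-layer experts, replacing the manual
    GUI/robotics switch. Purely illustrative sketch."""
    def __init__(self, d_model=1024, n_experts=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (batch, seq, d_model)
        # Gate on a pooled sequence representation instead of an instance-type flag,
        # so expert assignment (and, with more experts, granularity) is learned.
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)   # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)     # (batch, E, seq, d)
        return (weights[:, :, None, None] * outs).sum(dim=1)        # (batch, seq, d)
```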
Fully human-written |