|
COGITAO: A Procedural Object-Centric Framework to Study Compositional Generalization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work introduces COGITAO, a procedural, object-centric framework designed to generate millions of unique tasks for evaluating compositional generalization in models by composing 28 atomic transformations on grid-based objects.
Experimental results reveal that despite strong in-domain performance, state-of-the-art models, including specialized architectures like Grid Transformers and recurrent Pondering Looped Transformers, struggle significantly with out-of-distribution generalization, highlighting the challenge of systematic compositional reasoning in novel settings.
- The proposed transformation framework is highly flexible and procedurally designed. It supports the composition of 28 distinct atomic transformations, such as translation, rotation, mirroring, and cropping, applied to objects in grid-based environments, and these transformations can be chained into sequences of arbitrary depth.
- This generator also allows fine-grained control over environmental parameters, including grid size, number of objects, object complexity, and color patterns, which facilitates systematic testing of both compositional generalization and environmental generalization.
- The framework is comprehensive and extendable to Sequential-COGITAO, which outputs intermediate transformation steps, and to COGITAO-RGB, which renders tasks in a natural image format to bridge the gap to real-world vision applications.
- The proposed 28 atomic transformation rules do not apply to real-world concepts such as natural objects. They are only defined for the proposed grid-like inputs.
- To improve clarity, line 111 should direct the reader to Appendices A and B for a discussion of the RGB and sequential frameworks, which are detailed there but not in the main body of the paper. Moreover, the authors do not provide any experimental study of the RGB and sequential settings.
- The experimental settings lack clarity in the main text. For instance, the mathematical formulation of the supervision signal mentioned in line 331 is not explicitly defined. Based on the metric described in Table 1's caption ("test grid accuracy (i.e., % of perfect matches)"), I infer the task is a per-grid-cell classification. This key detail about the learning objective should be stated explicitly earlier in the paper.
- The manuscript's organization could be improved, particularly by providing deeper insights and analysis in the experimental results section. The proposed benchmark is supposed to show why SOTA models fail to generalize to OOD cases, especially with more composition.
- In Figure 1, what is the transformation process for example task A? How is the output grid generated? What is "(mod 9 + 1) color" in line 819?
- Are there any two different transformation processes that will produce the same output grids for the same input grids?
- Where is Table 4.4 in line 345 and line 367? Is it Table 1?
- In Table 1, intuitively, C1 is easier than C3 since C1 tests on depth 2 and C3 tests on depth 3. Why does C3 have better ID and OOD performance than C1 for all the cases in Table 1?
- What are the key differences between COGITAO and CLEVR? |
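To make the question about distinct transformation processes yielding identical outputs concrete, here is a minimal sketch. It assumes NumPy array operations as stand-ins for atomic transformations; the function names `rotate_180`, `mirror_horizontal`, `mirror_vertical`, and `compose` are hypothetical and are not taken from the COGITAO code base.

```python
import numpy as np

# Hypothetical stand-ins for atomic transformations (not COGITAO's actual API).
def rotate_180(grid):
    return np.rot90(grid, k=2)

def mirror_horizontal(grid):
    return np.fliplr(grid)

def mirror_vertical(grid):
    return np.flipud(grid)

def compose(*transforms):
    """Return a function that applies the given transformations left to right."""
    def apply(grid):
        for transform in transforms:
            grid = transform(grid)
        return grid
    return apply

grid = np.array([[1, 2, 0],
                 [0, 3, 0],
                 [0, 0, 4]])

depth_1 = compose(rotate_180)                          # one atomic step
depth_2 = compose(mirror_horizontal, mirror_vertical)  # two atomic steps

# Two different sequences, of different depth, give the same output grid.
same_output = np.array_equal(depth_1(grid), depth_2(grid))
print(same_output)  # True
```

If such equivalences exist among the 28 transformations, a depth-2 test task could coincide with a depth-1 training task, which would matter for how the compositional splits are interpreted.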
Fully human-written |
|
COGITAO: A Procedural Object-Centric Framework to Study Compositional Generalization |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces COGITAO (COmpositional Generalization In Transformations And Objects), a dataset for evaluating compositional generalization in abstract visual domains. Inspired by the ARC-AGI benchmark, COGITAO aims to provide a large-scale, controllable environment that enables scientific study of compositional generalization. The dataset consists of input–output grid pairs, where the output is generated by applying a sequence of transformations (e.g., rotation, translation) to the input. The dataset defines two generalization regimes: (1) compositional generalization, which tests generalization to novel transformation sequences, and (2) environmental generalization, which varies object and grid properties. The authors evaluate four architectures (Vanilla Transformer, Grid Transformer, Pondering Looped Transformer, and LLaDA) and show that they fail to generalize.
1. Originality: The dataset is based on prior datasets such as ARC and PGM, but introduces more diverse transformations (28 atomic transformations in total) and objects.
2. Quality: The authors clearly defined the two generalizations (compositional and environmental) and carefully designed the transformations and generalization splits. The authors also tested diverse architectures.
3. Clarity: The paper is well-organized and clearly written.
4. Significance: COGITAO provides a generative framework for analyzing and improving compositional generalization in vision models. It offers a large-scale benchmark that can guide the development of models for challenging tasks such as ARC.
1. Motivation Clarity: The motivation for why this new benchmark is necessary could be improved. The manuscript argues that ARC’s dataset is too small and heterogeneous for systematic study. However, ARC was intentionally designed to be small to probe generalization without overfitting through large-scale training, and related benchmarks such as PGM (Hill et al., 2019) already operate at million-scale sample sizes. Thus, dataset size alone does not fully justify the need for COGITAO. To clarify its unique contribution, the authors could provide concrete examples of the new scientific questions or analyses that COGITAO enables but prior benchmarks do not.
2. Connection to Broader Context: The paper deliberately focuses on training models (e.g., transformers) via supervised learning and omits large pretrained models or in-context learning, arguing that training details for foundation models are opaque. While this focus is understandable from a scientific control perspective, it leaves unclear how COGITAO relates to the current frontier of model capabilities. Recent progress, such as reports of strong performance by O3 on ARC, raises the question of whether large-scale pretrained systems may already solve the generalization task. If so, what specific research questions remain unresolved that make COGITAO valuable? Does the dataset primarily expose limitations of small architectures trained with supervised loss? The paper would benefit from at least a baseline evaluation or discussion of how large pretrained models perform on COGITAO.
3. Evaluation Metrics: The paper relies on exact grid match accuracy as the sole evaluation metric, but this choice is neither well-justified nor fully aligned with the goal of studying generalization mechanisms. The metric is extremely strict: a one-pixel offset or minor transformation error results in a total failure. The authors should clarify how this metric captures the underlying compositional reasoning ability they aim to evaluate and whether alternative or complementary metrics (e.g., per-object accuracy) might provide a more informative picture of model behavior. In addition, no error analysis is presented to reveal what kinds of transformations or compositional structures cause failure, or how performance degrades with transformation sequence length.
4. Finally, there is no human evaluation. The example tasks appear challenging even to humans. It would be valuable to include a human baseline to contextualize model performance.
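The strictness concern in point 3 can be illustrated with a minimal sketch. It assumes predictions and targets are integer NumPy grids of identical shape; `exact_match_accuracy` and `per_cell_accuracy` are hypothetical helper names, not functions from the paper, and per-cell accuracy is shown only as one possible complementary metric.

```python
import numpy as np

def exact_match_accuracy(preds, targets):
    """Fraction of grids in which every cell is predicted correctly."""
    return float(np.mean([np.array_equal(p, t) for p, t in zip(preds, targets)]))

def per_cell_accuracy(preds, targets):
    """Fraction of individual cells predicted correctly, pooled over all grids."""
    correct = sum(int(np.sum(p == t)) for p, t in zip(preds, targets))
    total = sum(t.size for t in targets)
    return correct / total

# Target: a 2x2 object of color 5 on a 4x4 background of zeros.
target = np.zeros((4, 4), dtype=int)
target[1:3, 1:3] = 5

perfect = target.copy()
off_by_one = np.roll(target, shift=1, axis=1)  # same object, one column to the right

preds = [perfect, off_by_one]
targets = [target, target]

print(exact_match_accuracy(preds, targets))  # 0.5
print(per_cell_accuracy(preds, targets))     # 0.875
```

The one-column offset counts as a complete failure under exact match, while 12 of the 16 cells in that grid are still correct. Neither number alone separates a near miss from a wrong transformation, which is why an error analysis would help.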
1. Dataset Difficulty and Learning Regime: The reported in-domain accuracies are relatively low (e.g., 30% for Transformer, 60% for Grid-Transformer, 82% for PL-TF, and 84% for LLaDA). This raises the question of whether the current training regime truly enables the study of compositional generalization, or whether models are still struggling with basic task learning. Could the authors comment on whether the dataset might be too difficult or insufficiently calibrated for the intended analysis? Why were models trained for only 10 epochs given the modest dataset size?
2. The models were trained on 100K examples—smaller than datasets like PGM, which operate at the million-sample scale. Given this, how did the authors determine that the dataset size was sufficient for convergence and reliable evaluation of generalization?
3. Model Selection: Why were only Transformer-based models evaluated? Have the authors considered testing architectures explicitly designed for relational reasoning, such as Relational Networks? |
Lightly AI-edited |
|
COGITAO: A Procedural Object-Centric Framework to Study Compositional Generalization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes an object-centric procedural generative framework for studying compositional generalization of deep learning models. It features grid-based objects (but also supports "to RGB" rendering for vision models) and 28 atomic transformations that can be freely composed to generate a large number of downstream tasks. Using this framework, the paper presents two major use cases (studying compositional and environmental generalization), establishing multiple benchmarks.
- Generally, one risk of procedurally generated datasets is that most tasks may overlap. COGITAO instead presents non-redundant and non-conflicting (in a loose sense) transformations, which makes it convincingly flexible and composable, if properly used.
- The chosen two use cases (and the resulting benchmarks) are meaningful and interesting to assess the intended generalization capabilities. Moreover, I see another use case (also mentioned briefly by the authors) which could broaden the scope and impact of COGITAO: it can be used to study implicit biases of architectures.
- The paper's writing quality is good, the choice of images and tables helps the presentation, and the contribution is clear.
The paper's contribution is solid. However, there are two major points that require attention.
**W1.**
Although the paper presents a framework to study compositional generalization, I found a complete lack of acknowledgement of the literature on _provable_ object-centric learning and compositional generalization (e.g., see [1,2] and their related works). The only referenced work I could find is a classical work on Slot Attention (Locatello et al., 2020), for which a similar mechanism is implemented inside Grid-TF.
-----------------------
**W2.**
On the same note as **W1.**, since COGITAO is grid-based (and eventually extended to vision), I feel the paragraph "Compositional Generalization in Language" in the Related Works section is out of scope. I'd argue the paper would benefit more from a section dedicated to compositional generalization methods.
-----------------------
**_References:_**
[1] Brady, Jack, et al. "Provably learning object-centric representations." ICML 2023.
[2] Wiedemer, Thaddäus, et al. "Provable compositional generalization for object-centric learning." ICLR 2024.
Thanking the authors in advance for their response, I'd kindly ask them the following questions.
- **(Continuing from W1.)** Since the dataset is akin to the toy settings in [1,2], the paper could benefit from also trying their methods. For instance, [2] provably allows for compositional generalization, although I'm unsure whether all of their assumptions hold in the benchmarks supported by COGITAO.
- Do the authors plan to extend their code base to also include a slot-based differentiable renderer (even one implemented just with finite differences, for instance allowing derivatives of grid cells to be taken with respect to slots, as done in [1,2])? This would make COGITAO also very useful for supporting theoretical research.
- I'd kindly ask the authors to address **W2.** in the Weaknesses section of this review. |
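As a rough illustration of the finite-difference idea raised in the second question, here is a minimal sketch. Everything in it is an assumption for illustration: the soft Gaussian `render` function, the slot layout `(row, col, intensity)`, and the name `finite_difference_jacobian` are hypothetical and do not correspond to COGITAO's discrete grid generator or to the implementations in [1,2].

```python
import numpy as np

def render(slots, height=8, width=8, sharpness=4.0):
    """Soft-render slots onto a grid; each slot is (row, col, intensity)."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    grid = np.zeros((height, width))
    for r, c, a in slots:
        grid += a * np.exp(-sharpness * ((rows - r) ** 2 + (cols - c) ** 2))
    return grid

def finite_difference_jacobian(slots, eps=1e-4):
    """Central differences of every grid cell w.r.t. every slot parameter.

    Returns an array of shape (height, width, num_slots, num_params).
    """
    slots = np.asarray(slots, dtype=float)
    jac = np.zeros(render(slots).shape + slots.shape)
    for k, p in np.ndindex(slots.shape):
        plus, minus = slots.copy(), slots.copy()
        plus[k, p] += eps
        minus[k, p] -= eps
        jac[:, :, k, p] = (render(plus) - render(minus)) / (2 * eps)
    return jac

slots = [(2.0, 3.0, 1.0), (5.0, 5.0, 2.0)]
jac = finite_difference_jacobian(slots)
print(jac.shape)  # (8, 8, 2, 3)
```

A continuous relaxation of this kind is needed because derivatives of hard, discrete grid cells with respect to slot parameters are zero almost everywhere. Whether such a relaxation is compatible with the assumptions in [1,2] is left open here.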
Fully human-written |