Forge: Foundational Optimization Representations from Graph Embeddings
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper introduces FORGE (Foundational Optimization Representations from Graph Embeddings), a framework designed to create general-purpose MIP embeddings that can serve multiple combinatorial optimization tasks without relying on solver-specific solutions. FORGE is a vector-quantized graph autoencoder trained on a diverse set of Mixed-Integer Programming (MIP) instances in an unsupervised manner. The key contributions include the creation of a foundational model capable of producing embeddings for optimization instances that can generalize across problem domains and sizes. The paper demonstrates the use of these embeddings in both unsupervised and supervised settings, improving solver performance and outpacing existing learning-based methods in tasks like integrality gap prediction and search guidance. The approach is designed to work with commercial solvers like GUROBI, showing significant improvements in optimization tasks such as cut-generation and variable prediction for search guidance.
Strengths:
1. FORGE leverages unsupervised learning to pre-train embeddings from a wide range of MIP instances, creating general-purpose representations that can be used across multiple optimization tasks and problem sizes without needing optimal solutions or labeled data. This broadens its applicability compared to supervised or task-specific models.
2. The FORGE embeddings show strong generalization capabilities across different problem domains and sizes. The method not only works for one specific task but can be fine-tuned to improve performance on various tasks, such as integrality gap prediction and search guidance, demonstrating versatility and adaptability.
3. FORGE is integrated with GUROBI, a commercial optimization solver, and improves primal gaps for several problem types. The pre-trained embeddings enhance solver performance, reducing gaps and speeding up convergence.
Weaknesses:
1. Although FORGE performs well, the interpretability of the learned embeddings remains unexplored. The paper briefly mentions that some codes may represent local structures such as cliques, but a deeper understanding of how different parts of the embedding space correspond to specific problem features would be valuable.
2. While the model is compact (3.25 million parameters), training on very large datasets might still pose challenges in terms of computational resources. The ability to scale this model to even larger MIP instances (e.g., 100K+ nodes) while maintaining efficiency needs further investigation.
3. The experiments focus on MIP-based optimization tasks, with limited exploration of how FORGE generalizes to other combinatorial optimization problems such as SAT or constraint satisfaction problems (CSPs). Exploring other problem domains would strengthen the claim of general applicability.
4. While the paper shows that FORGE improves GUROBI, it does not discuss how easily it can integrate with other solvers or whether solver-specific adjustments would be necessary. More clarity here would help readers assess its broader applicability in real-world use cases.
Questions:
1. How well do FORGE embeddings generalize to non-MIP combinatorial optimization problems such as SAT or CSP? Could the embeddings be applied effectively in domains beyond MIPs, and what adaptations, if any, would be needed?
2. While the model shows promising results for problems with up to a few thousand variables, how does it scale to extremely large MIP instances (e.g., 100K variables)? Would further optimizations or adjustments to the architecture help maintain efficiency and accuracy?
3. Can the authors provide more insight into the semantic meaning of the learned discrete codes? How can the embeddings be analyzed or visualized to better understand the underlying optimization structures they capture?
4. How does the fine-tuning process work across various domains? Is there any risk of overfitting to a specific problem domain if fine-tuning is not handled carefully, or does pre-training mitigate this issue effectively?
Fully AI-generated

Forge: Foundational Optimization Representations from Graph Embeddings
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes Forge, a framework for learning unsupervised representations of Mixed-Integer Programming (MIP) instances. The core idea is to pre-train a Vector-Quantized Graph Autoencoder (VQ-GAE) on a diverse corpus of MIP instances, creating a discrete "vocabulary" of codes. These codes are used to generate instance-level and variable/constraint-level embeddings. The authors evaluate Forge in two settings: (1) unsupervised clustering of MIP instances, where it shows strong performance, and (2) supervised fine-tuning on two downstream tasks: integrality gap prediction for pseudo-cut generation and variable prediction for search guidance. The paper claims that a single pre-trained Forge model serves as a "foundational" model, generalizing across problems, sizes, and tasks, and demonstrates its utility by integrating predictions into the Gurobi solver.
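For concreteness, my reading of the pipeline in code (a minimal sketch with hypothetical names; the GNN encoder and graph decoder are replaced by linear layers, and this is not the authors' implementation):

```python
# Minimal sketch of a VQ-GAE forward pass; all names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQGAESketch(nn.Module):
    def __init__(self, in_dim=10, hid_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hid_dim)     # stand-in for the GNN encoder
        self.codebook = nn.Embedding(codebook_size, hid_dim)
        self.decoder = nn.Linear(hid_dim, in_dim)     # stand-in for the graph decoder

    def forward(self, x):
        z = self.encoder(x)                           # continuous node embeddings
        dists = torch.cdist(z, self.codebook.weight)  # distance to every code
        codes = dists.argmin(dim=1)                   # discrete "vocabulary" ids
        z_q = self.codebook(codes)                    # quantized embeddings
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        return self.decoder(z_q), codes

model = VQGAESketch()
x = torch.randn(32, 10)                               # 32 nodes, 10 raw features
recon, codes = model(x)
loss = F.mse_loss(recon, x)                           # reconstruction loss
```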
Strengths:
1. The overarching goal of creating a general-purpose, unsupervised representation for MIP instances is highly relevant and ambitious.
2. The results in Section 4 are compelling. Forge embeddings clearly and effectively cluster unseen MIP instances from different problem families and difficulties, significantly outperforming the provided baselines. This is a strong, demonstrable capability.
Weaknesses:
1. The term "Foundational" in the title and throughout the paper is a significant overstatement. With a model size of 3.25M parameters trained on only 2,850 instances, Forge is orders of magnitude smaller in scale and scope than true foundational models (e.g., BERT, GPT). Its generality is confined to the MIP domain, and it functions more as a well-pre-trained, general-purpose graph encoder for MIPs. This framing risks creating misplaced expectations.
2. The experimental design for the supervised downstream tasks is the paper's most critical weakness. The primary baseline for solver performance is often the default Gurobi solver, but for a claim of foundational generality, comprehensive comparisons against other *learning-based* methods are essential and largely missing. For integrality gap prediction, the comparison to Li et al., 2025 is conducted on different test sets (Forge on 400 problem types vs. Li et al. on 157), making the claimed superiority difficult to interpret fairly. For search guidance, PS-Gurobi is not the state of the art; a more relevant and stronger baseline for a neural MIP solver would be Apollo-MILP. The absence of a comparison to such a method leaves the reader uncertain about Forge's true standing.
3. The VQ component is central to the method. However, the ablation study in Appendix A.4 shows that clustering performance (NMI) is largely insensitive to the codebook size. This raises a fundamental question: Is the added complexity of VQ justified? A simpler global pooling mechanism might achieve similar results, and the paper does not provide conclusive evidence to the contrary.
4. The description of the initial node features is confusing. The text states: "For each node, we obtain a vector of size 10", yet it separately lists 4 features for constraint nodes and 6 for variable nodes. It is critical to clarify whether all nodes are represented with a 10-dimensional vector in which 4/6 dimensions are zero-padded by node type, or whether the feature sets are kept separate in the model (see the sketch after this list for one possible reading).
5. The improvement from augmenting PS-Gurobi with Forge embeddings (Figure 6) is compelling but potentially misleading. The ablation in Appendix A.8 shows that adding *random vectors* also improves performance, likely due to the increased input dimensionality and thus model capacity; this leaves open whether the gain comes from the semantic content of the Forge embeddings at all.
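To make weakness 4 concrete, here is one possible reading of the feature construction; the layout and padding scheme are my assumptions, not the paper's specification:

```python
# Hypothetical 10-D node feature layout: constraint nodes fill dims 0-3
# and zero-pad the rest; variable nodes zero-pad dims 0-3 and fill 4-9.
import numpy as np

def node_features(node_type: str, feats) -> np.ndarray:
    v = np.zeros(10)
    if node_type == "constraint":    # 4 constraint features
        v[:4] = feats
    else:                            # 6 variable features
        v[4:] = feats
    return v

print(node_features("constraint", [1.0, 0.0, 2.5, -1.0]))
print(node_features("variable", [0.0, 1.0, 0.0, 5.0, 1.0, 0.0]))
```

If instead the two feature sets are processed by separate type-specific encoders, the 10-D claim in the text should be reworded.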
Questions:
1. Given the model's relatively small scale (3.25M parameters, 2.8k training instances), do you agree that the term "Foundational" might be an overstatement? How would you define the minimum criteria for a "foundational model" in combinatorial optimization?
2. Can you provide more direct evidence that VQ is crucial for the *supervised downstream task* performance?
3. For task 1, the supervised fine-tuning datasets are quite small. Have you experimented with increasing the amount of supervised data?
4. Regarding the 10-dimensional node feature vector: could you please explicitly detail its construction? Is it a unified 10-D vector for all nodes (with zero-padding), or are variable and constraint node features processed separately by the GNN?
5. The need to design custom loss functions and prediction heads for each task suggests that the representation is not directly transferable without significant adaptation. What is your vision for how a practitioner would use Forge for a *new*, third task (e.g., cut selection or branching)? Does this process still require similar task-specific engineering, and if so, how does this align with the concept of a foundational model?
Heavily AI-edited

Forge: Foundational Optimization Representations from Graph Embeddings
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents an unsupervised framework, **FORGE**, designed to learn foundational representations for Mixed-Integer Linear Programming (MILP) problems through bipartite graph embeddings and vector quantization. The work is novel and technically sound, addressing a long-standing gap: how to construct general-purpose optimization representations without relying on solvers or labeled data. It makes a meaningful contribution to the intersection of combinatorial optimization and representation learning.
Strengths:
1. The paper is pioneering in its attempt to build a “foundational model” for optimization, similar to those in natural language processing (NLP) and computer vision (CV). By combining bipartite graph structures with a vector-quantized autoencoder, it achieves general structural encoding for optimization instances.
2. The model’s training does not depend on optimal solutions or solver outputs, enabling large-scale unsupervised pretraining across diverse problem distributions.
3. The authors present systematic experiments on both unsupervised clustering and supervised downstream tasks (integrality gap prediction and search guidance). They use multiple datasets, including MIPLIB, D-MIPLIB, and strIPlib, and show performance improvements when integrated with the GUROBI solver.
4. A single pre-trained model demonstrates effective transfer across different problem types, sizes, and tasks, suggesting strong generalization potential.
Weaknesses:
1. **Limited Interpretability**: The discrete “optimization vocabulary” generated through vector quantization lacks semantic clarity, making it difficult to interpret the structural meaning of each code. This limitation affects the model’s scientific transparency and trustworthiness. Providing insights into relational structures between tokens, similar to language model embeddings, would strengthen the paper’s claims.
2. **Restricted Solver Integration**: The embeddings are currently applied only once, during the initialization phase (for pseudo-cut generation or variable hinting), without dynamic interaction with the branch-and-bound process (see the sketch after this list for the one-shot pattern I have in mind). This raises questions about the model’s actual impact during the solving process.
3. **Narrow Task Coverage**: The experiments focus on two downstream tasks and do not evaluate classic optimization tasks such as variable selection or node selection. This limitation weakens the generality claim of the model.
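To illustrate the one-shot integration pattern referred to in weakness 2, a minimal gurobipy sketch (the model and hint values are hypothetical; the point is that hints are set once before `optimize()` and never revisited):

```python
# One-shot variable hinting in Gurobi; the model and hint values are
# hypothetical, not taken from the paper.
import gurobipy as gp

m = gp.Model()
x = m.addVars(3, vtype=gp.GRB.BINARY, name="x")
m.setObjective(x.sum(), gp.GRB.MAXIMIZE)
m.addConstr(x[0] + x[1] <= 1)

# Hints from (hypothetical) embedding-based predictions, injected once
for i, val in {0: 1.0, 1: 0.0}.items():
    x[i].VarHintVal = val

m.optimize()  # no further interaction with branch-and-bound
```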
Questions:
1. Have you conducted an analysis of the semantics of the vector-quantized codes? For example, do these codes correspond to specific constraint motifs or variable structures? Providing visualizations or examples of inter-token relationships would be beneficial.
2. Is representing global features as a **frequency vector** appropriate? Since this approach ignores the graph’s edge structure, could it fail to distinguish between graphs that are structurally different but statistically similar? (A toy illustration follows these questions.)
3. In the pseudo-cut experiments (primal gap–time plots), would it not be more appropriate to compare against other pseudo-cut generation methods? Comparing only “with vs. without pseudo-cuts” may be insufficient, as adding cuts naturally tightens the search space and accelerates convergence.
4. Have you performed additional experiments to demonstrate the representational capacity of the tokens? For instance, using GNNs for variable selection could validate the effectiveness of the tokens.
5. Is there an ablation study showing whether downstream performance improvements are truly driven by the learned token representations? For example, do the results degrade if tokens are randomly chosen or slightly offset in the codebook?
6. Could you provide a table summarizing the optimization improvements achieved by the two experiments across different datasets and problem scales? This would help in better understanding the practical impact of your work.
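Regarding question 2 above, a toy illustration of the concern (the per-node codebook ids and edges are invented for the example):

```python
# Two graphs with different edge structure but identical code-frequency
# vectors; codes and edges are invented for illustration.
from collections import Counter

codes_g1 = [3, 3, 7, 9]               # per-node codebook ids, graph 1
edges_g1 = {(0, 2), (1, 3)}           # variable-constraint edges, graph 1
codes_g2 = [3, 3, 7, 9]               # same multiset of codes, graph 2
edges_g2 = {(0, 2), (0, 3), (1, 2)}   # different connectivity, graph 2

print(Counter(codes_g1) == Counter(codes_g2))  # True: same frequency vector
print(edges_g1 == edges_g2)                    # False: different structure
```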
Moderately AI-edited

Forge: Foundational Optimization Representations from Graph Embeddings
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This work proposes a method for learning embeddings for MIP problem instances using graph embeddings. Each MIP is encoded as a bipartite graph, with variables and constraints making up the two partitions, respectively. A GNN embedding is then trained, which is converted to a quantized representation with a codebook, and then decoded to minimize a reconstruction loss. In an unsupervised setting, this approach is demonstrated to effectively cluster unseen problem instances. In a supervised setting, pre-trained FORGE models are fine-tuned for specific problem types and shown to outperform Gurobi without FORGE, as well as SOTA ML methods.
Strengths:
The paper is well written and easy to follow. The motivation and problem space are clearly articulated, and constitute an important area of research. The results are strong.
Weaknesses:
I would appreciate details of the loss function in the paper itself. At the very least, the notation used in the main text should be introduced outside of the appendix.
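For reference, I assume the objective resembles the standard VQ-VAE loss, with $x$ the input features, $\hat{x}$ the reconstruction, $z_e$ the encoder output, $e$ the selected codebook entry, $\mathrm{sg}[\cdot]$ the stop-gradient operator, and $\beta$ the commitment weight (this is my assumption, which the main text should confirm or correct):

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z_e] - e \rVert_2^2 + \beta \, \lVert z_e - \mathrm{sg}[e] \rVert_2^2$$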
Typo on line 338: “results” -> “result”.
The discussion of limitations reads mainly like a list of strengths and future work. Do the authors anticipate any drawbacks to this method? Are there problems that are typically cast as MIP instances that are likely to be difficult using FORGE?
Questions:
How expensive is fine-tuning compared with training comparator ML methods? This tradeoff seems important for those who are considering FORGE compared to a method designed for specific CO problems.
Fully human-written |