Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper introduces Conservative Discrete Quantile Actor-Critic (CDQAC), an offline reinforcement learning algorithm for job shop scheduling (JSP) and flexible job shop scheduling (FJSP). CDQAC aims to learn high-quality scheduling policies directly from static datasets generated by suboptimal heuristics such as random sampling, genetic algorithms (GA), or dispatching rules, without relying on simulated environment interaction. The method combines a quantile-based critic, delayed policy updates, and a dueling network structure to model the return distribution conservatively and stably. Experiments on JSP and FJSP benchmarks show that CDQAC outperforms both existing offline and online RL baselines while maintaining strong generalization to larger instances.
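For context, a standard way to make a quantile critic conservative (following CQL, with the scalar Q-value taken as the mean of the quantile estimates) is the penalty sketched below; whether the paper uses exactly this form is not asserted here.

$$
Q(s,a) = \frac{1}{N}\sum_{i=1}^{N}\theta_i(s,a), \qquad
\mathcal{L}_{\text{cons}} = \alpha\,\mathbb{E}_{s\sim\mathcal{D}}\!\left[\log\sum_{a}\exp Q(s,a) \;-\; \mathbb{E}_{a\sim\mathcal{D}}\big[Q(s,a)\big]\right],
$$

where $\theta_i$ are the $N$ quantile estimates, $\mathcal{D}$ is the offline dataset, and $\alpha$ controls how strongly out-of-distribution actions are pushed down relative to actions observed in the data.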
The empirical results are comprehensive across JSP and FJSP benchmarks, demonstrating scalability and transferability. The finding that random datasets yield superior generalization provides a novel perspective on data diversity and pessimism in offline RL. Overall, the paper is clearly written, methodologically solid, and offers meaningful insights for applying offline RL to combinatorial optimization.
Despite the strong presentation, the empirical evaluation leaves several open concerns.
- The reliance on datasets generated from suboptimal heuristics raises doubts about the true generality of the learned policy, as performance improvements may stem from overly simple environments rather than genuine learning from poor data.
- The comparison set for both offline and online RL baselines is narrow—especially for online methods—and lacks sufficient training details, making fairness unclear.
- The claim that CDQAC trained on random data outperforms fully trained online PPO-based methods is difficult to accept without deeper evidence of equal training conditions or convergence verification.
- Similarly, the demonstration that five training trajectories suffice for good performance may reflect limited state diversity rather than strong sample efficiency.
- The justification for using datasets generated by suboptimal heuristics is not sufficiently supported—how does training on random or poor-quality data yield policies that outperform those derived from expert or online learning methods?
- The experimental comparison includes very few offline RL baselines.
- Section 5.3 lacks details on the training configuration for online RL methods; it is unclear whether the models were fully trained to convergence under comparable computational budgets.
- The result that well-trained online RL performs worse than offline RL trained on static random data seems counterintuitive—additional diagnostic analysis could strengthen the argument.
- The conclusion that five trajectories suffice for strong generalization appears overly optimistic; more analysis is needed to show whether those trajectories capture meaningful dynamic diversity.
- The composition of the state vector is underexplained—what specific scheduling features (machine utilization, remaining processing time, job priority, etc.) are encoded, and how do they influence performance?
- The paper does not compare against classical dispatching rules or metaheuristics widely used in scheduling; including such baselines would improve the relevance and credibility of the results.
Heavily AI-edited

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline reinforcement learning algorithm designed for job shop and flexible job shop scheduling problems. CDQAC learns effective scheduling policies directly from datasets generated by suboptimal heuristics, eliminating the need for simulated environment training. The method leverages a quantile-based critic and delayed policy updates to estimate return distributions and improve stability. Extensive experiments show that CDQAC consistently outperforms both the heuristics that generated the training data and state-of-the-art offline and online RL baselines, achieving high sample efficiency and strong generalization—even when trained on data from random heuristics.
- CDQAC can learn highly effective scheduling policies even when trained only on data generated by random or suboptimal heuristics, demonstrating strong generalization capabilities
- The method eliminates the need for simulated environment interactions, making it practical for real-world scheduling problems where simulators may not be available
- The algorithm consistently outperforms both the heuristics used to generate the training data and state-of-the-art offline and online RL baselines across various scheduling benchmarks
- CDQAC achieves strong results with limited data, highlighting its efficiency and practicality for offline RL in industrial scheduling contexts
I am not sure whether the points below are actual weaknesses of the paper or whether they arise from my lack of knowledge of scheduling problems and the related literature. I am happy to increase the score if they can be addressed, and I will give myself a lower confidence to account for this.
1) The main reported metric is the optimality gap, i.e., the gap to some method that is even better (see the note on its usual definition after these points). The immediate question is of course why not use that better method: if I am not mistaken, it would take 30 minutes to provide a better solution for any of the considered problems, and 30 minutes does not sound like a lot, especially when the RL training time is likely higher (I believe it is not reported).
2) If I assume the better method is for some reason not applicable, e.g., because 30 minutes is too long, then the training time of the RL algorithm probably does not matter as much. The reported solving times of a few seconds for some problems likely cover only inference. I wonder, then, why training time does not seem to matter; perhaps the policy can be reused across many new instances while the "classical" solver must start from scratch every time, but I am not sure.
3) Along the same lines: the main reason heuristics and RL methods are considered in these settings in the first place seems to be that classical approaches do not scale to large problem sizes and complexities. I therefore wonder whether sufficiently large problem instances have been considered in the paper, since every problem is solved better by a classical approach within just 30 minutes, which again does not sound like a lot. (Also, 30 minutes is a relatively weak quantification; I do not know whether this uses the same GPU hardware, and it would be better to report the compute, e.g., FLOPS, used by both methods.) I presume it would make sense to consider a problem size at which the best method "switches", i.e., where the optimality gap becomes negative because the classical approach cannot produce a reasonable solution within 30 minutes while the RL solution can.
Generally, the paper could use a bit more clarification on the constraints and requirements that arise in practice from these scheduling problems; at least for me, some things remain unclear.
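For reference, the optimality gap mentioned in point 1 is typically defined relative to the best solution found by an exact or near-exact reference method (the paper's exact reference may differ):

$$
\text{Gap} = \frac{C_{\max}^{\text{method}} - C_{\max}^{\text{ref}}}{C_{\max}^{\text{ref}}} \times 100\%,
$$

where $C_{\max}$ denotes the makespan and the reference is usually a constraint-programming or MILP solver run with a time limit (e.g., the 30 minutes discussed above).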
see weaknesses
Fully human-written

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents Conservative Discrete Quantile Actor-Critic (CDQAC), an offline reinforcement learning algorithm for Job Shop Scheduling Problems (JSP) and Flexible Job Shop Scheduling Problems (FJSP). The method learns scheduling policies from static datasets without requiring environment interaction, using a quantile critic with dueling architecture and conservative Q-learning to prevent overestimation of out-of-distribution actions. CDQAC employs delayed policy updates and a Dual Attention Network to encode operations and machines. The algorithm demonstrates superior performance compared to offline and online RL baselines across standard benchmarks, achieving sample efficiency with only 10-20 training instances. The research reveals that CDQAC performs best when trained on random solutions rather than expert heuristics, attributed to higher state-action coverage (7.7× more than priority dispatching rules) enabling better solution stitching. The method generalizes to larger problem instances and maintains stability even with 1% of original training data, challenging conventional offline RL assumptions about data quality requirements for combinatorial optimization problems.
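To make the architecture described above concrete, a minimal sketch of a dueling quantile critic head is given below. This is an illustrative reconstruction rather than the authors' implementation: the module names, layer sizes, and fixed action count are assumptions, and in CDQAC the head would sit on top of the Dual Attention Network's operation-machine embeddings rather than a flat state vector.

```python
import torch
import torch.nn as nn


class DuelingQuantileHead(nn.Module):
    """Illustrative dueling head producing N quantile estimates per action.

    A value stream and an advantage stream each predict a full set of quantiles;
    they are combined with the dueling identity so that the advantage is
    zero-mean across actions at every quantile level.
    """

    def __init__(self, embed_dim: int, num_actions: int, num_quantiles: int = 32):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.value_stream = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, num_quantiles)
        )
        self.advantage_stream = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions * num_quantiles),
        )

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        # state_embedding: (batch, embed_dim) -> quantiles: (batch, actions, N)
        value = self.value_stream(state_embedding).unsqueeze(1)      # (B, 1, N)
        advantage = self.advantage_stream(state_embedding).view(
            -1, self.num_actions, self.num_quantiles                 # (B, A, N)
        )
        return value + advantage - advantage.mean(dim=1, keepdim=True)

    def q_values(self, state_embedding: torch.Tensor) -> torch.Tensor:
        # Scalar Q-values used for action selection: mean over quantiles.
        return self.forward(state_embedding).mean(dim=-1)
```

The conservative regularizer and the quantile regression loss would then be computed on these per-action quantiles during training.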
1. A good summary of related work in the area that clearly sets the context and relevance of the proposed work.
2. Learning from offline trajectories, especially lower-quality ones such as those produced by random or genetic-algorithm heuristics, is an important problem and is tackled in the paper.
3. The problem setting is explained clearly, and the evaluations use state-of-the-art algorithms as baselines.
4. The extension from job shop scheduling to flexible job shop scheduling is significant. Handling varying state-action spaces, although not new, is a useful property of the solution; its generalizability has been evaluated and its limitations noted.
1. An ablation study is missing. The proposed solution has many different components, and it is unclear which of them are critical to the overall performance.
2. The data used are synthetic; the experimental results would be stronger if real-world data were used.
1. What are the computational requirements: training time, memory usage, and inference latency?
2. How were the hyper-parameters determined?
3. Why do you need to use the asymmetric Huber loss? In what way is it asymmetric (see the note after these questions for its standard form)? Have you compared it with the standard L2 loss?
4. What is the purpose of the dueling quantile network? Would the algorithm still be performant if it were removed?
5. What is the sensitivity to the number of quantiles you use? What happens if you reduce the quantiles to as few as 3 or 4?
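Regarding question 3: in quantile-based critics, the "asymmetric" Huber loss usually refers to the quantile Huber loss of Dabney et al. (2018); whether CDQAC uses exactly this form is for the authors to confirm.

$$
\rho^{\kappa}_{\tau}(u) = \left|\tau - \mathbb{1}\{u < 0\}\right|\,\frac{\mathcal{L}_{\kappa}(u)}{\kappa},
\qquad
\mathcal{L}_{\kappa}(u) =
\begin{cases}
\tfrac{1}{2}u^{2}, & |u| \le \kappa,\\
\kappa\left(|u| - \tfrac{1}{2}\kappa\right), & \text{otherwise},
\end{cases}
$$

where $u$ is the TD error for the quantile at level $\tau \in (0,1)$. The asymmetry comes from the weight $|\tau - \mathbb{1}\{u<0\}|$: positive and negative errors are penalized differently at each quantile level, which is what allows the critic to recover the full return distribution; a plain L2 loss would instead collapse all quantile outputs toward the conditional mean.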
Fully human-written

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes Conservative Discrete Quantile Actor-Critic (CDQAC), a conservative distributional offline RL method for learning dispatching policies in JSP/FJSP. CDQAC combines a quantile critic with conservative value regularization and delayed policy updates to stabilize learning. The paper reports strong performance on JSP/FJSP benchmarks and the notable observation that policies trained on random datasets can sometimes outperform those trained on near-optimal trajectories.
This paper addresses exploration-free policy learning (offline RL) for large JSP/FJSP instances by combining distributional Q-learning with conservative regularization, a dueling quantile network, and delayed policy updates.
It demonstrates sample efficiency and generalization across instance sizes on FJSP, one of the most challenging combinatorial optimization problems (COPs).
The finding that random datasets can be effective for training is practically meaningful for dataset construction in scheduling.
JSP comparisons are incomplete and relatively weak. While FJSP results include several recent methods, JSP needs stronger baselines. In particular, compare against imitation learning approaches that learn machine–operation pairing policies without a simulator; the following work achieves SOTA on JSP:
- Lee, J. H., & Kim, H. J. (2024). Graph-based imitation learning for real-time job shop dispatcher. IEEE T-ASE.
Modern online RL baselines should be included as JSP baselines (not only in Related Work). For example:
- Park, J., Bakhtiyar, S., & Park, J. (2021). ScheduleNet. arXiv:2106.03051.
- Chen, R., Li, W., & Yang, H. (2022). IEEE TII 19(2):1322–1331.
- Iklassov, Z. et al. (2023). IJCAI, pp. 5350–5358.
These papers report performance on the Taillard (1993) benchmark instances.
The integration of a quantile critic, dueling architecture, conservative regularization, and delayed policy updates is presented, but its novelty and motivation are under-emphasized. If this combination is the first offline approach of its kind in this domain, the authors should state so explicitly, explain what had to be modified to make the components work together, and add ablations isolating each component's effect, not only the quantile critic and dueling head (Table 9).
The encoder adopts the DAN/DANIEL structure (Wang et al., 2023) and its operation/machine features largely as-is.
1. As you note that CDQAC can outperform the heuristic that generated its training data, do policies trained on optimal schedules achieve near-optimal quality on the same instances?
2. Why can policies trained on random datasets outperform near-optimal data? Imitation learning is sensitive to dataset quality, yet CDQAC sometimes shows the opposite trend.
3. Against modern online RL and imitation learning, how close is CDQAC in solution quality?
4. Could you provide learning curves of CDQAC across training-set size and composition (random/PDR/GA/near-optimal mixtures)?
Typos:
- “Learning-based methods for Scheduling Problems” → capitalization (L80)
- “Demirkol 80x20” → it may be 50x20 (L865)
Lightly AI-edited |