Matched-Pair Experimental Design with Active Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The work proposes a new experimental setting for detecting the existence of a treatment effect, along with a new method based on an active learning design to perform this detection task.
**Clarity:** This work does a good job of providing the necessary background for readers with an ML background rather than a clinical one. Additionally, all notations and derivations appear to be correct. The authors acknowledge the usage of LLMs for polishing the writing.
**Significance:**
1. This work theoretically shows that the proposed method has greater label efficiency than passive learning techniques in which trial participants are randomly split into groups.
2. It provides insight into medically relevant problems and considerations that exist within the context of active experimental design. There are also preliminary experiments to validate the effectiveness of their method.
The main weaknesses of the paper stem from a lack of motivational and experimental rigor. The motivational issues translate into a seeming lack of novelty, which may partly be because the introduction and related work are squeezed into fewer than 1.5 pages.
**Novelty:**
I believe the authors do not do a good job of distinguishing their contribution from prior work. For example, they mention the separate field of conditional average treatment effect (CATE) estimation, which they define as quantifying the magnitude of the treatment effect from a limited sample, and claim their work is instead about detecting the existence of a treatment effect. Semantically, both lines of work address the same underlying concept, since the end result is an understanding of how a treatment impacts different subgroups of the population. I understand that CATE estimation may be more of a post hoc analysis than the sequential online setting of this paper. If that is the case, however, it should be possible to analyze how CATE methods perform when molded into a sequential design, which is not explored in the experiments.
The authors also frame their method as an alteration of the Robust CAL active learning approach. Specifically, their method performs an additional sampling step in the positive agreement region of a set of classifiers. This could be a significant contribution, but there is no explicit analysis that distinguishes their method from Robust CAL: there is no intuition about the benefit of sampling from the positive agreement region, and in fact they show that it results in lower label efficiency than the Robust CAL algorithm by itself. Their theoretical analysis compares against passive learning, but an in-depth comparison against Robust CAL would do more to highlight their novelty. This problem is further exacerbated by the lack of a comparison against Robust CAL in the empirical results section.
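To make the kind of analysis I am asking for concrete: both regions are cheap to compute from an ensemble approximation of the version space. Below is a minimal sketch; the random predictions are a hypothetical stand-in for trained classifiers, and treating DIS ∪ POS as the enrollment region is my reading of the proposed variant, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical stand-in for the version space: 0/1 "responder" predictions
# of 10 classifiers on a pool of 12 candidate covariate points.
rng = np.random.default_rng(1)
preds = rng.integers(0, 2, size=(10, 12))

dis = preds.min(axis=0) != preds.max(axis=0)  # DIS: some pair of classifiers disagrees
pos = preds.min(axis=0) == 1                  # POS: every classifier predicts "responder"

# Robust CAL queries labels only in DIS; the variant (as I read it) also
# enrolls from POS, giving DIS | POS as the enrollment region.
print("DIS fraction:", dis.mean())
print("POS fraction:", pos.mean())
print("enrollment fraction:", (dis | pos).mean())
```

Reporting how these two fractions evolve with the label budget would directly expose the cost of the extra POS queries relative to Robust CAL.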
**Motivation:**
In general, I think the novelty issue stems partially from a lack of motivation for the method itself. For example, the opening figure shows that the method enrolls targets from a superset of the target region, but there isn’t an analysis or any intuition that makes clear why sampling over this larger region is better. In fact, I wonder whether this strategy has potential downsides from sampling outside the region where the treatment effect takes place.
There also isn’t specific intuition regarding the choice of building on the Robust CAL algorithm. There are many active learning frameworks, such as BALD, CoreSet, and entropy-based sampling. What is it about Robust CAL specifically that motivated its choice for this particular medical problem? The entire label-complexity analysis is built on that algorithm, yet I am still unsure why it was chosen.
**Experimental:**
The novelty and motivation problems are also highlighted by the choice of experiments. The authors claim their method is an improvement on Robust CAL, but they do not compare against it in any of the experiments; instead they compare against regular MPED, a regression-based active design strategy, and tau-BALD. This limits how much we can assess the novelty of the method, because it isn’t clear how the base strategy would have done. It would also have been natural to compare against standard active learning strategies such as CoreSet, entropy, margin sampling, and others that commonly appear in the literature. It isn’t clear to me why these methods could not have been adapted to this setting to give a fair basis of comparison.
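To make this concrete, the uncertainty-based baselines I have in mind take only a few lines on top of any classifier that exposes class probabilities; a minimal sketch, with toy probabilities standing in for a real model's output on the unlabeled pool:

```python
import numpy as np

def entropy_score(probs):
    """Predictive entropy per point; probs has shape (n_points, n_classes)."""
    eps = 1e-12
    return -(probs * np.log(probs + eps)).sum(axis=1)

def margin_score(probs):
    """Uncertainty as the negative gap between the top two class probabilities."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 0] - top2[:, 1]

# Toy pool: "responder" probabilities for 8 candidate pairs, as any
# classifier's predict_proba would return them on the unlabeled pool.
p1 = np.array([0.05, 0.20, 0.45, 0.50, 0.55, 0.80, 0.90, 0.98])
probs = np.stack([1 - p1, p1], axis=1)

batch = 3  # enroll the most uncertain candidates under each score
print("entropy picks:", np.argsort(entropy_score(probs))[-batch:])
print("margin picks:", np.argsort(margin_score(probs))[-batch:])
```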
I would also have been curious to see an analysis of the selected targets that led to the empirical improvement. For example, how did the samples selected by MPED-RobustCAL differ from those selected by MPED, and can we visually or analytically show which covariates were sampled compared to those from the base algorithm?
I also think certain natural ablation studies are missing from this paper. I understand that it is difficult to acquire data suitable for the MPED task; however, it would be good to perform a more comprehensive assessment with the synthetically generated data. For example, on synthetic data one could vary the treatment-effect threshold and establish some intuition about how the method performs under different gamma values (a sketch of what I mean follows below). It would also have been good to establish how choices like s=0.5 correspond to different real-world clinical conditions.
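As a sketch of the sweep I have in mind (the covariate distribution and the effect surface here are purely hypothetical stand-ins for the paper's synthetic generator):

```python
import numpy as np

# Hypothetical data-generating process: covariates uniform on [0, 1]^2 and an
# assumed treatment-effect surface tau(x) peaked in one corner, so the
# responsive region shrinks smoothly as the threshold gamma grows.
rng = np.random.default_rng(0)
X = rng.uniform(size=(10_000, 2))
tau = np.exp(-8.0 * ((X[:, 0] - 0.8) ** 2 + (X[:, 1] - 0.8) ** 2))

for gamma in (0.1, 0.25, 0.5, 0.75):
    omega = tau > gamma  # membership in the responsive region at threshold gamma
    print(f"gamma={gamma:.2f}: responder fraction = {omega.mean():.3f}")
```

Even this crude sweep would show how quickly the responsive region shrinks with gamma, and hence how the detection problem hardens.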
1. Why wasn’t there a comparison with several active learning strategies? Is there something specific about BALD that admits its usage here but not that of the others?
2. Why wasn’t Robust CAL a comparison in the plots? I felt that would be a natural comparison.
3. Is there a way to intuitively demonstrate the advantages of sampling from both DIS and POS? I don’t fully understand why this implies enclosing the target region. I would also like to see how DIS and POS evolve over training and how this relates to the distribution of covariates, perhaps as a learning-dynamics study of the 10 classifiers in the ensemble.
Fully human-written |
---
Matched-Pair Experimental Design with Active Learning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper presents MPED-RobustCAL, an active learning-based experimental design that demonstrates clear advantages over conventional passive learning and standard MPED approaches. Through extensive experiments on both synthetic and real-world datasets such as ALS and IHDP, the authors show that MPED-RobustCAL achieves higher testing power, improved precision, and faster convergence in identifying regions with high treatment effects. The method is robust across different classifiers and label budgets, and the theoretical analysis is thorough, with proofs supporting the main claims. The inclusion of sensitivity analyses and comparisons with other active learning baselines further strengthens the empirical evaluation.
However, the paper also has some limitations. The performance of MPED-RobustCAL, while generally superior, can be affected by label noise and imperfect ground-truth identification in real-world datasets, which may prevent the true positive rate from reaching its theoretical maximum. Additionally, the learning efficiency of MPED-RobustCAL is sometimes lower than that of the original RobustCAL due to extra label queries required for two-sample testing. The method’s reliance on certain hyperparameters and the need for careful tuning may also limit its practical applicability in some settings. Overall, while the approach is promising and well-supported, further work on reducing sensitivity to noise and hyperparameters would enhance its impact.
Fully AI-generated |
---
Matched-Pair Experimental Design with Active Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes MPED-RobustCAL, an active-learning framework for matched-pair experimental designs (MPEDs) aimed at detecting (rather than estimating) treatment effects when the overall ATE is small but there exists a responsive subregion Ω_γ. The key idea is to recast “find the responders” as a binary classification task.
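In other words, as I read it, the goal is to learn a classifier $h$ with $h(\mathbf{X}) \approx \mathbf{1}\{\mathbf{X} \in \Omega_\gamma\}$ from actively enrolled participant pairs.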
- Kato, Oga, Komatsubara, and Inokuchi, "Active adaptive experimental design for treatment effect estimation with covariate choice," ICML 2024.
This study proposes an innovative approach to detecting treatment effects in matched-pair experimental designs (MPEDs) and provides sufficient theoretical results, supported by empirical analyses.
I am interested in the relationship between treatment-effect estimation and treatment-effect detection. For example, Kato et al. (2024) present an active, adaptive experimental design for efficient estimation of the average treatment effect. Although the experimental settings differ, the two lines of work are closely related in that both study active, sequential designs for treatment effects. Can the method proposed here be applied to that setting, and what is the precise relationship between them?
See Weakness section.
Lightly AI-edited |
---
Matched-Pair Experimental Design with Active Learning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces a framework called Matched-Pair Experimental Design with Active Learning (MPED-RobustCAL), which integrates the agnostic active learning method RobustCAL into treatment-responder detection in matched-pair experimental designs. The motivation arises from the challenge that when the overall treatment effect in a population is small, conventional matched-pair designs may lack sufficient statistical power due to inefficient random sampling. The proposed approach reformulates the identification of high treatment-effect regions as a classification problem and employs an active learning strategy adapted from the RobustCAL algorithm to sequentially select participant pairs for experimentation. The authors provide theoretical guarantees showing that the enrollment region generated by MPED-RobustCAL encloses and converges to the true high-effect region, with improved label efficiency compared to passive learning, while maintaining statistical validity through Type I error control. Experiments conducted on synthetic data and two real-world datasets demonstrate that MPED-RobustCAL achieves higher testing power and true positive rates than conventional matched-pair designs and existing active designs such as regression-based active design and Bayesian active learning methods. The paper presents a theoretically supported and empirically validated method for improving the efficiency and effectiveness of matched-pair experimental designs in selecting regions with high treatment effects under limited experimental budgets.
The paper addresses the need for active learning in matched-pair experimental design, which is a reasonable concern. It introduces a framework that integrates Matched-Pair Experimental Design with the active learning method RobustCAL to improve sampling efficiency. The theoretical analysis offers sound guarantees on label efficiency, convergence, and statistical validity. The experimental results, conducted on both synthetic and real-world datasets, show consistent performance improvements in testing power and true positive rates.
- The novelty of the proposed method should be further clarified, as it primarily adapts existing active learning strategies to the matched-pair setting, with theoretical development largely following prior work on RobustCAL.
- Some assumptions, including those related to the data noise structure and the quality of matching, appear to be idealized and may not fully reflect realistic experimental conditions.
- Several parts of the paper, particularly the explanation of why the proposed method works, are not sufficiently clear for non-expert readers and would benefit from a more accessible presentation.
- It is recommended that the authors provide a more detailed explanation of the novelty of the proposed method, particularly highlighting the key challenges involved in applying RobustCAL to the Matched-Pair Experimental Design (MPED) setting.
- The paper appears to conflate individual outcomes with outcomes conditional on covariates. The hypothesis test in equation (2) seems to concern the equivalence of the outcome distributions $Y \mid a$ conditional on treatment, which are not equivalent to the distributions of potential outcomes, a fundamental issue in causal inference (see the note after this list). Since potential outcome distributions are generally unidentifiable, the paper would benefit from clearer terminology and a more precise explanation of what the key hypothesis test is evaluating.
- Assumption 4.4 is quite strong, as it may not hold in common cases where $\tilde{\eta}(\mathbf{X})$ is continuous in $\mathbf{X}$. It would be helpful if the authors could provide examples or scenarios where this assumption is valid.
- For Algorithm 1, clarification is needed on how the minimum in line 9 is computed. Would performing this search be computationally expensive in practice?
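To spell out the identifiability point raised in the second bullet above: under randomized assignment $A \perp (Y^0, Y^1)$, the marginal distributions of the potential outcomes are identified, since $P(Y \mid A = a) = P(Y^a)$ for $a \in \{0, 1\}$; but the joint law of $(Y^0, Y^1)$, and hence the distribution of the individual effect $Y^1 - Y^0$, is not identified from the observed data $(A, Y)$ alone. Stating explicitly which of these objects the test in equation (2) targets would remove the ambiguity.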
### Miscellaneous
- The phrase *treatment effect size* in line 96 is used ambiguously and may be misinterpreted as referring to the sample size. Replacing it with *value* or *magnitude of the treatment effect* is recommended.
- The notation $k_n$ in line 140 should likely be $k$.
- The term *corresponding outcomes* in line 196 is unclear. Do $(Y^0, Y^1)$ represent the potential outcomes of the unit with covariate $\tilde{\mathbf{X}}$ or of its matched pair $\tilde{\mathbf{X}}'$?
Lightly AI-edited |