When Small Models Team Up: A Weak‑Expert Ensemble Surpassing LLMs for Automated Intellectual‑Property Audits
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper explores the enforcement of Intellectual Property Rights (IPR) on e-commerce platforms. It argues that breaking the identification process down into smaller, partially independent tasks handled by specialized experts is (1) cheaper and (2) better performing than using a single unified foundation model to make the decision. Specifically, the identification process is split among four domain experts plus a final comprehensive expert that, based on the other experts' outputs, makes a single Approve/Reject decision. This pipeline is named the Multi-Agent Audit Framework (MAAF), and two variations of it are proposed: in the first, $\text{MAAF-}\alpha$, foundation LLMs and VLMs serve as the experts, while in the second, $\text{MAAF-}\beta$, specialized smaller models serve the same purpose. The authors also fine-tune a Qwen2.5 model to serve as a dedicated comprehensive expert named IPR-GPT. This model is fine-tuned on the IPR-Audit dataset introduced in the paper and uses audit reversal data augmentation during training to improve robustness and, consequently, performance on the given task. The results show that using smaller specialized experts in $\text{MAAF-}\beta$ almost always outperforms using bigger foundation models as experts in $\text{MAAF-}\alpha$ across a variety of models. Moreover, combining $\text{MAAF-}\beta$ with IPR-GPT yields the best average performance over two accuracy-based metrics compared to non-fine-tuned models used in place of the comprehensive expert.
The problem is well motivated. The focus on robustness through audit reversal data augmentation is a significant and important idea. There are a number of internal baselines and variations of the framework. The authors construct a dataset to facilitate training and evaluation of their approach. The work considers a number of LLMs in its evaluation.
There is a lack of external baselines. While the authors state that there is a lack of prior work on IPR, they clearly cite some single-modality work in Section 2.1. Why not take one state-of-the-art approach from those listed in that section and use it as a baseline, to see how this work ranks within the existing body of research? One would expect the multimodal nature of this work to give it a significant advantage and to show that, moving forward, multimodality is the way to go. In the same vein, what is the reasoning for not evaluating the approach on already available datasets, so that the results can be compared with existing work? Moreover, if there is no intention of publicly releasing the IPR-Audit dataset, the evaluation results would not mean much in isolation, as they would not be comparable with other past or future methods.
What is the reasoning for focusing exclusively on accuracy-based metrics in the evaluation? Confusion-matrix-based metrics (such as recall, precision, TPR, FPR, and F1 score) are much more informative for this use case, especially when the prevalence of infringement cases in the test data is not specified. On that note, could you list the BFI and TMI pass@1 of classifiers that always return Approve or Reject, respectively? It would help to see how the proposed methods compare even to these "dumb" classifiers.
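To make this concrete, the metrics and the two constant baselines can be computed directly from the Approve/Reject decisions. A minimal sketch, assuming that "Reject" (a flagged infringement) is treated as the positive class; this label mapping is my own assumption, not something stated in the paper:

```python
# Minimal sketch (not from the paper): confusion-matrix metrics for
# Approve/Reject audit decisions, plus the accuracy of the two trivial
# constant classifiers ("always Approve", "always Reject").
from collections import Counter

def confusion_metrics(y_true, y_pred, positive="Reject"):
    """Treat 'Reject' (infringement flagged) as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # recall = TPR
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}

def constant_baselines(y_true):
    """Accuracy of classifiers that always answer Approve or always Reject."""
    counts, n = Counter(y_true), len(y_true)
    return {"always_approve": counts["Approve"] / n, "always_reject": counts["Reject"] / n}
```

Reporting the constant-baseline numbers next to the pass@1 scores would immediately show how much of the reported accuracy is explained by class prevalence alone.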
The arguments presented in Section 4.2 about tensor swapping are relatively weak, as the two parts of $\text{MAAF-}\alpha$ can be run in batches, one stage after another, rather than sequentially for each sample (or on separate GPUs for each part of the pipeline). Moreover, even the reasoning expert alone takes a substantial amount of GPU memory, which may cause tensor swapping or out-of-memory errors, since the key-value cache grows during generation and activations require additional memory on top of the model weights. Do you have any insights into how quantized versions of these reasoning models behave?
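To illustrate the batching argument, here is a minimal sketch in which each MAAF-α stage is run over the whole batch before the next model is loaded, so the two models never have to alternate in memory per sample. The two expert classes are hypothetical stand-ins for the VLM-based and LLM-based stages, not the authors' implementation:

```python
# Hypothetical sketch of stage-wise batching for a two-stage MAAF-α pipeline.
# The expert classes below are placeholders, not the paper's code.

class VisionExpert:                      # stand-in for the VLM-based image expert
    def run(self, sample):
        return f"vision findings for {sample}"

class ReasoningExpert:                   # stand-in for the LLM-based comprehensive expert
    def run(self, sample, vision_output):
        return f"decision for {sample} given ({vision_output})"

def audit_batch(samples):
    vision = VisionExpert()              # only this model needs to be resident for stage 1
    vision_outputs = [vision.run(s) for s in samples]
    del vision                           # release it before loading the next stage

    reasoner = ReasoningExpert()         # loaded once for the whole batch, not per sample
    return [reasoner.run(s, v) for s, v in zip(samples, vision_outputs)]

print(audit_batch(["listing-1", "listing-2"]))
```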
There are some minor presentation concerns which give the impression that the manuscript was rushed and prematurely submitted:
- Missing reference on line 156
- Missing character on line 93
- Unclear notation in Equations (2) and (3): what are $b$ and $B_t$?
- It is unclear what LogoBank is (line 207). Could it be cited if it is an external work?
Why is the exemption expert needed in the form presented? It appears to be something that could be checked programmatically as a simple if/else condition, in some cases skipping the whole classification procedure and going straight to the answer without needing the comprehensive expert at all (see the sketch below).
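For illustration, a minimal sketch of what such a programmatic exemption check could look like as an early exit in front of the pipeline; the field names and exemption conditions are hypothetical examples, not the rules actually used in the paper:

```python
# Hypothetical early-exit exemption check; field names and conditions are
# illustrative only, not the paper's actual exemption rules.

def audit(listing, run_full_pipeline):
    # e.g. the seller is the brand owner, or holds a recorded authorization
    if listing.get("seller_is_brand_owner") or listing.get("has_valid_authorization"):
        return "Approve"                    # exemption applies: skip all experts
    return run_full_pipeline(listing)       # otherwise fall through to the MAAF pipeline

# usage with a dummy downstream pipeline
decision = audit({"seller_is_brand_owner": True}, run_full_pipeline=lambda l: "Reject")
```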
It is unclear how the law expert takes the law into account; could you expand on this?
In audit reversal data augmentation, how is $D_\text{rev}$ determined? Was it annotated by a human?
Can you expand on why the framework would lose functionality if you performed an ablation study excluding specific experts? From my understanding, their outputs are simply included in the input of the comprehensive expert, so excluding some of that information would not make much difference from an engineering standpoint (see the sketch below). Considering that IPR-GPT is fine-tuned on specific inputs, it could be excluded from that study, focusing on $\text{MAAF-}\beta$ instead. This is important in order to see how important each modality is for the given framework and whether there are redundancies in the proposed solution.
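As a concrete illustration of why this ablation seems cheap from an engineering standpoint: if the comprehensive expert simply consumes a concatenation of the other experts' outputs, ablating an expert amounts to omitting its section from the prompt. The expert names and the prompt format below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of a per-expert ablation by dropping sections from the
# comprehensive expert's prompt; names and format are illustrative only.

EXPERTS = ["image", "information", "exemption", "law"]

def build_prompt(expert_outputs, exclude=()):
    sections = [f"[{name} expert]\n{expert_outputs[name]}"
                for name in EXPERTS if name not in exclude and name in expert_outputs]
    return "\n\n".join(sections) + "\n\nFinal decision (Approve/Reject):"

# e.g. ablating the law expert for one sample
outputs = {"image": "logo matches brand X", "information": "declared brand Y",
           "exemption": "no exemption applies", "law": "relevant statute cited"}
print(build_prompt(outputs, exclude=("law",)))
```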
The IPR-Audit dataset appears to be a significant contribution. However, could the prevalence of infringement cases in the dataset be listed? Is the dataset intended to be released publicly to the community?
Fully human-written |
When Small Models Team Up: A Weak‑Expert Ensemble Surpassing LLMs for Automated Intellectual‑Property Audits |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper addresses the challenges of Intellectual Property Rights (IPR) enforcement on e-commerce platforms, where traditional manual review and unimodal AI models are limited by scalability and adaptability. It proposes IPRGPT, a novel multimodal framework integrating two core components: the Multi-Agent Audit Framework (MAAF) and the Audit Reversal (AR) Data Augmentation mechanism. MAAF, implemented in two versions (MAAF-α using LLMs/VLMs and MAAF-β using lightweight models), decomposes IPR auditing into specialized subtasks assigned to dedicated "expert" modules. AR enhances model robustness by generating adversarial training cases via reversing expert decisions. The paper also introduces the IPR-Audit dataset, focusing on two key violations: Brand Field Inconsistency (BFI) and Trademark Infringement (TMI). Extensive experiments show that IPRGPT (especially with MAAF-β) outperforms state-of-the-art LLMs/VLMs, achieving up to 24.26% performance improvement, while reducing GPU memory usage by over 60% and inference latency. It challenges the notion that larger multimodal LLMs are inherently superior, demonstrating that a purpose-built ensemble of lightweight experts delivers higher accuracy at a lower cost.
- Innovative Multi-Agent Framework Design: MAAF’s decomposition of IPR auditing into specialized, modular experts (Image, Information, Exemption, Law, Comprehensive) enables targeted handling of multimodal data (text/images/metadata), addressing the limitations of unimodal models and monolithic VLMs.
- Efficient Performance-Resource Trade-off: By combining lightweight models (e.g., YOLOv8, BERT-based NER) and AR data augmentation, IPRGPT achieves superior accuracy (82.49% average pass@1) while reducing GPU memory demand by ~50% and inference latency by ~2.5x compared to VLM-based MAAF-α, making it scalable for real-world e-commerce.
- Empirical Validity with Dedicated Benchmark: The introduction of the IPR-Audit dataset (with expert annotations for BFI/TMI) fills a gap in IPR auditing research, and rigorous experiments (ablation studies, cross-model comparisons) robustly validate the framework’s effectiveness.
- Limited Dataset Generalizability: The IPR-Audit dataset (1,837 samples) is small in scale and collected from a single unspecified e-commerce platform over one year, lacking diversity across regions, product categories (e.g., luxury goods, pharmaceuticals), and IPR violation types (e.g., patent infringement), limiting the framework’s generalization to global or niche e-commerce scenarios.
- Opaque Expert Collaboration Logic: The paper does not detail how the Comprehensive Expert synthesizes outputs from other experts (e.g., weight assignment to different experts, conflict resolution for contradictory expert decisions), reducing the framework’s interpretability and reproducibility.
- Neglect of Dynamic IPR Regulation Adaptation: E-commerce IPR regulations vary by jurisdiction and evolve over time (e.g., new exemption rules for second-hand digital goods), but IPRGPT’s rule-based Exemption/Law Experts lack mechanisms to automatically update to regulatory changes, requiring manual intervention.
- Insufficient Comparison with Domain-Specific SOTA: While the paper compares IPRGPT to general LLMs/VLMs (e.g., GPT-4o, Gemini 1.5), it fails to benchmark against existing domain-specific IPR auditing models (e.g., specialized trademark detection systems or e-commerce compliance tools), making it hard to contextualize its novelty in the broader IPR research landscape.
- **This paper appears to be generated by LLM, but there is no LLM usage statement.**
- The small sample size (1,837) of the IPR-Audit dataset may lead to overfitting, as the model may learn platform-specific patterns rather than general IPR violation characteristics, reducing its applicability to other e-commerce platforms.
- The lack of dataset diversity (e.g., no coverage of non-physical goods like digital software or regional IPR variations) means IPRGPT may fail to detect violations in underrepresented categories or jurisdictions.
- Unclear synthesis logic of the Comprehensive Expert hinders reproducibility, as other researchers cannot replicate how expert outputs are aggregated to form final audit decisions.
- The absence of dynamic regulatory adaptation means IPRGPT requires continuous manual updates to comply with new IPR laws (e.g., EU’s Digital Services Act), increasing operational costs and delaying enforcement.
- The failure to compare with domain-specific SOTA models leaves uncertainty about whether IPRGPT’s performance gains are incremental or transformative relative to existing IPR-focused tools.
- MAAF-β’s reliance on precomputed embeddings for lightweight models may limit real-time adaptability, as updating embeddings for new brands or logos requires retraining, slowing response to emerging counterfeit tactics.
- The paper does not evaluate IPRGPT’s performance under adversarial attacks (e.g., sophisticated logo manipulations or text obfuscation), leaving gaps in understanding its robustness in real-world scenarios where counterfeiters actively evade detection.
- **This paper lacks an ethics statement, an open-source statement, and an LLM usage statement, which is a clear violation of the ICLR 2026 submission policy.**
Fully AI-generated |
When Small Models Team Up: A Weak‑Expert Ensemble Surpassing LLMs for Automated Intellectual‑Property Audits |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a new approach to the classification task of determining the legality of product listings on e-commerce platforms. While current state-of-the-art methods rely on monolithic LLMs or VLMs, the paper introduces a system composed of several smaller models, each specializing in a specific aspect of the audit process (brand recognition, authorization verification, legal rule evaluation). Each sub-model is lightweight, although the framework still leverages Qwen-72B, which is not particularly small. Experimental results show a modest improvement in accuracy: 82.5%, compared to 81.5% achieved by o1-mini.
* Experiments show quite convincingly that good results can be achieved with less compute.
* Sound experimental methodology (fine-tuning and RL seem well conducted).
- The performance gain is very small: from 81.5% with o1-mini to 82.5%. One could easily imagine that the latest GPT or Claude models would outperform these results.
- Does not address adversarial-attack robustness, which is one of the main issues mentioned in the related work section.
- Does not quantify the compute-efficiency gain of the mixture of lightweight experts versus monolithic VLMs/LLMs.
- Very poor clarity
- Figures are never referenced in the text; are we supposed to read them on the fly?
- Figure 1: what are the two values separated by ‘/’?
- Table 2 appears before Table 3 but is referenced only after Table 3.
- Inconsistent use of MAAF: “To evaluate this, we compare VLM-based implementations MAAF-α with traditional vision models within the MAAF framework MAAF-β.” You use MAAF both for your mixture approach **and** for the regular approach that seems to come from related work. This mixing of terms is confusing.
- $B_t$ is not defined on line 230 (Eq. (3)).
- $E_\text{img}$ is not defined in Eq. (5), and *select* is not described at all.
- Many technical details in the main paper could have been moved to the appendix for the sake of readability.
- Tables 3 and 4: the highest values are bolded only when they come from your approach…
- Table 3: why is MAAF-β put in the comparison with MAAF-α, and not MAAF-α?
- Citation error at line 156
- Weak related works section
- The strongest contribution of this paper is showing that similar results can be achieved with less compute, but this is not how it is currently framed, which is an essential epistemic limitation.
- One of the main concerns raised was adversarial attacks (detection evasion). What in this paper addresses this?
- It would be good to have a figure illustrating the compute efficiency of large monolithic approaches versus yours.
- IPR-GPT without Audit Reversal (AR) augmentation: is it combined with MAAF-β?
- “Pass@1” in this paper seems to just mean **accuracy**; is that right? What do you mean by “@1” here?
Fully human-written |
When Small Models Team Up: A Weak‑Expert Ensemble Surpassing LLMs for Automated Intellectual‑Property Audits |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes IPR-GPT to address challenges in Intellectual Property Rights (IPR) enforcement on e-commerce platforms. To address the limitations of human-dependent or unimodal AI systems, IPR-GPT proposes a Multi-Agent Audit Framework (MAAF) that combines multi-modal specialized experts in visual analysis, textual reasoning, legal compliance, and exemption handling, along with an Audit Reversal (AR) data augmentation mechanism that enhances model generalization by synthesizing challenging classification cases. Experimental results indicate that IPR-GPT outperforms some large-scale models at reduced cost.
Experimental results demonstrate the effectiveness of the proposed multi-modal framework and the combination of small-scale expert models, since it can outperform many large-scale models on the task.
- This paper is quite incremental and lacks novelty. Multi-agent frameworks have already been adopted for multiple tasks in other domains. The MAAF in IPR-GPT merely introduces a brand detector to bring in image-modal information, limiting its technical contribution.
- The source of IPR-GPT's performance improvement over large-scale models is unclear. The Image Expert is trained on brand-detection datasets, which introduce extra external knowledge about brands into the framework. Whether the large-scale models have learned this knowledge remains unknown. Meanwhile, the paper does not conduct experiments that provide similar information to the large-scale models. It is therefore not sound to attribute the performance improvement directly to the design of MAAF/AR while ignoring the introduction of this more significant external knowledge. As a result, I think the performance-improvement contribution of IPR-GPT is unclear and potentially over-estimated.
Please see the Weaknesses section above.
Fully human-written |