GARDO: Reinforcing Diffusion Models without Reward Hacking
Haoran He1,2    Yuxiao Ye1    Jie Liu3    Jiajun Liang2    Zhiyong Wang4    Ziyang Yuan2    Xintao Wang2    Hangyu Mao2    Pengfei Wan2    Ling Pan1   
1Hong Kong University of Science and Technology    2KwaiVGI, Kuaishou Technology    3CUHK MMLab    4The University of Edinburgh

Abstract

Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
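For concreteness, the sketch below shows one plausible way the ingredients described above could combine in a single GRPO-style update: a clipped policy objective driven by diversity-shaped advantages, plus a KL term applied only to a gated subset of samples. This is an illustrative reading under our own assumptions, not the paper's released objective; `beta`, `clip_eps`, and the tensor names are ours.

```python
# Illustrative sketch (not the authors' code) of a GRPO-style update combining a
# clipped policy term with a gated KL penalty. `gated_kl` and `shaped_advantage`
# are assumed to be produced by the components sketched further below.
import torch


def gardo_style_loss(log_prob_new: torch.Tensor,
                     log_prob_old: torch.Tensor,
                     shaped_advantage: torch.Tensor,
                     gated_kl: torch.Tensor,
                     beta: float = 0.04,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy objective plus a (gated) KL penalty toward the reference."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * shaped_advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * shaped_advantage
    policy_loss = -torch.min(unclipped, clipped).mean()
    return policy_loss + beta * gated_kl
```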

Takeaways

👉 Takeaway 1: Regularization is not universally required; it is needed only for samples whose proxy rewards are spurious.

👉 Takeaway 2: A static reference model inevitably becomes a constraint on RL optimization. Dynamically updating the reference model facilitates prolonged improvement.

👉 Takeaway 3: Multiplicative advantage reshaping exclusively within positive samples enables robust diversity improvement.
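The first two takeaways suggest a simple shape for the regularizer: compute the KL to the reference as usual, but apply it only to high-uncertainty samples, and periodically refresh the reference from the online policy. The sketch below is our illustrative PyTorch reading, not the released implementation; `uncertainty`, `gate_quantile`, and `refresh_interval` are assumed names, and the quantile-based gate is one plausible gating rule.

```python
# Minimal sketch of uncertainty-gated KL regularization and periodic reference
# refresh (Takeaways 1-2); names and thresholds are illustrative assumptions.
import copy
import torch


def gated_kl_penalty(kl_per_sample: torch.Tensor,
                     uncertainty: torch.Tensor,
                     gate_quantile: float = 0.8) -> torch.Tensor:
    """Penalize KL to the reference only for the most uncertain samples.

    kl_per_sample: per-sample KL between the online policy and the reference.
    uncertainty:   per-sample uncertainty of the proxy reward (e.g. reward-ensemble
                   disagreement); samples above the batch quantile get regularized.
    """
    threshold = torch.quantile(uncertainty, gate_quantile)
    gate = (uncertainty >= threshold).float()   # 1 for high-uncertainty samples
    return (gate * kl_per_sample).mean()        # no penalty for the rest


def maybe_refresh_reference(step: int,
                            policy: torch.nn.Module,
                            reference: torch.nn.Module,
                            refresh_interval: int = 1000) -> torch.nn.Module:
    """Periodically sync the frozen reference to the current policy so the
    regularization target keeps pace with the online policy's capability."""
    if step > 0 and step % refresh_interval == 0:
        reference = copy.deepcopy(policy)
        for p in reference.parameters():
            p.requires_grad_(False)
    return reference
```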

Method overview
Based on our findings, we introduce Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO). GARDO employs an uncertainty-driven, gated KL mechanism to control the proportion of regularization, avoiding unnecessary penalties. Our proposed diversity-aware advantage shaping effectively encourages exploration of novel states.
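As a companion to the overview, here is a minimal sketch of the diversity-aware advantage shaping, applied multiplicatively and only within positive-advantage samples (Takeaway 3). It assumes a per-sample `diversity` score is available (e.g., feature distance to other samples from the same prompt); `alpha` and the batch-wise normalization are illustrative choices, not details from the paper.

```python
# Minimal sketch (our reading, not the released implementation) of multiplicative,
# diversity-aware advantage reshaping restricted to positive-advantage samples.
import torch


def diversity_shaped_advantage(advantage: torch.Tensor,
                               diversity: torch.Tensor,
                               alpha: float = 0.5) -> torch.Tensor:
    """Amplify positive advantages for samples that are both good and diverse.

    Negative advantages are left untouched, so low-quality samples are never
    rewarded for being diverse and the optimization remains stable.
    """
    # Normalize diversity to [0, 1] within the batch for a bounded multiplier.
    d = (diversity - diversity.min()) / (diversity.max() - diversity.min() + 1e-8)
    multiplier = 1.0 + alpha * d                    # >= 1, larger for diverse samples
    positive = advantage > 0
    return torch.where(positive, advantage * multiplier, advantage)
```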

Results
Learning curves and out-of-distribution (OOD) generalization results for GARDO and baselines trained on the GenEval and OCR tasks. GARDO not only matches the sample efficiency of the KL-free baseline but also effectively mitigates reward hacking, as evidenced by its superior performance on unseen metrics.
Visual Comparison

We show images generated at different points during training. As the training step increases, we observe that Flow-GRPO visibly hacks the reward (i.e., exploits flaws in the proxy), yielding reduced perceptual quality. In contrast, GARDO maintains high visual quality throughout training without compromising optimization performance on the proxy reward.



BibTeX

@misc{he2025gardo,
  title={GARDO: Reinforcing Diffusion Models without Reward Hacking},
  author={Haoran He and Yuxiao Ye and Jie Liu and Jiajun Liang and Zhiyong Wang and Ziyang Yuan and Xintao Wang and Hangyu Mao and Pengfei Wan and Ling Pan},
  year={2025},
  eprint={2505},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}