EvoSearch
Scaling Image and Video Generation via Test-Time Evolutionary Search

Haoran He1,2    Jiajun Liang2    Xintao Wang2   
Pengfei Wan2    Di Zhang2    Kun Gai2    Ling Pan1   
1Hong Kong University of Science and Technology    2KwaiVGI, Kuaishou Technology

We propose Evolutionary Search (EvoSearch), a novel and generalist test-time scaling framework applicable to both image and video generation tasks. EvoSearch significantly enhances sample quality through strategic computation allocation during inference, enabling Stable Diffusion 2.1 to exceed GPT4o, and Wan 1.3B to outperform the Wan 14B and Hunyuan 13B models with 10× fewer parameters.

Method
EvoSearch introduces a novel perspective that reinterprets the denoising trajectory as an evolutionary path, where both the initial noise \(x_T\) and the intermediate state \(x_t\) can be evolved towards higher-quality generation, actively expanding the exploration space beyond the constraints of the pre-trained model's distribution. Unlike classic evolutionary algorithms, which optimize a population in a fixed space, EvoSearch dynamically moves the evolutionary population forward along the denoising trajectory, starting from \(x_T\) (i.e., Gaussian noise).
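
One way to picture this is the toy loop below. It is a minimal sketch under stated assumptions, not the authors' implementation: `denoise_step`, `reward`, and `mutate` are hypothetical callables supplied by the caller, the reward is scored directly on the current state, and plain top-k selection stands in for whatever selection rule the method actually uses.

```python
import numpy as np

def evolve_along_trajectory(x_T, denoise_step, reward, mutate,
                            timesteps, evolve_at, n_keep, rng):
    """Toy search: carry a population through denoising and, at chosen
    timesteps, keep the best candidates and refill with mutated copies."""
    population = np.asarray(x_T)              # shape: (pop_size, ...)
    pop_size = population.shape[0]
    if not 0 < n_keep < pop_size:
        raise ValueError("n_keep must be in (0, pop_size)")
    for t in timesteps:                       # e.g. T, T-1, ..., 1
        if t in evolve_at:
            scores = np.array([reward(x) for x in population])
            # Top-k selection (an assumption made for this sketch).
            parents = population[np.argsort(scores)[-n_keep:]]
            # Refill the population by mutating randomly chosen parents.
            idx = rng.integers(0, n_keep, size=pop_size - n_keep)
            children = np.stack([mutate(parents[i], t, rng) for i in idx])
            population = np.concatenate([parents, children], axis=0)
        # The whole population moves one step along the trajectory.
        population = np.stack([denoise_step(x, t) for x in population])
    return population

# Stand-in components so the sketch runs without a real model.
rng = np.random.default_rng(0)
x_T = rng.standard_normal((8, 16))            # 8 candidates, 16-dim "latent"
final = evolve_along_trajectory(
    x_T,
    denoise_step=lambda x, t: 0.9 * x,        # placeholder for a sampler step
    reward=lambda x: -float(np.sum(x ** 2)),  # placeholder for a reward model
    mutate=lambda x, t, rng: x + 0.1 * rng.standard_normal(x.shape),
    timesteps=range(10, 0, -1),
    evolve_at={10, 5},                        # evolve x_T and one intermediate x_t
    n_keep=4,
    rng=rng,
)
```

The point of the sketch is the control flow: selection and mutation happen at a few timesteps, and the surviving population continues denoising from that timestep instead of restarting from fresh noise.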


EvoSearch develops different mutation operations for initial noises and intermediate denoising states:

Initial noise mutation: The following mutation operation is designed to preserve the Gaussian nature of the noise:

\[x^{\rm child}_T = \sqrt{1-\beta^2}x_{T}^{\rm parent} + \beta \epsilon_T, \quad \epsilon_T \sim \mathcal{N}(0,I),\]

where \(\beta\) is a hyperparameter that controls the strength of added stochasticity to the parents. The first term ensures that the mutated children preserve the high-reward region density, while the second term encourages exploration.
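
The Gaussian-preserving property can be checked numerically. If the parent is distributed as \(\mathcal{N}(0,I)\) and \(\epsilon_T\) is independent of it, the child has covariance \((1-\beta^2)I + \beta^2 I = I\). The NumPy sketch below illustrates this; the function name, the latent shape, and the value of `beta` are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def mutate_initial_noise(parent, beta, rng):
    """Return sqrt(1 - beta^2) * parent + beta * eps, with eps ~ N(0, I)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    eps = rng.standard_normal(parent.shape)
    return np.sqrt(1.0 - beta ** 2) * parent + beta * eps

rng = np.random.default_rng(0)
parent = rng.standard_normal((4, 64, 64))    # hypothetical latent shape
child = mutate_initial_noise(parent, beta=0.5, rng=rng)

# Empirical standard deviation stays close to 1 for a Gaussian parent.
print(float(child.std()))
```

With `beta = 0` the child is an exact copy of its parent, and with `beta = 1` it is a fresh Gaussian sample, so `beta` interpolates between pure exploitation and pure exploration.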

Intermediate denoising state mutation: To synthesize meaningful variations while preserving the intrinsic structure of the latent state \(x_t\), we propose an alternative mutation operator inspired by the reverse-time SDE:

\[ x_t^{\rm child}=x^{\rm parent}_t+\sigma_t \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0,I), \]

where \(\sigma_t\) is the diffusion coefficient defined in the reverse-time SDE, controlling the level of injected stochasticity.
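
A minimal NumPy sketch of this operator follows. It assumes `sigma_t` is passed in as a plain float; how that value is obtained from a particular sampler's noise schedule is not specified here, and the number used below is a placeholder. Note that, unlike the initial noise mutation, this update is additive, so for independent noise the per-element variance of the child exceeds that of the parent by \(\sigma_t^2\).

```python
import numpy as np

def mutate_intermediate_state(parent, sigma_t, rng):
    """Return parent + sigma_t * eps, with eps ~ N(0, I)."""
    if sigma_t < 0.0:
        raise ValueError("sigma_t must be non-negative")
    eps = rng.standard_normal(parent.shape)
    return parent + sigma_t * eps

rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 64, 64))       # hypothetical latent at step t
sigma_t = 0.1                                # placeholder, not from the source
child = mutate_intermediate_state(x_t, sigma_t, rng)
perturbation = child - x_t                   # equals sigma_t * eps
```

Because the perturbation scales with `sigma_t`, small values keep the child close to the structure already present in the parent state.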

Gallery: Image Generation
Test-Time Scaling for Stable Diffusion 2.1
Test-Time Scaling for Flux.1-dev

Gallery: Video Generation
Prompt: "Several robots coordinate to move a large object across a factory floor. The camera captures the synchronized movements of the robots from a bird's-eye view, showing their precise coordination. The shot then shifts to ground level, focusing on the smooth, synchronized actions of the robots as they work together."

Best of N

Particle Sampling

EvoSearch (Ours)

Prompt: "A teddy bear at the supermarket. The camera is zooming out."

Best of N

Particle Sampling

EvoSearch (Ours)

Prompt: "Two cars collide at an intersection."

Best of N

Particle Sampling

EvoSearch (Ours)

Prompt: "A spider with the body of a rabbit, scurrying across the ground with immense speed."

Best of N

Particle Sampling

EvoSearch (Ours)

Prompt: "A cat is on the right of a rock, then the cat runs to the left of the rock."

Best of N

Particle Sampling

EvoSearch (Ours)

Given equivalent inference time, Wan 1.3B with EvoSearch can achieve competitive or even better performance than the 10× larger Wan 14B model. These findings highlight the significant potential of test-time scaling as a complement to traditional training-time scaling laws for visual generative models, opening new avenues for future research.

Prompt: "An owl with the body of a tiger, prowling the night skies with sharp talons."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "A cheetah doing yoga poses, stretching out its limbs with remarkable flexibility and focus"

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "A kite and a balloon flying side by side, each drifting gracefully in the wind."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "A person's hair changes from black to blonde."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "The plastic water cup turned into a metal water cup."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "A wooden toy is placed gently on the surface of a small bowl of water."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

Prompt: "A water droplet slides down the edge of a smooth sheet of aluminum, maintaining its spherical form."

Wan 14B

Wan 1.3B + EvoSearch (Ours)

BibTeX

@misc{he2025scaling,
  title={Scaling Image and Video Generation via Test-Time Evolutionary Search},
  author={Haoran He and Jiajun Liang and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Ling Pan},
  year={2025},
  eprint={2505.17618},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}