
We propose a Product of Experts (PoE) framework for visual synthesis tasks. The framework performs inference-time knowledge composition from heterogeneous sources including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators.
Task Setup. Given an input image and an input physics simulation sequence describing precise object motions, we aim to generate natural videos that animate the input image with simulation-aligned object motions.
Expert Setup. This task can be cast as sampling from the product distribution p(x) of (1) a physics-aware expert that ensures physical accuracy and adherence to user controls, and (2) a video generation expert that ensures output plausibility.
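Concretely, the target takes the standard product-of-experts form p(x) ∝ p_phys(x) · p_vid(x), so its score decomposes as ∇_x log p(x) = ∇_x log p_phys(x) + ∇_x log p_vid(x); this additive score is what the annealed samplers sketched below operate on (notation ours).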
Framework Instantiation. We adopt Annealed Importance Sampling (AIS) for efficient sampling from p(x), constructing intermediate distributions with (i) linear interpolants from a base distribution to p(x) or (ii) autoregressive factors of p(x) as described below.
Directly sampling from the target product distribution is often intractable. Therefore, we construct an annealing path for PoE sampling that gradually transitions from a simple Gaussian noise distribution to the target distribution via linear interpolants between the two.
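To make this concrete, the toy sketch below realizes the annealing path as a log-linear blend between the Gaussian base and the (unnormalized) product target; this particular interpolation and the `expert_log_densities` callables are our simplifications for illustration, not necessarily the exact parameterization used with the video experts.

```python
import numpy as np

def log_p_base(x):
    """Log-density of the standard Gaussian base N(0, I), up to an additive constant."""
    return -0.5 * np.sum(x ** 2, axis=-1)

def log_p_target(x, expert_log_densities):
    """Product-of-experts target: the sum of the experts' log-densities (up to a constant)."""
    return sum(f(x) for f in expert_log_densities)

def annealed_log_density(x, beta, expert_log_densities):
    """Intermediate distribution pi_beta, blending the base (beta=0) into the target (beta=1)."""
    return (1.0 - beta) * log_p_base(x) + beta * log_p_target(x, expert_log_densities)
```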
Expert Instantiation. The two aforementioned experts are implemented as (1) a depth-to-video model conditioned on depth maps rendered from the input simulations, and (2) an image-to-video model, both built on Wan2.1-I2V-14B backbones. The first video frames are synthesized using our method described below.
"A seagull stands in a shallow puddle. Other gulls mill about behind, while a chipped, weather-beaten ball sails just past a vigilant seagull."
Input Image
Input Simulation
Input Text
Image-to-Video (Baseline)
Depth-to-Video (Baseline)
Trajectory-to-Video (Baseline)
PoE (Ours)
"A brown-and-white dog crouches on a patterned red rug, eyes locked on a blue frisbee as light filters through the nearby curtains, while a rubber ball falls and bounces off the ground."
Input Image
Input Simulation
Input Text
Image-to-Video (Baseline)
Depth-to-Video (Baseline)
Trajectory-to-Video (Baseline)
PoE (Ours)
The sampling process is illustrated below. The goal is to sample from a target product distribution of two experts (blue and orange contour plots). A sequence of annealing distributions is constructed using linear interpolations between a standard normal distribution at t=T (T=5) and the target distribution at t=1. Each sampling iteration t consists of an MCMC initialization step that moves particles from the annealing distribution at t+1 to the one at t, followed by MCMC steps (Langevin sampling, visualized in the videos below) for refinement; a code sketch of this loop is given after the visualization.
[Sampling visualization] Panels: t=5, t=4, t=3, t=2, t=1.
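Continuing the toy construction above, a minimal sketch of this annealed sampling loop is given below; the step size, particle count, and the single-step MCMC initialization move are illustrative assumptions rather than the settings used with the video models.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(x, score_fn, step):
    """One unadjusted Langevin update targeting the current annealing distribution."""
    noise = rng.standard_normal(x.shape)
    return x + 0.5 * step * score_fn(x) + np.sqrt(step) * noise

def sample_poe_annealed(expert_scores, n_particles=256, dim=2, T=5, n_mcmc=50, step=1e-2):
    """Annealed PoE sampling (2D toy sketch).

    expert_scores: callables x -> grad_x log p_expert(x), one per expert.
    Particles start from the Gaussian base at t=T and are refined with Langevin MCMC
    at every annealing level until t=1, where the level coincides with the target product.
    """
    x = rng.standard_normal((n_particles, dim))        # global init from N(0, I) at t=T
    betas = np.linspace(0.0, 1.0, T)                   # beta=0 at t=T, ..., beta=1 at t=1
    for beta in betas[1:]:
        def score_fn(y, b=beta):
            base = -y                                  # score of the N(0, I) base
            target = sum(s(y) for s in expert_scores)  # PoE score: sum of expert scores
            return (1.0 - b) * base + b * target
        x = langevin_step(x, score_fn, step)           # MCMC initialization move to the next level
        for _ in range(n_mcmc):                        # Langevin refinement at this level
            x = langevin_step(x, score_fn, step)
    return x

# Example: product of two unit-variance Gaussian experts centered at (-1, 0) and (+1, 0);
# samples concentrate near the origin, where both experts agree.
samples = sample_poe_annealed([lambda y: -(y - np.array([-1.0, 0.0])),
                               lambda y: -(y - np.array([+1.0, 0.0]))])
```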
Alternatively, the annealing path can be constructed autoregressively, generating one video chunk at a time conditioned on the previously generated chunks.
Expert Instantiation. The two experts are implemented as (1) a Gaussian distribution centered at the pixel values of RGB renderings of the input simulations (second column), and (2) an autoregressive image-to-video model, FramePack.
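For reference, the first expert is simply an isotropic Gaussian around the rendered pixels; its log-density and score are sketched below (the variance `sigma` is an illustrative assumption):

```python
import numpy as np

def sim_expert_log_density(x, render, sigma=0.1):
    """Gaussian expert centered at the RGB rendering of the input simulation."""
    return -0.5 * np.sum((x - render) ** 2) / sigma ** 2

def sim_expert_score(x, render, sigma=0.1):
    """Score of the Gaussian expert: pulls samples toward the rendered pixel values."""
    return -(x - render) / sigma ** 2
```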
[Video comparisons, two examples] Panels: First Frame, Input Simulation, Image-to-Video (Baseline), PoE (Ours).
With autoregressive annealing, the sampling process starts at t=1, considering only the first coordinate, and then iteratively accounts for more coordinates via MCMC steps (Gibbs sampling over the first t coordinates); a minimal sketch follows the visualization below.
[Sampling visualization] Panels: t=1 (global init), t=1 (MCMC steps), t=2 (MCMC init), t=2 (MCMC steps).
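A minimal sketch of this schedule is given below. The helper `sample_cond` is a hypothetical conditional sampler under the product distribution, and treating each coordinate as a scalar is a toy simplification of the chunked video case.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_annealing(sample_cond, dim, n_sweeps=10):
    """Autoregressive annealing (toy sketch).

    sample_cond(i, x): hypothetical helper that draws coordinate i (a video chunk in the
    full setting) from its conditional under the product distribution, given the other
    coordinates of x. At level t only the first t coordinates have been brought to the
    target; later coordinates keep their initialization.
    """
    x = rng.standard_normal(dim)            # global init
    for t in range(1, dim + 1):             # grow the active prefix one coordinate at a time
        x[t - 1] = sample_cond(t - 1, x)    # MCMC init for the newly added coordinate
        for _ in range(n_sweeps):           # Gibbs sweeps over the first t coordinates
            for i in range(t):
                x[i] = sample_cond(i, x)
    return x
```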
Task Setup. A user provides an image to be edited, a 3D asset posed in a graphics engine to specify where to insert the asset, and a text prompt describing higher-level information such as object materials ("metal") and semantics ("coke can").
Expert Setup. This task can be cast as sampling from the product distribution of (1) an expert inheriting geometric constraints from the graphics engine, and (2) an image generation expert that provides a natural-image prior for producing realistic visual outputs.
Expert Instantiation. The two experts are implemented as (1) FLUX.1 Depth [dev] conditioned on depth renderings, and (2) an image inpainting model, FLUX.1 Fill [dev], conditioned on the pixels outside the objects to be inserted.
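As a rough sketch of how the two conditionings interact during sampling, the function below combines the experts' per-step predictions with a simple weighted sum; `depth_expert` and `fill_expert` are hypothetical wrappers around the two models, and this combination is a crude stand-in for the annealed product sampling described above, not the method itself.

```python
def poe_prediction(x_t, t, depth_expert, fill_expert, prompt, depth, background, mask,
                   w_depth=0.5, w_fill=0.5):
    """Combine the two experts' predictions at noise level t (sketch, our simplification).

    depth_expert / fill_expert are hypothetical wrappers: each returns its model's
    prediction for the noisy latent x_t under its own conditioning signal.
    """
    pred_depth = depth_expert(x_t, t, prompt=prompt, depth=depth)                 # geometry from the graphics engine
    pred_fill = fill_expert(x_t, t, prompt=prompt, image=background, mask=mask)   # natural-image prior outside the object
    return w_depth * pred_depth + w_fill * pred_fill                              # illustrative weights
```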
Input Image & Rendering
FLUX RF-Solver
Ours
"A big coke can standing on the ground."
Input Image & Rendering
FLUX RF-Solver
Ours
"A clear glass container on the ground."
Input Image & Rendering
FLUX RF-Solver
Ours
"A polished, mirror-bright metal box on the ground."
Task and Expert Setup. We formulate the text-to-image generation task as sampling from the product of multiple regional image generators, each controlling a specific image region, and a discriminative expert, e.g., a visual language model. Doing so allows us to decompose a complex text prompt into simpler components, and to incorporate knowledge from any discriminative model that defines an image distribution but does not permit direct sampling.
Expert Instantiation. An LLM parses the input prompt into a set of region-specific prompts with associated bounding boxes. We use FLUX.1-[dev] as the regional generators and VQAScore as the discriminative expert.
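As a rough sketch of the regional composition (our notation), each regional generator can be treated as uninformative outside its bounding box, so the product's score is the sum of the experts' scores masked to their regions; the discriminative expert enters through the resampling step shown further below.

```python
import numpy as np

def regional_poe_score(x, regions, expert_score):
    """Compose regional generators into one score field (toy sketch).

    regions: list of (prompt, bbox) pairs, e.g. produced by an LLM parser, with
             bbox = (top, left, height, width) in pixel coordinates.
    expert_score(x, prompt): score (gradient of the log-density) of one regional expert.
    Each expert is assumed uninformative outside its box, so the product's score is the
    sum of the experts' scores masked to their regions.
    """
    total = np.zeros_like(x)
    for prompt, (top, left, h, w) in regions:
        mask = np.zeros(x.shape[:2] + (1,) * (x.ndim - 2))
        mask[top:top + h, left:left + w] = 1.0
        total += mask * expert_score(x, prompt)
    return total
```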
[Sampling visualization] Panels: Particle 1, Particle 2, Particle 3, Particle 4.
Prompt: "Five ants are carrying biscuits, an ant is standing on a leaf directing them."
The target distribution is the product of two generative experts (blue and orange contour plots) and one discriminative expert (assigning reward 1 to samples within the red circle and 0 outside). At each iteration t, MCMC steps are followed by a resampling step (shown in the last frame of each video) that reweights particles based on their rewards; a sketch of this step follows the visualization.
[Sampling visualization] Panels: t=5, t=4, t=3, t=2, t=1.
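A minimal sketch of the reward-driven resampling step is given below; the reward-to-weight mapping and the temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_by_reward(particles, reward_fn, temperature=1.0):
    """Resampling step driven by a discriminative expert (sketch).

    reward_fn(x): non-negative reward for a particle, e.g. a VQAScore-style score in the
    text-to-image setting, or the 0/1 indicator of the red circle in the toy example.
    Particles are resampled in proportion to reward**(1/temperature), so low-reward
    particles are dropped and high-reward particles are duplicated.
    """
    rewards = np.array([reward_fn(p) for p in particles], dtype=float)
    weights = np.maximum(rewards, 1e-12) ** (1.0 / temperature)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=weights)
    return [particles[i] for i in idx]
```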
Task Setup. We introduce a variant of the sim-to-video generation task above, where input simulations specify content for full scenes as opposed to foreground objects only. The expert setup and instantiation are the same as in the linear-annealing section.
[Video comparison] Panels: Input Image, Ours, DDIM Inversion, PhysGen3D Simulation, Depth-to-Video, Image-to-Video.
@article{zhang2025poe,
title = {Product of Experts for Visual Generation},
author = {Yunzhi Zhang and Carson Murtuza-Lanier and Zizhang Li and Yilun Du and Jiajun Wu},
year = {2025},
journal = {arXiv preprint arXiv:2506.08894},
}