Product of Experts for Visual Generation
arXiv Preprint 2025
Yunzhi Zhang1, Carson Murtuza-Lanier1, Zizhang Li1, Yilun Du2, Jiajun Wu1
1Stanford University
2Harvard University

We propose a Product of Experts (PoE) framework for visual synthesis tasks. The framework composes knowledge at inference time from heterogeneous sources, including visual generative models, visual language models, and sources of human-crafted knowledge such as graphics engines and physics simulators.

Physics-Simulator-Instructed Video Generation

Task Setup. Given an input image and an input physics simulation sequence describing precise object motions, we aim to generate natural videos that animate the input image with simulation-aligned object motions.
Expert Setup. This task can be cast as sampling from the product distribution p(x) of (1) a physics-aware expert that ensures physical accuracy and adherence to user controls, and (2) a video generation expert that ensures output plausibility.
Framework Instantiation. We adopt Annealed Importance Sampling (AIS) for efficient sampling from p(x), constructing intermediate distributions via (i) linear interpolants from a base distribution to p(x), or (ii) autoregressive factors of p(x), as described below.

(i) PoE Sampling with a Linear Annealing Path

Directly sampling from the target product distribution is often intractable. Therefore, we construct an annealing path for PoE sampling that gradually transitions from a simple Gaussian noise distribution to the target distribution via linear interpolants between the two.
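Concretely, writing the two experts as p_phys and p_vid, the target and one standard way to realize such a path can be sketched as follows (a sketch assuming a diffusion-style variance-preserving interpolant; the schedule notation α_t, σ_t is ours, not the paper's):

p(x) \propto p_{\mathrm{phys}}(x)\, p_{\mathrm{vid}}(x), \qquad x_t = \alpha_t x_0 + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),

p_t(x_t) = \int \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\right) p(x_0)\, dx_0,

with (α_T, σ_T) = (0, 1) recovering the Gaussian base and (α_1, σ_1) = (1, 0) recovering the target p(x).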
Expert Instantiation. The two aforementioned experts are implemented as (1) a depth-to-video model conditioned on depth maps rendered from input simulations, and (2) an image-to-video model, both built on the Wan2.1-I2V-14B backbone. First video frames are synthesized with our graphics-engine-instructed image editing method described below.


"A seagull stands in a shallow puddle. Other gulls mill about behind, while a chipped, weather-beaten ball sails just past a vigilant seagull."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"A brown-and-white dog crouches on a patterned red rug, eyes locked on a blue frisbee as light filters through the nearby curtains, while a rubber ball falls and bounces off the ground."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"Bathed in violet club lighting, a focused DJ tweaks the mixer following the beat, while a soap bubble falls and bounces off the table."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"In a neon-lit intersection are passing taxis, umbrella-clad crowds, flickering billboards on the rain-soaked street, and a person passing right in front of the camera, while a reflective metal sphere falls down."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"Pigeons are startled into a flurry of wing-beats as they lift off and scatter around the parked bicycle, as the white cloth slips off the box and flutters to the pavement."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"A woman watches city lights awaken beneath the violet-orange sunset, her hair blowing in the wind, while a faint breeze lifts the corners of a cloth draped over a stack of books on the rooftop ledge, fluttering beside."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"With each gust, the draped cloth slips off the wooden box, one edge flapping free before the whole sheet billows away down the muddy track. A distant car emerges through the hazy golden mist, its low rumble stirring loose dust and roadside tufts that whirl in the rising breeze."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"A seagull steps forward to investigate a rock covered with a cloth, as a sea breeze knocks the cloth off a rock, fluttering it toward the sand."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"A bike rider speeds along the wet pavement, pedaling forward as water splashes up from the wheels."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"Visitors wander across the rain-slick carnival as the Ferris wheel turns and its multicolored bulbs pulse in changing patterns, while a fountain's arcing spray weakens, dribbling into the reflective puddles below."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"Ducks and pigeons walk around lively. Reflective waters stretch toward snow-tipped mountains under a cloudy sky. The spray in the foreground sinks as its pressure fades."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


"Heavy raindrops burst into tiny splashes and bubbles across a water-slick surface in a dramatic close-up. The spray in the foreground sinks as its pressure fades."

Input Image

Input Simulation

Input Text


Image-to-Video (Baseline)

Depth-to-Video (Baseline)

Trajectory-to-Video (Baseline)

PoE (Ours)


The sampling process is illustrated below. The goal is to sample from the target product distribution of two experts (blue and orange contour plots). A sequence of annealing distributions is constructed by linearly interpolating between a standard normal distribution at t=T (here T=5) and the target distribution at t=1. Each sampling iteration t consists of an MCMC initialization step that moves particles from the annealing distribution at t+1 to the one at t, followed by MCMC refinement steps (Langevin sampling, visualized in the videos below).

t=5

t=4

t=3

t=2

t=1
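For intuition, a minimal NumPy sketch of this 2D toy (not the authors' code; the expert parameters, schedule, particle count, and step size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Two 2D Gaussian experts standing in for the blue and orange contours.
mu1, S1 = np.array([-1.0, 0.0]), np.array([[1.0, 0.3], [0.3, 0.5]])
mu2, S2 = np.array([1.0, 0.5]), np.array([[0.5, -0.2], [-0.2, 1.0]])

# The product of two Gaussians is Gaussian: precisions add, and the mean
# is the precision-weighted combination of the experts' means.
P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
S_star = np.linalg.inv(P1 + P2)
mu_star = S_star @ (P1 @ mu1 + P2 @ mu2)

T, n_steps, step = 5, 50, 0.05
x = rng.standard_normal((512, 2))        # t = T: standard normal particles

for t in range(T - 1, 0, -1):            # anneal toward t = 1
    a = (T - t) / (T - 1)                # interpolation weight: 0 -> 1
    cov_t = a**2 * S_star + (1 - a**2) * np.eye(2)  # intermediate covariance
    prec_t = np.linalg.inv(cov_t)
    # For simplicity, particles from level t+1 initialize level t directly.
    for _ in range(n_steps):             # Langevin refinement at level t
        score = -(x - a * mu_star) @ prec_t
        x += 0.5 * step * score + np.sqrt(step) * rng.standard_normal(x.shape)

# x now approximates samples from the product N(mu_star, S_star).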


(ii) PoE Sampling with an Autoregressive Annealing Path

Alternatively, the annealing path can be constructed autoregressively, generating one video chunk at a time conditioned on previously generated chunks.
Expert Instantiation. The two experts are implemented as (1) a Gaussian distribution centered at pixel values of RGB renderings from input simulations (second column), and (2) an autoregressive image-to-video model, FramePack.

First Frame

Input Simulation

Image-to-Video (Baseline)

PoE (Ours)


First Frame

Input Simulation

Image-to-Video (Baseline)

PoE (Ours)


With autoregressive annealing, sampling starts at t=1 with only the first coordinate, then iteratively brings in more coordinates, refining with MCMC steps (Gibbs sampling over the first t coordinates).

t=1 (global init)

t=1 (MCMC steps)

t=2 (MCMC init)

t=2 (MCMC steps)
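A matching toy sketch of the autoregressive path, treating the two coordinates of a Gaussian product target as two "chunks" (illustrative parameters; exact Gaussian conditionals stand in for chunk-wise MCMC):

import numpy as np

rng = np.random.default_rng(0)

# Product target with 2 coordinates playing the role of 2 video chunks.
mu = np.array([0.2, 0.4])                   # illustrative product mean
S = np.array([[0.6, 0.25], [0.25, 0.8]])    # illustrative product covariance
P = np.linalg.inv(S)

n = 512
x = np.zeros((n, 2))

# t = 1 (global init): sample only the first coordinate from its marginal.
x[:, 0] = mu[0] + np.sqrt(S[0, 0]) * rng.standard_normal(n)

# t = 2: bring in the second coordinate, then refine with Gibbs sweeps that
# resample each active coordinate from its exact conditional.
for _ in range(20):
    for i in (1, 0):
        j = 1 - i
        cond_mean = mu[i] - P[i, j] / P[i, i] * (x[:, j] - mu[j])
        x[:, i] = cond_mean + rng.standard_normal(n) / np.sqrt(P[i, i])

# x now approximates samples from N(mu, S).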

Graphics-Engine-Instructed Image Editing

Task Setup. A user provides an image to be edited, a 3D asset posed in a graphics engine that specifies where to insert the asset, and a text prompt describing higher-level information such as object materials ("metal") and semantics ("coke can").
Expert Setup. This task can be cast as sampling from the product distribution of (1) an expert inheriting geometric constraints from the graphics engine, and (2) an image generation expert that provides a natural image prior to produce realistic visual outputs.
Expert Instantiation. The two experts are implemented as (1) FLUX.1 Depth [dev] conditioned on depth renderings, and (2) an image inpainting model, FLUX.1 Fill [dev], conditioned on pixels outside the to-be-inserted objects.
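The property underlying all of these instantiations is that log-densities of a product add, so the experts' scores add. A 1D sanity check with Gaussian stand-ins for the two experts (toy means and variances, not the real FLUX models):

import numpy as np

rng = np.random.default_rng(0)

def score_geometry(x):   # toy stand-in for the depth-conditioned expert: N(1.0, 0.25)
    return -(x - 1.0) / 0.25

def score_context(x):    # toy stand-in for the inpainting expert: N(0.0, 1.0)
    return -x

# Langevin sampling from the product: the PoE score is the sum of scores.
x, step = rng.standard_normal(4096), 0.01
for _ in range(500):
    s = score_geometry(x) + score_context(x)
    x += 0.5 * step * s + np.sqrt(step) * rng.standard_normal(x.size)

# The product of N(1, 0.25) and N(0, 1) is N(0.8, 0.2):
print(x.mean(), x.var())   # ~0.8, ~0.2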

Input Image & Rendering

FLUX RF-Solver

Ours

"A big coke can standing on the ground."


Input Image & Rendering

FLUX RF-Solver

Ours

"A clear glass container on the ground."


Input Image & Rendering

FLUX RF-Solver

Ours

"A polished, mirror-bright metal box on the ground."

Below are uncurated outputs of our method across 10 input images and 3 graphics-engine rendering instructions, each sampled with 4 seeds. Inputs are shown in the leftmost column.

Input Image & Rendering

Seed 0

Seed 1

Seed 2

Seed 3

"A big coke can standing on the ground."


Input Image & Rendering

Seed 0

Seed 1

Seed 2

Seed 3

"A clear glass container on the ground."


Input Image & Rendering

Seed 0

Seed 1

Seed 2

Seed 3

"A polished, mirror-bright metal box on the ground."

Text-to-Image Generation

Task and Expert Setup. We formulate the text-to-image generation task as sampling from the product of multiple regional image generators, each controlling a specific image region, and a discriminative expert, e.g., a visual language model. This decomposes a complex text prompt into simpler components and incorporates knowledge from any discriminative model that defines an image distribution but does not permit direct sampling.
Expert Instantiation. An LLM parses the input prompt into a set of region-specific prompts with associated bounding boxes. We use FLUX.1-[dev] as the regional generators and VQAScore as the discriminative expert.
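One simple way to realize region-controlled composition, which we assume here purely for illustration (the actual instantiation may differ), is to apply each regional expert's score inside its bounding-box mask:

import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "image" of 64 pixels split into two regions, each governed by its
# own Gaussian expert (stand-ins for the region-prompted generators).
mask_a = np.zeros(64); mask_a[:32] = 1.0   # left region
mask_b = 1.0 - mask_a                      # right region

def score_a(x):   # expert A prefers pixel value +1.0 in its region
    return -(x - 1.0)

def score_b(x):   # expert B prefers pixel value -1.0 in its region
    return -(x + 1.0)

x, step = rng.standard_normal(64), 0.02
for _ in range(400):
    s = mask_a * score_a(x) + mask_b * score_b(x)   # masked score composition
    x += 0.5 * step * s + np.sqrt(step) * rng.standard_normal(64)

print(x[:32].mean(), x[32:].mean())   # ~ +1.0 and ~ -1.0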

Particle 1

Particle 2

Particle 3

Particle 4

"Five ants are carrying biscuits, an ant is standing on a leaf directing them."

Particles throughout sampling are visualized below; the reward assigned by the discriminative expert is shown under each image. Starting from early iterations (top rows), particles are updated with MCMC steps and then selected or discarded according to the rewards, until reaching the target distribution (bottom row).

0.449 (discarded)

0.291 (discarded)

0.590 (selected)

0.472 (selected)

0.565 (discarded)

0.531 (discarded)

0.706 (selected)

0.695 (selected)

0.869 (discarded)

0.914 (selected)

0.914 (discarded)

0.899 (discarded)

The target distribution is the product of two generative experts (blue and orange contour plots) and one discriminative expert (assigning reward 1 for samples within the red circle and 0 outside). For each iteration t, MCMC steps are followed by a resampling step (shown in the last frame of each video) that reweights particles based on rewards.

t=5

t=4

t=3

t=2

t=1
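A minimal sketch of the reward-based resampling step (the circle center, radius, and weight floor are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Particles after the MCMC steps of some iteration t (toy placeholder).
x = rng.standard_normal((512, 2))

# Discriminative expert: reward 1 inside the red circle, 0 outside.
def reward(p, center=np.array([0.5, 0.0]), radius=1.0):
    return (np.linalg.norm(p - center, axis=-1) < radius).astype(float)

w = reward(x) + 1e-8          # small floor so weights never all vanish
w /= w.sum()
idx = rng.choice(len(x), size=len(x), p=w)   # multinomial resampling
x = x[idx]                    # high-reward particles duplicated, rest discarded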

Physics-Simulator-Instructed Video Editing

Task Setup. We introduce a variant of the sim-to-video generation task from above in which input simulations specify content for full scenes rather than for foreground objects only. Expert setup and instantiation are the same as in the linear-annealing section.

Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video

Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video


Input Image

Ours

DDIM Inversion

PhysGen3D Simulation

Depth-to-Video

Image-to-Video

BibTeX
@article{zhang2025poe,
  title     = {Product of Experts for Visual Generation},
  author    = {Yunzhi Zhang and Carson Murtuza-Lanier and Zizhang Li and Yilun Du and Jiajun Wu},
  year      = {2025},
  journal   = {arXiv preprint arXiv:2506.08894},
}