Content-Aware Dynamic Patchification for Efficient Video Diffusion

Abstract

Diffusion Transformers (DiTs) achieve strong video generation performance but suffer from prohibitive computational cost due to dense spatiotemporal tokenization. Most existing works rely on uniform patchification, tokenizing non-overlapping spatiotemporal regions with a fixed patch size regardless of the underlying content. This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. A lightweight router predicts patch sizes directly from the latents encoded by a 3D Variational Autoencoder (VAE), and is jointly optimized with the diffusion model through a diffusion loss, an attention-guided saliency alignment loss, and a token-budget regularizer. Learnable patchify and unpatchify layers integrate seamlessly with standard DiT backbones, allowing flexible tokenization without architectural changes. Experiments demonstrate that DynaPatch effectively reduces redundant computation while preserving fine details, achieving 1.3-1.8x acceleration with minimal quality degradation.

Design

Overall Workflow

Encode the video into spatiotemporal latents. The 3D VAE encoder produces the latent representation that serves as the routing input.
Predict a region-wise patch-size map. A lightweight router processes the noisy latent and selects patch sizes for different spatiotemporal regions based on content complexity.
Patchify adaptively and run DiT denoising. Patchify layers convert each region into tokens at the selected granularity, and the resulting tokens are processed by standard DiT blocks.
Unpatchify and decode the final video. Unpatchify layers restore the latent grid to its original resolution before the 3D VAE decoder reconstructs the output video.
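The patchify and unpatchify steps above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not from the source: the latent is a single-channel 2D block, mean pooling stands in for the learnable patchify layer, and nearest-neighbor broadcasting stands in for the learnable unpatchify layer. The function names `patchify_region` and `unpatchify_region` are hypothetical.

```python
def patchify_region(block, p):
    """Mean-pool an n x n block into (n // p) ** 2 tokens.

    Assumption: mean pooling replaces the learnable patchify layer.
    """
    n = len(block)
    if n % p != 0:
        raise ValueError(f"patch size {p} does not divide region size {n}")
    tokens = []
    for i in range(0, n, p):
        for j in range(0, n, p):
            cells = [block[a][b] for a in range(i, i + p) for b in range(j, j + p)]
            tokens.append(sum(cells) / len(cells))
    return tokens


def unpatchify_region(tokens, n, p):
    """Broadcast each token over its p x p footprint to restore an n x n grid.

    Assumption: broadcasting replaces the learnable unpatchify layer.
    """
    side = n // p  # tokens per row of the region
    return [[tokens[(a // p) * side + (b // p)] for b in range(n)] for a in range(n)]


# A hypothetical 4 x 4 latent region that is constant within each 2 x 2 quadrant.
block = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
fine = patchify_region(block, 2)    # 4 tokens, one per quadrant
coarse = patchify_region(block, 4)  # 1 token for the whole region
restored = unpatchify_region(fine, 4, 2)
```

In this toy region, patch size 2 loses nothing because each quadrant is constant, while patch size 4 collapses the region to a single average. This is the kind of trade-off the router is meant to resolve per region.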

This workflow preserves the original DiT backbone while allocating more computation to important or dynamic regions and fewer tokens to redundant areas.
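To make the token allocation concrete, the following sketch counts tokens for a patch-size map and compares it with uniform patchification. It assumes a spatial-only grid of 4 x 4 regions, candidate patch sizes of 1, 2, and 4, and a uniform baseline with patch size 2. None of these values come from the source, and `count_tokens` is a hypothetical helper.

```python
from typing import List


def count_tokens(patch_map: List[List[int]], region: int = 4) -> int:
    """Count tokens for a grid of region x region blocks.

    patch_map[i][j] is the patch size chosen for block (i, j); a block with
    patch size p contributes (region // p) ** 2 tokens.
    """
    total = 0
    for row in patch_map:
        for p in row:
            if region % p != 0:
                raise ValueError(f"patch size {p} does not divide region {region}")
            total += (region // p) ** 2
    return total


# Hypothetical 3 x 3 map: one detailed region gets the finest patches (1),
# two moderately complex regions get 2, and flat background gets 4.
patch_map = [
    [1, 2, 4],
    [4, 4, 4],
    [4, 4, 2],
]
dynamic = count_tokens(patch_map)                     # 16 + 4 + 4 + 6 * 1
uniform = count_tokens([[2] * 3 for _ in range(3)])   # 9 regions * 4 tokens
reduction = 1 - dynamic / uniform
print(f"dynamic={dynamic} uniform={uniform} reduction={reduction:.1%}")
```

Note that a single finest-granularity region costs as many tokens as sixteen coarse ones, so the savings depend on how much of the video the router judges to be simple.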

Overview of the DynaPatch inference workflow.

Training the Router

1. Diffusion loss. Jointly trains the router with the DiT backbone so routing decisions remain compatible with high-quality denoising.
2. Attention guidance. Aligns fine patch assignments with semantically important regions indicated by the model's attention maps.
3. Token-budget loss. Regularizes the routed token count toward a target compute budget instead of collapsing to all-fine patches.
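One possible form of the token-budget term is sketched below. The source does not give the actual formulation, so everything here is an assumption: the router is taken to emit per-region logits over candidate patch sizes, the expected token count is computed under a softmax, and the penalty is the squared gap between the expected token ratio (relative to all-fine patches) and a target ratio. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def expected_tokens(region_logits, tokens_per_choice):
    """Expected token count summed over regions, under the router's softmax."""
    total = 0.0
    for logits in region_logits:
        probs = softmax(logits)
        total += sum(pr * t for pr, t in zip(probs, tokens_per_choice))
    return total


def token_budget_loss(region_logits, tokens_per_choice, target_ratio):
    """Squared gap between the expected token ratio and the target ratio.

    Assumption: the ratio is measured against the all-fine token count.
    """
    full = len(region_logits) * max(tokens_per_choice)
    ratio = expected_tokens(region_logits, tokens_per_choice) / full
    return (ratio - target_ratio) ** 2


# Two regions, two choices: fine (4 tokens) or coarse (1 token).
tokens_per_choice = [4, 1]
undecided = [[0.0, 0.0], [0.0, 0.0]]    # 50/50 per region -> ratio 0.625
all_fine = [[10.0, -10.0], [10.0, -10.0]]  # ratio close to 1.0

loss_undecided = token_budget_loss(undecided, tokens_per_choice, 0.5)
loss_all_fine = token_budget_loss(all_fine, tokens_per_choice, 0.5)
```

Under this sketch, a router that drifts toward all-fine patches incurs a larger penalty than an undecided one, which is the behavior the token-budget loss is described as enforcing.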

Results

Token Reduction	Method	Total Score	Quality Score	Semantic Score	Speedup
0%	Baseline	83.61	84.87	78.59	1.0x
20%	DynaPatch	83.56	84.79	78.62	1.3x
30%	DynaPatch	83.42	84.68	78.36	1.5x
40%	DynaPatch	82.19	83.92	75.29	1.8x

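Comparison with the baseline on VBench. At 20% and 30% token reduction, DynaPatch stays within 0.2 points of the baseline Total Score at 1.3x and 1.5x speedup; at 40% reduction, the Total Score drops by 1.42 points (Semantic Score by 3.30) at 1.8x speedup.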
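As a rough sanity check on the speedups in the table, the sketch below compares them with two idealized bounds. The cost model is an assumption, not something stated in the source: if runtime scaled linearly with token count (as for MLP layers), keeping a fraction k of tokens would give a 1/k speedup; if it scaled quadratically (as for full self-attention), the speedup would be 1/k^2. Router, VAE, and other fixed overheads are ignored.

```python
def speedup_bounds(token_reduction):
    """Return (linear, quadratic) idealized speedups for a token reduction."""
    keep = 1.0 - token_reduction
    return 1.0 / keep, 1.0 / keep ** 2


# Token reduction -> speedup, copied from the results table.
reported = {0.20: 1.3, 0.30: 1.5, 0.40: 1.8}

for reduction, speedup in reported.items():
    linear, quadratic = speedup_bounds(reduction)
    print(
        f"{reduction:.0%} fewer tokens: linear {linear:.2f}x, "
        f"quadratic {quadratic:.2f}x, reported {speedup}x"
    )
```

Each reported speedup falls between the linear and quadratic bounds, which is consistent with a workload that mixes attention with per-token layers.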

Visualization

Generated Videos

Prompt

A highly detailed portrait of a porcelain donkey being covered by thick slime. Cinematic, highly detailed, film grade.

Prompt

A low angle hyper realistic shot of a thick purple goo flowing quickly down a white marble staircase. Cinematic, highly detailed, film grade.

Prompt

A cinematic documentary hand held close up of a woman standing in a busy Italian plaza smirking to herself, the background soft and out of focus, diffused overhead lighting. Her skin has freckles and small creases, her hair is down and a bit messy. Muted colors, diffused cinematic lighting, cool color grade.

Prompt

An ultra-fast first-person POV hyper-lapse rapidly speeding through a forest fire into a snow capped mountain.

Prompt

A rabbit made of liquid gold with a dark background, liquidify, cinematic, vfx.

Prompt

Mystical video of a cat dressed as astrologer performing colorful magic spells in front of an eager crowd of village mice, the mice watch in awe as the cat conjures sparkling constellations, floating orbs, and magical symbols in the air.