
Five diffusion papers worth reading today (May 22, 2026)
Twelve diffusion-model preprints arrived in the May 21–22 window. The five selected (RiT, Bernini, VDT, Noise Schedule Design, and SEGA) share a structural move: each identifies a default assumption in the existing toolkit and replaces it with a geometrically grounded alternative. RiT achieves FID 1.14 on ImageNet 256×256 with classifier-free guidance, using a vanilla DiT on frozen DINOv2 features (676M params, open code). Bernini (ByteDance) proposes a clean division of labor between an MLLM and a DiT for unified video generation and editing. VDT derives the generative process from discrete-time stochastic optimal control, yielding simulation-free training and straight transport paths. Noise Schedule Design applies optimal control to schedule selection and proves Õ(d/n) sampling error bounds for schedules used in practice. SEGA is a training-free resolution extrapolation method that scales RoPE attention per frequency band based on the latent's spectral energy.

Research brief
1. RiT: vanilla DiT beats deeper architectures in the right representation space
- FID 1.45 on ImageNet 256×256 without classifier-free guidance (CFG) [1]
- FID 1.14 with CFG at 676M parameters, beating the 839M-parameter DiT^DH-XL [1]
- FID 2.0 at just 5 Heun steps, without any distillation [1]
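The 5-step figure refers to Heun's method, a second-order ODE solver that spends two model evaluations per step. As a rough illustration of what one such step does, here is a generic sketch. It is not RiT's sampler: the `velocity` callable, the uniform time grid, and the toy linear ODE standing in for a learned model are all assumptions made for the example.

```python
import numpy as np

def heun_sample(velocity, x, t_start, t_end, num_steps):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with Heun's method."""
    ts = np.linspace(t_start, t_end, num_steps + 1)  # uniform grid (an assumption)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        d_cur = velocity(x, t_cur)            # slope at the current point
        x_euler = x + h * d_cur               # Euler predictor
        d_next = velocity(x_euler, t_next)    # slope at the predicted point
        x = x + 0.5 * h * (d_cur + d_next)    # trapezoidal corrector
    return x

# Toy stand-in for a learned model: dx/dt = -x, exact solution x(t) = x0 * exp(-t).
def toy_velocity(x, t):
    return -x

x0 = np.ones(4)
x_heun = heun_sample(toy_velocity, x0, 0.0, 1.0, num_steps=5)
```

On this toy problem five Heun steps already land within about 0.003 of the exact value exp(-1), which gives some intuition for why a second-order solver can get by with few steps when the underlying trajectory is smooth.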
"We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features." [1]
2. Bernini: semantic planning meets pixel rendering for unified video diffusion
"We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features." [2]
3. Generative modeling by value-driven transport
- Conditional generation is a constrained control problem, with the conditioning signal entering as a constraint on the target measure.
- Classifier-free guidance becomes a standard reward-shaping operation within the control objective.
- Unpaired data-to-data translation is a straightforward extension to general source/target pairs, not a special case requiring separate formulation. [3]
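For reference, classifier-free guidance in its usual diffusion form combines a conditional and an unconditional prediction as shown below. This is the standard formulation, not VDT's notation; the symbols (noise predictor $\epsilon_\theta$, condition $c$, null condition $\varnothing$, guidance weight $w$) follow common usage and are assumptions relative to the paper, whose reward-shaping expression is not reproduced here.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

Under this convention $w = 0$ gives the unconditional prediction, $w = 1$ gives the plain conditional prediction, and $w > 1$ extrapolates past it.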
"We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges." [3]
4. Noise schedule design as an optimal control problem
"While existing theoretical work also prove that Õ(d/n) sampling error bounds are achievable, these results hold for specific noise schedules, which do not include the schedules used in practice." [4]
- Theoretical: sufficient conditions on noise schedules under which the Õ(d/n) sampling error bound holds. These conditions are shown to be satisfied by the exponential and sigmoid schedules used in practice, not just by theoretical constructions.
- Practical: under parametric assumptions on the noise schedule family, the optimal control problem yields closed-form expressions for the optimal schedule. These expressions generalize the exponential and sigmoid families with a set of tunable parameters, providing a principled design space rather than a grid of hand-picked candidates. [4]
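To make the two families named above concrete, here is a minimal sketch of an exponential and a sigmoid schedule for the noise variance of a variance-preserving process. The parameterizations and the names `exponential_schedule`, `sigmoid_schedule`, `rate`, `start`, and `end` are illustrative assumptions, not the paper's closed-form expressions.

```python
import numpy as np

def exponential_schedule(t, rate=5.0):
    """Noise variance sigma_t^2 = 1 - exp(-rate * t) for t in [0, 1] (assumed form)."""
    return 1.0 - np.exp(-rate * np.asarray(t, dtype=float))

def sigmoid_schedule(t, start=-6.0, end=6.0):
    """Noise variance sigma_t^2 = sigmoid(start + (end - start) * t) (assumed form)."""
    z = start + (end - start) * np.asarray(t, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

t = np.linspace(0.0, 1.0, 11)
sigma2_exp = exponential_schedule(t)
sigma2_sig = sigmoid_schedule(t)

# Signal variance under the variance-preserving convention: alpha_t^2 = 1 - sigma_t^2.
alpha2_exp = 1.0 - sigma2_exp
alpha2_sig = 1.0 - sigma2_sig
```

Both curves increase monotonically in t. In this toy parameterization the tunable knobs (`rate`, `start`, `end`) play the role of the hand-picked grid that the paper's closed-form family is meant to replace.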
5. SEGA: frequency-aware RoPE scaling for resolution extrapolation
"We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step." [5]
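The "spatial-frequency structure" of a latent can be illustrated with a radial band decomposition of its 2D power spectrum. The sketch below only measures the fraction of energy per band; the function name `band_energy_fractions`, the equal-width band edges, and the random latent are assumptions made for the example, and how SEGA maps such energies to per-component RoPE scales is not covered here.

```python
import numpy as np

def band_energy_fractions(latent, num_bands=4):
    """Fraction of spectral energy in equal-width radial frequency bands.

    latent: array of shape (channels, height, width).
    """
    _, h, w = latent.shape
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2  # per-channel power spectrum
    power = power.sum(axis=0)                                # pool over channels
    fy = np.fft.fftfreq(h)[:, None]                          # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)                      # radial frequency of each FFT bin
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    # Assign each bin to a band; clip so the maximum radius lands in the last band.
    band = np.clip(np.digitize(radius, edges) - 1, 0, num_bands - 1)
    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / energy.sum()

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))   # stand-in for a diffusion latent
fractions = band_energy_fractions(latent)
```

Recomputing such fractions at each denoising step would give a per-step, per-band signal of the kind the paper describes, since the latent's high-frequency content changes as noise is removed.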
Quick reference
| Paper | Core contribution | Peer-review status | Code |
|---|---|---|---|
| RiT (2605.21981) | Frozen DINOv2 features + vanilla DiT achieves FID 1.14 on ImageNet 256 with 676M params | Preprint | GitHub (checkpoints included) |
| Bernini (2605.22344) | MLLM semantic planner + DiT pixel renderer, unified via ViT embedding interface; SA-3D RoPE | Preprint | Not confirmed |
| VDT (2605.22507) | Measure transport as LP + primal-dual algorithm → straight transport paths; simulation-free | Preprint | Not confirmed |
| Noise Schedule (2605.21911) | Optimal control formulation for noise schedule design; closed-form family generalizing exp/sigmoid | Preprint | Not confirmed |
| SEGA (2605.22668) | Per-step, per-frequency-band RoPE scaling driven by latent spectral energy; training-free | Preprint | Project page live |
Sources
- [1] RiT: Vanilla Diffusion Transformers Suffice in Representation Space (arXiv 2605.21981)
- [2] Bernini: Latent Semantic Planning for Video Diffusion (arXiv 2605.22344)
- [3] Generative Modeling by Value-Driven Transport (arXiv 2605.22507)
- [4] Noise Schedule Design for Diffusion Models: An Optimal Control Perspective (arXiv 2605.21911)
- [5] SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers (arXiv 2605.22668)