Five diffusion papers worth reading today (May 22, 2026)

Twelve diffusion-model preprints arrived in the May 21–22 window. The five selected — RiT, Bernini, VDT, Noise Schedule Design, and SEGA — all share a structural move: each identifies a default assumption in the existing toolkit and replaces it with a geometrically grounded alternative. RiT achieves FID 1.14 on ImageNet 256×256 with a vanilla DiT and frozen DINOv2 features (676M params, open code). Bernini (ByteDance) proposes a clean MLLM/DiT division-of-labor for unified video generation and editing. VDT derives the generative process from discrete-time stochastic optimal control, yielding simulation-free training and straight transport paths. Noise Schedule Design applies optimal control to schedule selection, proving Õ(d/n) convergence bounds. SEGA is a training-free resolution extrapolation method that scales RoPE attention per frequency band based on the latent's spectral energy.

ArXiv Diffusion Models Digest
2026/5/22 · 22:27
Research note

Twelve diffusion-model preprints landed in the ~24-hour window ending May 22. The five below share a structural move: each one starts by identifying a default assumption baked into the existing toolkit — about representation spaces, semantic planning, generative formalisms, noise schedule theory, or attention scaling — and replaces it with a construction grounded in the actual geometry of the problem. Different scopes, same diagnostic impulse.

1. RiT: vanilla DiT beats deeper architectures in the right representation space

ArXiv: 2605.21981 | Zhang, Le; Mang, Ning; Agrawal, Aishwarya | cs.CV
Peer-review status: Preprint (submitted 2026-05-21). Author institutions unconfirmed — sub-48h paper, ArXiv HTML page pending indexing at time of writing.
The standard way to improve a diffusion transformer is to make the architecture more complex: deeper heads, cross-attention refinements, specialized decoders. RiT asks whether that is the right lever. Its central claim: when the representation space is well-conditioned for flow matching, a completely vanilla DiT suffices, with no architectural modifications at all. 1
The authors compare pixel space, SD-VAE latents, and frozen DINOv2 features (DINOv2 is Meta's self-supervised Vision Transformer trained via self-distillation with no labels) across four geometric axes. DINOv2 comes out ahead on all four: 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error — despite near-identical intrinsic dimensionality (d̂ ≈ 33 across spaces). The diagnosis is that representation-learning objectives, not compression, create favorable geometry. SD-VAE compresses aggressively but preserves too much high-frequency pixel structure; DINOv2's self-supervised objective smooths that away. 1
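Three of the four diagnostics above can be approximated on any feature matrix. The sketch below is a minimal illustration, not the paper's measurement code. It assumes the entropy-based definition of effective rank, the eigenvalue ratio of the sample covariance as the conditioning measure, and per-dimension standardized fourth moments for excess kurtosis. The name `geometry_report` and these specific definitions are assumptions; the paper may define each axis differently, and the interpolation-error axis is omitted.

```python
import numpy as np

def geometry_report(features):
    """Rough geometry diagnostics for a (num_samples, dim) feature matrix.

    Assumed definitions (not taken from the paper):
      effective_rank   = exp(entropy of normalized covariance eigenvalues)
      condition_number = largest / smallest covariance eigenvalue
      excess_kurtosis  = mean standardized fourth moment minus 3
    """
    x = features - features.mean(axis=0, keepdims=True)
    cov = x.T @ x / (len(x) - 1)
    # Clip tiny or negative eigenvalues so log() and the ratio stay finite.
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    p = eig / eig.sum()
    z = x / x.std(axis=0, keepdims=True)
    return {
        "effective_rank": float(np.exp(-(p * np.log(p)).sum())),
        "condition_number": float(eig.max() / eig.min()),
        "excess_kurtosis": float((z ** 4).mean() - 3.0),
    }
```

Running it on features from two candidate encoders over the same image set gives a like-for-like comparison. An isotropic Gaussian would score an effective rank near its dimension, a condition number near 1, and an excess kurtosis near 0.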
RiT adds just two augmentations to the vanilla DiT: a dimension-aware noise schedule (accounting for the non-isotropic covariance of DINOv2 features) and joint [CLS]-patch modeling. Nothing else changes. The result at 676M parameters:
  • FID 1.45 on ImageNet 256×256 without classifier-free guidance 1
  • FID 1.14 with CFG, beating DiT^DH-XL at 839M parameters 1
  • FID 2.0 at just 5 Heun steps, without any distillation 1
"We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features." 1
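To make "x-prediction" and "5 Heun steps" concrete, here is a minimal sketch of a Heun (second-order) sampler driven by a model that predicts the clean sample. It assumes a linear path x_t = (1 - t) * noise + t * data with noise at t = 0, which may differ from the paper's convention. The names `velocity_from_x_pred` and `heun_sample` are hypothetical, and the final step falls back to Euler because the velocity conversion divides by 1 - t.

```python
def velocity_from_x_pred(x_pred_fn, x, t):
    """Convert a clean-sample prediction into a velocity.

    Under the assumed path x_t = (1 - t) * noise + t * data,
    dx/dt = data - noise = (data - x_t) / (1 - t).
    """
    x_hat = x_pred_fn(x, t)
    return [(xh - xi) / (1.0 - t) for xh, xi in zip(x_hat, x)]

def heun_sample(x_pred_fn, x_start, num_steps=5):
    """Integrate from t = 0 (noise) to t = 1 (data) with Heun's method."""
    x = list(x_start)
    ts = [i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        v = velocity_from_x_pred(x_pred_fn, x, t)
        x_euler = [xi + dt * vi for xi, vi in zip(x, v)]
        if t_next >= 1.0:
            # The conversion is singular at t = 1, so the last step is plain Euler.
            x = x_euler
        else:
            # Heun correction: average the slopes at both ends of the step.
            v_next = velocity_from_x_pred(x_pred_fn, x_euler, t_next)
            x = [xi + 0.5 * dt * (vi + vn) for xi, vi, vn in zip(x, v, v_next)]
    return x
```

With a real model, `x_pred_fn` would wrap the network's forward pass. Each Heun step except the last costs two model evaluations, so 5 steps correspond to 9 evaluations under this scheme.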
Code/resources: Full implementation and ImageNet 256×256 checkpoints available at github.com/lezhang7/RiT. 1
Why read it: The FID 1.14 result with 19% fewer parameters than the prior state of the art is the headline, but the more durable contribution is the diagnostic framework — four geometric axes for evaluating whether a representation space is suitable for flow matching. That framework applies equally to future representation choices: any self-supervised encoder, any VQ-VAE codebook, any foundation model feature space. The open code and checkpoints mean the four-axis analysis can be reproduced and extended without retraining from scratch.

2. Bernini: semantic planning meets pixel rendering for unified video diffusion

ArXiv: 2605.22344 | ByteDance | cs.CV
Peer-review status: Preprint (submitted 2026-05-21). Full author list not yet confirmed — sub-48h paper. Institutional affiliation: ByteDance. 2
Current pipelines treat video generation and video editing as separate problems, with different architectures, different training pipelines, and no shared interface. Bernini argues that the underlying models are fine and that the division of labor is what is wrong. The proposal: use a multimodal large language model (MLLM) for semantic planning and a DiT for pixel rendering, with the ViT embedding space as the interface between them. 2
"We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features." 2
The MLLM planner operates in ViT embedding space, predicting target semantic representations for the scene. The DiT renderer then denoises VAE latents conditioned on those representations plus raw visual features from the input. Crucially, the two components are trained separately with only lightweight co-training at the end — the semantic interface decouples their objectives. 2
Two specific technical contributions carry the implementation. First, Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) handles multiple visual inputs — reference frames, masks, partial observations — within a single sequence without the positional ambiguity that arises when frames from different sources share a 3D position space. Second, the MLLM planner uses chain-of-thought reasoning to produce structured intermediate representations before committing to a final plan, which the authors report improves handling of complex multi-step editing operations. 2
The paper reports state of the art across "a wide range of video generation and editing benchmarks"; specific metrics are in the full paper, which was not yet accessible at time of writing.
Code/resources: No public repository or project page confirmed at time of writing.
Why read it: The modular design has practical implications beyond the benchmarks. Training the MLLM planner and DiT renderer separately means each component can leverage its respective pretraining ecosystem without interference — future improvements in either MLLMs or video DiTs propagate into Bernini by component swap. The SA-3D RoPE mechanism for handling heterogeneous visual inputs is independently useful for any architecture that needs to process mixed-source frame sequences.

3. Generative modeling by value-driven transport

ArXiv: 2605.22507 | Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu | cs.LG / stat.ML
Peer-review status: Preprint (submitted 2026-05-21). Gergely Neu is associated with Universitat Pompeu Fabra based on prior work; institutional affiliation unconfirmed for this paper.
Diffusion models, flow matching, and Schrödinger bridges all derive the generative process from an SDE or ODE formulation. VDT (Value-Driven Transport) takes a different starting point: discrete-time stochastic optimal control. The question it answers is not "what SDE transports noise to data?" but "what control policy minimizes the cost of transporting a source measure to a target measure?" 3
The formulation casts measure transport as a linear program. The dual variables of that LP correspond to the optimal value function of a stochastic control problem, which encodes the optimal policy for moving probability mass from source to target. The authors develop a simulation-free primal-dual algorithm for computing approximately optimal value functions — simulation-free meaning the algorithm does not require rollouts through the forward process during training, which is one of the computational bottlenecks in Schrödinger bridge methods. 3
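For orientation, a generic finite-horizon version of such a linear program can be written as follows. This is the textbook occupancy-measure LP for a controlled Markov chain with a terminal marginal constraint, not the paper's exact formulation. The symbols (occupancy measures \mu_t, cost c, transition kernel P, source \nu_0, target \nu_1, dual variables V_t) are notation assumed here.

```latex
\begin{aligned}
\textbf{Primal:}\quad \min_{\mu \ge 0}\ & \sum_{t=0}^{T-1}\sum_{x,a} \mu_t(x,a)\,c(x,a) \\
\text{s.t.}\ & \sum_a \mu_0(x,a) = \nu_0(x) && \forall x, \\
& \sum_a \mu_{t+1}(x',a) = \sum_{x,a} P(x' \mid x,a)\,\mu_t(x,a) && \forall x',\ t = 0,\dots,T-2, \\
& \sum_{x,a} P(x' \mid x,a)\,\mu_{T-1}(x,a) = \nu_1(x') && \forall x'. \\[6pt]
\textbf{Dual:}\quad \max_{V_0,\dots,V_T}\ & \sum_x \nu_0(x)\,V_0(x) - \sum_{x'} \nu_1(x')\,V_T(x') \\
\text{s.t.}\ & V_t(x) \le c(x,a) + \sum_{x'} P(x' \mid x,a)\,V_{t+1}(x') && \forall x,a,\ t = 0,\dots,T-1.
\end{aligned}
```

The dual constraint is a Bellman inequality, which is why the dual variables can be read as value functions. In this generic form, V_T acts as a terminal potential that enforces the target marginal.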
The resulting VDT policies produce straight transport paths between source and target distributions, similar in spirit to rectified flows but derived from control theory rather than from continuous normalizing flow intuitions. Several practical desiderata follow from the control framing:
  • Conditional generation is a constrained control problem, with the conditioning signal entering as a constraint on the target measure.
  • Classifier-free guidance becomes a standard reward-shaping operation within the control objective.
  • Unpaired data-to-data translation is a straightforward extension to general source/target pairs, not a special case requiring separate formulation. 3
"We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges." 3
Specific benchmark results are described as "strong performance and good potential for scalability"; quantitative tables are in the full paper.
Code/resources: No public repository confirmed at time of writing.
Why read it: VDT is primarily a theoretical contribution. Its audience is researchers who care about the foundations of generative modeling: why straight transport paths arise, why simulation-free training is achievable, and whether the control-theoretic lens generates genuinely new algorithmic ideas or just reproves existing results from a different angle. The LP / primal-dual construction answers that last question. It gives a principled route to approximate VDT policies that is computationally distinct from score matching or continuous normalizing flows, and the straight-path property follows from the optimal control objective without a separate rectification step.

4. Noise schedule design as an optimal control problem

ArXiv: 2605.21911 | Authors unconfirmed (sub-48h paper) | cs.LG
Peer-review status: Preprint (submitted 2026-05-21). Author list and institutional affiliation unavailable at time of writing.
Noise schedules — the functions that control how quickly noise is added in the forward process — are central to diffusion model performance. In practice they are designed empirically: cosine, linear, sigmoid variants are tested on downstream FID and the best-performing schedule is kept. The theoretical literature offers sampling error bounds of the form Õ(d/n) (d = data dimension, n = discretization steps), but as this paper points out, those bounds hold only for noise schedules that no practitioner actually uses. 4
"While existing theoretical work also prove that Õ(d/n) sampling error bounds are achievable, these results hold for specific noise schedules, which do not include the schedules used in practice." 4
The paper closes this gap by reformulating noise schedule design as an optimal control problem. The state is the Fisher information of the diffusion process at each time step; the control input is the noise schedule itself. The objective — a functional of Fisher information — is an upper bound on the KL divergence between the true and approximate sampling distributions, so minimizing it directly minimizes sampling error. 4
From this formulation, the authors derive two results:
  • Theoretical: sufficient conditions on noise schedules under which the Õ(d/n) sampling error bound holds — and these conditions are shown to be satisfied by the exponential and sigmoid schedules used in practice, not just by theoretical constructions.
  • Practical: under parametric assumptions on the noise schedule family, the optimal control problem yields closed-form expressions for the optimal schedule. These expressions generalize the exponential and sigmoid families with a set of tunable parameters, providing a principled design space rather than a grid of hand-picked candidates. 4
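The paper's closed-form family is not reproduced here. As a generic illustration of a parametric schedule family, the sketch below describes a variance-preserving schedule by its log-SNR curve, using the identity that alpha^2 = sigmoid(log-SNR) whenever alpha^2 + sigma^2 = 1. The linear log-SNR curve and its two endpoints `lam_max` and `lam_min` are assumptions chosen for simplicity, not the parameters derived in the paper.

```python
import math

def vp_coefficients(log_snr):
    """Signal and noise scales for a variance-preserving process.

    With SNR = alpha^2 / sigma^2 = exp(log_snr) and alpha^2 + sigma^2 = 1,
    alpha^2 equals sigmoid(log_snr).
    """
    alpha_sq = 1.0 / (1.0 + math.exp(-log_snr))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)

def linear_log_snr(t, lam_max=10.0, lam_min=-10.0):
    """Hypothetical two-parameter family: log-SNR falls linearly in t."""
    return lam_max + t * (lam_min - lam_max)

def schedule(num_steps, log_snr_fn=linear_log_snr):
    """(alpha, sigma) pairs at num_steps + 1 evenly spaced times in [0, 1]."""
    return [vp_coefficients(log_snr_fn(i / num_steps)) for i in range(num_steps + 1)]
```

Swapping `log_snr_fn` for another curve changes the schedule without touching the rest of the code, which is the sense in which a parametric family replaces a grid of hand-picked candidates.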
On image generation benchmarks, the new schedules achieve superior FID compared to standard exponential and sigmoid baselines; dataset and exact metric values are in the full paper.
Code/resources: No public repository confirmed at time of writing.
Why read it: Noise schedule design sits at the intersection of theory and practice in a way that few diffusion topics do. The practical schedules (exponential, sigmoid) work, but their sampling error has lacked a theoretical guarantee; the theoretically analyzed schedules come with guarantees, but nobody uses them. This paper gives conditions under which the practical schedules attain the Õ(d/n) bound and derives better alternatives from first principles rather than from search. For teams running systematic ablations on noise schedule choices, the closed-form parametric family provides a structured, low-dimensional space to explore instead of a sparse grid.

5. SEGA: frequency-aware RoPE scaling for resolution extrapolation

ArXiv: 2605.22668 | Authors unconfirmed (sub-48h paper) | cs.CV
Peer-review status: Preprint (submitted 2026-05-21). Author list unavailable; project page lead: rajabi2001. 5
DiTs trained at a fixed resolution generate poorly when asked to sample at higher resolutions: global structure collapses, fine details blur, and artifacts appear at tile boundaries. The standard training-free fix is RoPE extrapolation with attention scaling: stretch the positional encoding to cover the larger resolution and scale down the attention logits to prevent attention collapse. The problem is that this scaling is applied uniformly across all RoPE frequency components, treating low-frequency components (responsible for global structure) and high-frequency components (responsible for fine detail) identically. The result is a fixed trade-off: whatever scaling factor you choose over-extrapolates one end and under-extrapolates the other. 5
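The difference between RoPE components shows up in their wavelengths. The sketch below uses the standard 1D RoPE frequency formula theta_i = base^(-2i/d) and lists which rotary pairs have a wavelength longer than the training length, meaning they never completed a full rotation during training. Treating one axis in isolation, the default base of 10000, and the helper names are assumptions for illustration; real DiTs typically apply RoPE per spatial axis with their own dimension split.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength, in token positions, of each rotary pair (shortest first).

    Standard RoPE uses theta_i = base ** (-2 * i / head_dim), so the
    wavelength of pair i is 2 * pi / theta_i.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def pairs_longer_than_training(head_dim, train_len, base=10000.0):
    """Indices of rotary pairs whose wavelength exceeds the training length."""
    wavelengths = rope_wavelengths(head_dim, base)
    return [i for i, w in enumerate(wavelengths) if w > train_len]
```

Pairs below the cutoff index wrap around many times within the training range and mostly encode local offsets. Pairs above it vary slowly across the whole sequence, which is one reason a single scaling factor treats the two groups unevenly.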
SEGA (Spectral-Energy Guided Attention) breaks this constraint by making the scaling content-adaptive and per-step. At each denoising step, the method analyzes the spatial-frequency energy distribution of the current latent and computes a separate scaling factor for each RoPE frequency band based on that distribution. Low-frequency components — which carry global structure information — get a different scaling from high-frequency components — which carry local detail. The scaling adapts across steps as the latent transitions from noisy to structured. 5
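The first half of that step, measuring how the latent's energy is spread across spatial frequencies, can be sketched with a 2D FFT. The code below is an illustration under stated assumptions and not SEGA's implementation: the radial binning, the number of bands, and the name `band_energy_fractions` are all hypothetical. The paper's rule for turning these fractions into per-band RoPE scaling factors is not reproduced here.

```python
import numpy as np

def band_energy_fractions(latent, num_bands=4):
    """Share of spectral energy in each radial frequency band, low to high.

    latent: array of shape (channels, height, width).
    The equal-width radial bands are an assumed binning, not the paper's.
    """
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2
    power = power.sum(axis=0)  # pool channels -> (height, width)
    h, w = power.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)  # distance from the zero frequency
    edges = np.linspace(0.0, radius.max() + 1e-12, num_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, num_bands - 1)
    energy = np.array([power[band == b].sum() for b in range(num_bands)])
    return energy / energy.sum()
```

Early in sampling the latent is close to white noise, so energy is spread across all bands. As structure forms, the low bands take a growing share, which is the signal a per-step method can react to.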
"We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step." 5
No retraining or fine-tuning is required. The method outperforms state-of-the-art training-free baselines across multiple target resolutions; specific metrics are in the full paper.
Code/resources: Project page at rajabi2001.github.io/sega. No code repository confirmed at time of writing.
Why read it: Training-free methods have the shortest path from paper to deployment — no retraining budget required, no checkpoint storage, plug directly into the existing inference loop. SEGA's core insight is that RoPE frequency components carry different semantic content and should be scaled differently, which is a simple observation with direct practical consequences. The project page is live, suggesting the authors are prepared for early adoption and feedback. For anyone generating images above their model's training resolution today, this is the most immediately applicable paper in today's batch.

Quick reference

Paper | Core contribution | Peer-review status | Code
RiT (2605.21981) | Frozen DINOv2 features + vanilla DiT achieves FID 1.14 on ImageNet 256 with 676M params | Preprint | GitHub (checkpoints included)
Bernini (2605.22344) | MLLM semantic planner + DiT pixel renderer, unified via ViT embedding interface; SA-3D RoPE | Preprint | Not confirmed
VDT (2605.22507) | Measure transport as LP + primal-dual algorithm → straight transport paths; simulation-free | Preprint | Not confirmed
Noise Schedule (2605.21911) | Optimal control formulation for noise schedule design; closed-form family generalizing exp/sigmoid | Preprint | Not confirmed
SEGA (2605.22668) | Per-step, per-frequency-band RoPE scaling driven by latent spectral energy; training-free | Preprint | Project page live

The common thread: RiT rewrites the question "which architecture?" as "which representation space?". VDT and the noise schedule paper recast two design choices, the generative formalism and the noise schedule, as optimal control problems, with closed-form solutions in the noise schedule case. Bernini's ViT embedding interface treats the MLLM/diffusion boundary as an engineering interface rather than a coupling problem. SEGA replaces a single scalar attention-scaling factor with per-band factors derived from the latent's actual spectral content. In each case the gain comes from reducing an arbitrary design choice to a problem with a principled answer.
