Explain It Like I'm Five: Diffusion transformers

What is a diffusion transformer?

It’s a combination of two concepts, diffusion and transformers, that together makes for better-performing AI image and video generators.

Okay, so what is diffusion?

It’s how most modern AI media generators are trained. Noise, like static or image grain, is added to images or other pieces of media until they’re completely obscured. An AI trained this way learns how to reverse that process, which in turn teaches it which details are important to keep, so it can produce more accurate, more detailed output.
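
To make that a little more concrete, here is a minimal sketch of the idea in PyTorch: blend a clean image with random noise, then train a network to guess the noise that was added. The toy 8x8 “image,” the stand-in denoiser, and the simple blending schedule are all illustrative assumptions, not taken from any real generator.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 1, 8, 8)       # a toy 8x8 grayscale "image"
noise = torch.randn_like(image)      # the static we mix in
t = torch.rand(1, 1, 1, 1)           # how far along the noising process we are (0 = clean, 1 = pure noise)

# Forward process: blend the clean image with noise until it's obscured.
noisy_image = (1 - t) * image + t * noise

# A stand-in denoiser; real models are vastly larger.
denoiser = nn.Sequential(nn.Flatten(), nn.Linear(64, 64))

# Training objective: guess the noise that was added, and measure the error.
predicted_noise = denoiser(noisy_image).view_as(noise)
loss = nn.functional.mse_loss(predicted_noise, noise)
loss.backward()   # repeat over millions of examples and the model learns to "un-noise"
```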

And a transformer?

Used in large language models like Gemini and ChatGPT, transformers process an input all at once, instead of word by word in a sentence or pixel by pixel in an image. By ‘seeing’ everything at the same time, they can use context to work out what is relevant and what is noise. This is simpler than other approaches, and it lets models process more information, faster.
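
Here is a rough sketch of the self-attention step at the heart of a transformer, with the learned projections real models use stripped out for brevity; the tokens and sizes are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

tokens = torch.randn(1, 5, 16)   # 5 tokens (words, or image patches), each a 16-number vector

# Every token compares itself against every other token at the same time...
scores = tokens @ tokens.transpose(-2, -1) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)   # ...and turns those comparisons into attention weights.

# Each token then absorbs information from the tokens it found most relevant.
context_aware_tokens = weights @ tokens   # same shape as before, but every token has now "seen" all the others
```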

Why combine them?

Most diffusion-based models rely on something called a U-Net to estimate how much noise needs to be removed. U-Nets are effective, but they’re also complex and made up of multiple specialized modules, which can slow models down. Replacing the U-Net with a transformer can make these models run more efficiently, and the time and computing power saved can go toward producing higher-quality output.
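
Below is a hedged sketch of what that swap looks like: cut the noisy image into patches, treat each patch as a token, and let an off-the-shelf transformer layer, rather than a U-Net, process all the patches together. The patch size, dimensions, and single layer are placeholders, not the actual architecture of Sora or Stable Diffusion.

```python
import torch
import torch.nn as nn

noisy_image = torch.randn(1, 3, 32, 32)   # a noisy 32x32 RGB image

# 1) Patchify: cut the image into 8x8 patches and flatten each one into a token.
patches = noisy_image.unfold(2, 8, 8).unfold(3, 8, 8)   # (1, 3, 4, 4, 8, 8)
tokens = patches.reshape(1, 3, 16, 64).permute(0, 2, 1, 3).reshape(1, 16, 192)

# 2) A stock transformer layer stands in for the U-Net: every patch attends to every other patch.
block = nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True)
processed_tokens = block(tokens)

# 3) Un-patchify back to image shape; in a real model this would be the predicted noise to remove.
predicted_noise = (processed_tokens.reshape(1, 4, 4, 3, 8, 8)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(1, 3, 32, 32))
```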

Who is using them?

Diffusion transformers are a relatively new concept in AI, first proposed by NYU professor Saining Xie in 2022. But we’ve recently gotten two (pretty impressive) examples of what they look like in practice: OpenAI’s Sora video generator, and the latest version of the Stable Diffusion image generator.