Part A: The Power of Diffusion Models

In this part, we write our own sampling loops using the DeepFloyd IF diffusion model to denoise input images, and generate new images, including image-to-image translations, inpainting/edits, visual anagrams, and hybrid images.

Sampling Loops

Implementing the Forward Process

We first begin by implementing the forward process for our diffusion model, which takes a clean image, scales it, and adds Gaussian noise to it according to the noise schedule.
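
As a rough sketch (assuming the schedule's cumulative alpha products are available under the illustrative name alphas_cumprod, e.g. from the model's scheduler), the forward process can be written as:

import torch

def forward_process(im, t, alphas_cumprod):
    """Noise a clean image im to timestep t.

    alphas_cumprod: 1-D tensor of cumulative products of the schedule's alphas
    (an assumption about how the schedule is exposed).
    """
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
    return x_t, eps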

Figure: the original Campanile image and noised versions at noise levels 250, 500, and 750.

Classical Denoising

We first try to denoise the image via Gaussian blurring:
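
A minimal sketch of this baseline, using torchvision's Gaussian blur (the kernel size and sigma here are illustrative choices):

import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Low-pass filtering removes some noise, but also removes image detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)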

Figure: the original Campanile image and Gaussian-blur denoising results at noise levels 250, 500, and 750.

One-Step Denoising

We now use the pre-trained diffusion model to denoise the image with text conditioning, using the prompt embedding "a high quality photo".
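
A sketch of the one-step estimate, assuming the stage-1 UNet is called through the diffusers interface and that its extra predicted-variance channels are dropped (both assumptions here):

import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    alpha_bar = alphas_cumprod[t]
    # The UNet predicts the noise that was added to the clean image.
    eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert the forward process in a single step to estimate x_0.
    return (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()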

Figure: the original Campanile image and one-step denoising results at noise levels 250, 500, and 750.

Iterative Denoising

We now iteratively denoise the image using strided timesteps, starting from timestep 1000 and stepping down with a stride of 30.
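
Each update interpolates between the current noisy image and the clean-image estimate, following the standard DDPM formula over a strided step (names are illustrative; the small added-noise term is omitted here for brevity):

import torch

@torch.no_grad()
def iterative_denoise_step(x_t, t, t_next, x0_est, alphas_cumprod):
    """One strided update from timestep t to the less-noisy timestep t_next."""
    a_bar_t, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
    alpha = a_bar_t / a_bar_next  # effective alpha over the stride
    beta = 1 - alpha
    # Weighted combination of the clean-image estimate and the current image.
    x_next = (a_bar_next.sqrt() * beta / (1 - a_bar_t)) * x0_est \
           + (alpha.sqrt() * (1 - a_bar_next) / (1 - a_bar_t)) * x_t
    return x_next  # the full update also adds a small variance term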

Figure: intermediate results of the iterative denoising loop at noise levels 90, 240, 540, and 690.
Figure: the original image compared against the iteratively denoised, one-step denoised, and Gaussian-blurred Campanile.

Diffusion Model Sampling

We now use iterative denoising to generate images from pure noise. Below are some results:

Figure: five images sampled from pure noise.

Classifier-Free Guidance (CFG)

In order to generate better-quality images, we implement classifier-free guidance (CFG), using a null prompt for the unconditional noise estimate.
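
The guided noise estimate pushes the conditional prediction away from the unconditional one; a sketch (the guidance scale gamma and the UNet calling convention are illustrative):

import torch

@torch.no_grad()
def cfg_noise_estimate(x_t, t, unet, cond_embeds, null_embeds, gamma=7.0):
    """Classifier-free guidance: eps = eps_uncond + gamma * (eps_cond - eps_uncond)."""
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_uncond = unet(x_t, t, encoder_hidden_states=null_embeds).sample[:, :3]
    return eps_uncond + gamma * (eps_cond - eps_uncond)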

Figure: five images sampled with CFG.

Image-to-image Translation

For this part, we take the original image of the Campanile, add noise to it, and then denoise it with the iterative CFG loop, starting from differing noise levels. This functions as a way to "edit" the original image.
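
In short, the edit is "noise, then re-denoise"; a sketch using the hypothetical helpers from the sections above:

import torch

@torch.no_grad()
def sdedit(im, i_start, strided_timesteps, forward_process, iterative_denoise):
    """Noise the original image up to strided_timesteps[i_start], then run the
    usual iterative CFG denoising loop from that point onward."""
    t = strided_timesteps[i_start]
    x_t, _ = forward_process(im, t)                 # partially destroy the image
    return iterative_denoise(x_t, i_start=i_start)  # let the model "edit" it back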

Figures: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside each original image.

Editing Hand-Drawn and Web Images

We can also apply this process to hand-drawn and web images, with results shown below:

Figures: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside each original hand-drawn or web image.

Inpainting

We can use the same procedure to implement inpainting by taking advantage of image masks. Below are some results:
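
After every denoising step, the region outside the mask is reset to an appropriately noised copy of the original image, so only the masked hole is generated. A sketch (mask is 1 where new content should be created):

import torch

@torch.no_grad()
def inpaint_step(x_t, t, x_orig, mask, forward_process):
    """Force the unmasked region to agree with the (noised) original image."""
    x_orig_noised, _ = forward_process(x_orig, t)
    return mask * x_t + (1 - mask) * x_orig_noised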

Figures: original image, mask, hole to fill, and inpainted result.

Text-Conditional Image-to-image Translation

We now return to image-to-image translation, except that we guide the model with custom prompt embeddings rather than simply "a high quality photo". Below are some results:

"A rocket ship" with an image of the Campanille:

Figure: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside the original image.

"An oil-painting of a snowy village" and a hand-drawn thumbnail of a hill:

Figure: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside the original image.
h4>"A pencil" and a photo of a blocky flower:
SDEdit with i_start=1 SDEdit with i_start=3 SDEdit with i_start=5 SDEdit with i_start=7 SDEdit with i_start=10 SDEdit with i_start=20 Original Image

Visual Anagrams

Now, we will alter our current model to generate visual anagrams, where given two prompts, one prompt is visible when the image is rightside-up, and the other prompt is visible when the image is upside-down. Below are some results:
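
The trick is to average two noise estimates at every step: one for the first prompt on the image as-is, and one for the second prompt on the vertically flipped image (flipped back before averaging). A sketch, with the same UNet-call assumptions as before:

import torch

@torch.no_grad()
def anagram_noise_estimate(x_t, t, unet, embeds_a, embeds_b):
    eps_a = unet(x_t, t, encoder_hidden_states=embeds_a).sample[:, :3]
    eps_b = unet(torch.flip(x_t, dims=[-2]), t,
                 encoder_hidden_states=embeds_b).sample[:, :3]
    # Flip the second estimate back before averaging the two.
    return (eps_a + torch.flip(eps_b, dims=[-2])) / 2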

"An Oil Painting of People Around a Campfire" An Oil Painting of an Old Man
"An Oil Painting of an Old Man" "An Oil Painting of a Snowy Mountain Village"
"An Oil Painting of People Around a Campfire" "An Oil Painting of a Snowy Mountain Village"

Hybrid Images

Finally, we will alter the model so that it can generate hybrid images, where an image appears to be one prompt from up close, and another from far away. Below are some results:
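
Here the two noise estimates are combined in the frequency domain: low frequencies from one prompt, high frequencies from the other. A sketch (the blur kernel size and sigma are illustrative):

import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(x_t, t, unet, embeds_far, embeds_near,
                          kernel_size=33, sigma=2.0):
    eps_far = unet(x_t, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_near = unet(x_t, t, encoder_hidden_states=embeds_near).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high  # low frequencies dominate from afar, high frequencies up close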

"A Lithograph of a waterfall" "An Oil Painting of People Around a Campfire" "An Oil Painting of a Snowy Mountain Village"
"A Lithograph of a Skull" "A Lithograph of a Skull" "An Oil Painting of an Old Man"

Diffusion Models From Scratch!

In this part, we train our own diffusion models from scratch on the MNIST dataset, starting with a single-step denoising U-Net and then extending it into a time-conditioned and class-conditioned diffusion model (DDPM).

Training a Single-Step Diffusion Model

Implementing the UNet

We begin by setting up our unconditional U-Net based on the following architecture, with its operations defined below:


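As a simplified illustration of the kind of building block used in the diagram (the exact operations follow the figure above), a basic convolutional block might look like:

import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> normalization -> nonlinearity, preserving spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)
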
Using the UNet to Train a Denoiser

Once we have our U-Net set up, we can use it to solve the single-step denoising problem: given an image corrupted with a certain level of noise, recover the clean image.

We aim to denoise the MNIST dataset. We noise images using the equation z = x + σε, where z is the noisy image, x is the original image, ε is Gaussian-distributed random noise (ε ~ N(0, I)), and σ is a scalar that determines how much noise to add.
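
In code, the noising operation is a one-liner:

import torch

def add_noise(x, sigma):
    # z = x + sigma * epsilon, with epsilon ~ N(0, I)
    return x + sigma * torch.randn_like(x)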

Pictured below is the noising process for the following sigma values: 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0.

Figure: sample MNIST digits noised at each sigma value above.

Training

We now train the U-Net to denoise images with sigma=0.5. We use a batch size of 256, 128 hidden dimensions, and train for 5 epochs.
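
A sketch of the training loop under these settings (the UNet model itself, the optimizer, and the learning rate shown here are placeholders / illustrative):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, num_epochs=5, batch_size=256, device="cuda"):
    data = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative learning rate
    for _ in range(num_epochs):
        for x, _labels in loader:                        # labels unused for denoising
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)          # noise the clean batch
            loss = torch.nn.functional.mse_loss(model(z), x)  # predict the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()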

Below is the loss curve, along with some results at epochs 1 and 5 respectively.

Figure: training loss curve, with denoising results after epoch 1 and epoch 5.

Out-of-Distribution Testing

We now test how well the model denoises images with sigma values other than 0.5. Below are denoising results on images noised with the following sigma values: 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0.

Figure: denoising results on test digits noised at each sigma value above.

Training a Diffusion Model

Adding Time Conditioning to UNet

We will now train a U-Net to iteratively denoise an image, implementing DDPM.

We will incorporate time-conditioning by injecting scalar t values into our U-Net using the following architecture, with new operations defined as below:


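A sketch of how the scalar t can be injected: a small fully connected block maps the (normalized) timestep to a per-channel vector that modulates an intermediate feature map. Names and the exact injection points are illustrative; the real placement follows the architecture figure.

import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a scalar conditioning signal to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)

# Inside the U-Net's forward pass (illustrative):
#   t = t.view(-1, 1)                                  # normalized timestep in [0, 1]
#   feat = feat + self.t_block(t)[..., None, None]     # broadcast over H and W
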
Training the UNet

We will now train the U-Net with respect to the following algorithm:

We train for 20 epochs with a batch size of 128 and 64 hidden channels, using MSE loss and an Adam optimizer with a learning rate of 1e-3. We also use an exponential learning rate decay scheduler with a gamma of 0.1^(1 / num_epochs). Pictured below is the training loss curve for the model:


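As a rough sketch, one iteration of the training algorithm above might look like the following (T, the unet(x_t, t) signature, and how the schedule is stored are assumptions here):

import torch
import torch.nn.functional as F

def ddpm_train_step(unet, x0, alphas_cumprod, opt, T=300, device="cuda"):
    """Sample a random timestep, noise the clean batch to it, and train the
    UNet to predict the injected noise."""
    x0 = x0.to(device)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
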
Sampling from the UNet

Once the U-Net is trained, we can sample from it in accordance with the algorithm below:

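A rough sketch of that sampling loop (again, T and the exact schedule variables are assumptions):

import torch

@torch.no_grad()
def ddpm_sample(unet, betas, alphas, alphas_cumprod, T=300,
                shape=(1, 1, 28, 28), device="cuda"):
    """Start from pure noise and apply the reverse update from t = T down to 1."""
    x = torch.randn(shape, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a, a_bar, b = alphas[t - 1], alphas_cumprod[t - 1], betas[t - 1]
        t_in = torch.full((shape[0],), t / T, device=device)
        eps = unet(x, t_in)
        # Posterior mean plus noise: the standard DDPM reverse step.
        x = (x - (b / (1 - a_bar).sqrt()) * eps) / a.sqrt() + b.sqrt() * z
    return x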

Below are some results from sampling the model at epochs 5 and 20, respectively.


Adding Class-Conditioning to UNet

To improve our results and allow for more control over image generation, we add class conditioning to the model. In this case, we will condition the U-Net on the classes 0 through 9 (one class per digit).

This requires adding two more fully connected blocks to our architecture to accommodate c, which can be either a single label or a vector of labels (one-hot encoded). We apply dropout to the c vector with a rate of 0.1, zeroing it out so that the model also learns an unconditional mode (needed later for classifier-free guidance).

The training algorithm is similar to the one seen in time-conditioning, except we take the labels into consideration.
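
A sketch of the class-conditioned training step, extending the time-conditioned one (the unet(x_t, t, c) signature and schedule handling are assumptions):

import torch
import torch.nn.functional as F

def class_cond_train_step(unet, x0, y, alphas_cumprod, opt,
                          T=300, p_uncond=0.1, device="cuda"):
    x0, y = x0.to(device), y.to(device)
    c = F.one_hot(y, num_classes=10).float()
    # Zero out the class vector with probability p_uncond so the model
    # also learns an unconditional mode (needed for CFG at sampling time).
    drop = (torch.rand(c.shape[0], 1, device=device) < p_uncond).float()
    c = c * (1 - drop)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T, c), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()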

We keep the same parameters for training. Below is the loss curve for the model across 20 epochs:


Sampling from the Class-Conditioned UNet

We now sample from the model, using classifier-free guidance with a gamma value of 0.5.
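
At each sampling step, the class-conditional and unconditional noise estimates are combined exactly as in Part A, with the zero class vector playing the role of the null prompt; a sketch (UNet signature assumed as above):

import torch

@torch.no_grad()
def class_cfg_eps(unet, x_t, t, c_onehot, gamma):
    eps_cond = unet(x_t, t, c_onehot)
    eps_uncond = unet(x_t, t, torch.zeros_like(c_onehot))
    # Classifier-free guidance combination.
    return eps_uncond + gamma * (eps_cond - eps_uncond)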

Pictured below is the algorithm used for sampling:

Pictured below are results from sampling at epochs 5 and 20 respectively.