Part A: The Power of Diffusion Models

In this part, we write our own sampling loops using the DeepFloyd IF diffusion model to denoise input images, and generate new images, including image-to-image translations, inpainting/edits, visual anagrams, and hybrid images.

Sampling Loops

Implementing the Forward Process

We first begin by implementing the forward process for our diffusion model, which takes a clean image, scales it, and adds Gaussian noise to it according to the noise schedule.
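
As a rough sketch (assuming the schedule's cumulative alpha products are available under the illustrative name alphas_cumprod, e.g. from the model's scheduler), the forward process can be written as:

import torch

def forward_process(im, t, alphas_cumprod):
    """Noise a clean image im to timestep t.

    alphas_cumprod: 1-D tensor of cumulative products of the schedule's alphas
    (an assumption about how the schedule is exposed).
    """
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
    return x_t, eps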

Figure: the original Campanile image and noised versions at noise levels 250, 500, and 750.

Classical Denoising

We first try to denoise the image via Gaussian blurring:
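
A minimal sketch of this baseline, using torchvision's Gaussian blur (the kernel size and sigma here are illustrative choices):

import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Low-pass filtering removes some noise, but also removes image detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)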

Figure: the original Campanile image and Gaussian-blur denoising results at noise levels 250, 500, and 750.

One-Step Denoising

We now use the pre-trained diffusion model to denoise the image with text conditioning, using the prompt embedding "a high quality photo".
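
A sketch of the one-step estimate, assuming the stage-1 UNet is called through the diffusers interface and that its extra predicted-variance channels are dropped (both assumptions here):

import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    alpha_bar = alphas_cumprod[t]
    # The UNet predicts the noise that was added to the clean image.
    eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert the forward process in a single step to estimate x_0.
    return (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()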

Figure: the original Campanile image and one-step denoising results at noise levels 250, 500, and 750.

Iterative Denoising

We now iteratively denoise the image using strided timesteps, starting from timestep 1000 and stepping down with a stride of 30.
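
Each update interpolates between the current noisy image and the clean-image estimate, following the standard DDPM formula over a strided step (names are illustrative; the small added-noise term is omitted here for brevity):

import torch

@torch.no_grad()
def iterative_denoise_step(x_t, t, t_next, x0_est, alphas_cumprod):
    """One strided update from timestep t to the less-noisy timestep t_next."""
    a_bar_t, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
    alpha = a_bar_t / a_bar_next  # effective alpha over the stride
    beta = 1 - alpha
    # Weighted combination of the clean-image estimate and the current image.
    x_next = (a_bar_next.sqrt() * beta / (1 - a_bar_t)) * x0_est \
           + (alpha.sqrt() * (1 - a_bar_next) / (1 - a_bar_t)) * x_t
    return x_next  # the full update also adds a small variance term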

Figure: intermediate results of the iterative denoising loop at noise levels 90, 240, 540, and 690.
Figure: the original image compared against the iteratively denoised, one-step denoised, and Gaussian-blurred Campanile.

Diffusion Model Sampling

We now use iterative denoising to generate images from pure noise. Below are some results:

Figure: five images sampled from pure noise.

Classifier-Free Guidance (CFG)

In order to generate better-quality images, we implement classifier-free guidance (CFG), using a null prompt for the unconditional noise estimate.
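
The guided noise estimate pushes the conditional prediction away from the unconditional one; a sketch (the guidance scale gamma and the UNet calling convention are illustrative):

import torch

@torch.no_grad()
def cfg_noise_estimate(x_t, t, unet, cond_embeds, null_embeds, gamma=7.0):
    """Classifier-free guidance: eps = eps_uncond + gamma * (eps_cond - eps_uncond)."""
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_uncond = unet(x_t, t, encoder_hidden_states=null_embeds).sample[:, :3]
    return eps_uncond + gamma * (eps_cond - eps_uncond)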

Figure: five images sampled with CFG.

Image-to-image Translation

For this part, we take the original image of the Campanile, add noise to it, and then denoise it with the iterative CFG loop, starting from differing noise levels. This functions as a way to "edit" the original image.
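
In short, the edit is "noise, then re-denoise"; a sketch using the hypothetical helpers from the sections above:

import torch

@torch.no_grad()
def sdedit(im, i_start, strided_timesteps, forward_process, iterative_denoise):
    """Noise the original image up to strided_timesteps[i_start], then run the
    usual iterative CFG denoising loop from that point onward."""
    t = strided_timesteps[i_start]
    x_t, _ = forward_process(im, t)                 # partially destroy the image
    return iterative_denoise(x_t, i_start=i_start)  # let the model "edit" it back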

Figures: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside each original image.

Editing Hand-Drawn and Web Images

We can also apply this process to hand-drawn and web images, with results shown below:

Figures: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside each original hand-drawn or web image.

Inpainting

We can use the same procedure to implement inpainting by taking advantage of image masks. Below are some results:
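
After every denoising step, the region outside the mask is reset to an appropriately noised copy of the original image, so only the masked hole is generated. A sketch (mask is 1 where new content should be created):

import torch

@torch.no_grad()
def inpaint_step(x_t, t, x_orig, mask, forward_process):
    """Force the unmasked region to agree with the (noised) original image."""
    x_orig_noised, _ = forward_process(x_orig, t)
    return mask * x_t + (1 - mask) * x_orig_noised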

Figures: original image, mask, hole to fill, and inpainted result.

Text-Conditional Image-to-image Translation

We now return to image-to-image translation, except that we guide the model with custom prompt embeddings rather than simply "a high quality photo". Below are some results:

"A rocket ship" with an image of the Campanille:

Figure: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside the original image.

"An oil-painting of a snowy village" and a hand-drawn thumbnail of a hill:

Figure: SDEdit results with i_start = 1, 3, 5, 7, 10, and 20, alongside the original image.
h4>"A pencil" and a photo of a blocky flower:
SDEdit with i_start=1 SDEdit with i_start=3 SDEdit with i_start=5 SDEdit with i_start=7 SDEdit with i_start=10 SDEdit with i_start=20 Original Image

Visual Anagrams

Now, we will alter our current model to generate visual anagrams, where given two prompts, one prompt is visible when the image is rightside-up, and the other prompt is visible when the image is upside-down. Below are some results:
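
The trick is to average two noise estimates at every step: one for the first prompt on the image as-is, and one for the second prompt on the vertically flipped image (flipped back before averaging). A sketch, with the same UNet-call assumptions as before:

import torch

@torch.no_grad()
def anagram_noise_estimate(x_t, t, unet, embeds_a, embeds_b):
    eps_a = unet(x_t, t, encoder_hidden_states=embeds_a).sample[:, :3]
    eps_b = unet(torch.flip(x_t, dims=[-2]), t,
                 encoder_hidden_states=embeds_b).sample[:, :3]
    # Flip the second estimate back before averaging the two.
    return (eps_a + torch.flip(eps_b, dims=[-2])) / 2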

"An Oil Painting of People Around a Campfire" An Oil Painting of an Old Man
"An Oil Painting of an Old Man" "An Oil Painting of a Snowy Mountain Village"
"An Oil Painting of People Around a Campfire" "An Oil Painting of a Snowy Mountain Village"

Hybrid Images

Finally, we will alter the model so that it can generate hybrid images, where an image appears to be one prompt from up close, and another from far away. Below are some results:
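
Here the two noise estimates are combined in the frequency domain: low frequencies from one prompt, high frequencies from the other. A sketch (the blur kernel size and sigma are illustrative):

import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(x_t, t, unet, embeds_far, embeds_near,
                          kernel_size=33, sigma=2.0):
    eps_far = unet(x_t, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_near = unet(x_t, t, encoder_hidden_states=embeds_near).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high  # low frequencies dominate from afar, high frequencies up close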

"A Lithograph of a waterfall" "An Oil Painting of People Around a Campfire" "An Oil Painting of a Snowy Mountain Village"
"A Lithograph of a Skull" "A Lithograph of a Skull" "An Oil Painting of an Old Man"

Diffusion Models From Scratch!

In this part, we train our own diffusion models from scratch on the MNIST dataset, starting with a single-step denoising U-Net and then extending it into a time-conditioned and class-conditioned diffusion model (DDPM).

Training a Single-Step Diffusion Model

Implementing the UNet

We begin by setting up our unconditional U-Net based on the following architecture, with its operations defined below:


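As a simplified illustration of the kind of building block used in the diagram (the exact operations follow the figure above), a basic convolutional block might look like:

import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> normalization -> nonlinearity, preserving spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)
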
Using the UNet to Train a Denoiser

Once we have our U-Net set up, we can use it to solve the single-step denoising problem: given an image corrupted with a certain level of noise, recover the clean image.

We aim to denoise the MNIST dataset. We noise images using the equation z = x + σε, where z is the noisy image, x is the original image, ε is Gaussian-distributed random noise (ε ~ N(0, I)), and σ is a scalar that determines how much noise to add.
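
In code, the noising operation is a one-liner:

import torch

def add_noise(x, sigma):
    # z = x + sigma * epsilon, with epsilon ~ N(0, I)
    return x + sigma * torch.randn_like(x)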

Pictured below is the noising process for the following sigma values: 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0.

Figure: sample MNIST digits noised at each sigma value above.

Training

We now train the U-Net to denoise images with sigma=0.5. We use a batch size of 256, 128 hidden dimensions, and train for 5 epochs.
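
A sketch of the training loop under these settings (the UNet model itself, the optimizer, and the learning rate shown here are placeholders / illustrative):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, num_epochs=5, batch_size=256, device="cuda"):
    data = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative learning rate
    for _ in range(num_epochs):
        for x, _labels in loader:                        # labels unused for denoising
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)          # noise the clean batch
            loss = torch.nn.functional.mse_loss(model(z), x)  # predict the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()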

Below is the loss curve, along with some results at epochs 1 and 5 respectively.

Figure: training loss curve, with denoising results after epoch 1 and epoch 5.

Out-of-Distribution Testing

We now test how well the model denoises images with sigma values other than 0.5. Below are denoising results on images noised with the following sigma values: 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0.

Figure: denoising results on test digits noised at each sigma value above.

Training a Diffusion Model

Adding Time Conditioning to UNet

We will now train a U-Net to iteratively denoise an image, implementing DDPM.

We will incorporate time-conditioning by injecting scalar t values into our U-Net using the following architecture, with new operations defined as below:


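A sketch of how the scalar t can be injected: a small fully connected block maps the (normalized) timestep to a per-channel vector that modulates an intermediate feature map. Names and the exact injection points are illustrative; the real placement follows the architecture figure.

import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a scalar conditioning signal to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)

# Inside the U-Net's forward pass (illustrative):
#   t = t.view(-1, 1)                                  # normalized timestep in [0, 1]
#   feat = feat + self.t_block(t)[..., None, None]     # broadcast over H and W
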
Training the UNet

We will now train the U-Net with respect to the following algorithm:

We train for 20 epochs with a batch size of 128 and 64 hidden channels, using MSE loss and an Adam optimizer with a learning rate of 1e-3. We also use an exponential learning rate decay scheduler with a gamma of 0.1^(1 / num_epochs). Pictured below is the training loss curve for the model:


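As a rough sketch, one iteration of the training algorithm above might look like the following (T, the unet(x_t, t) signature, and how the schedule is stored are assumptions here):

import torch
import torch.nn.functional as F

def ddpm_train_step(unet, x0, alphas_cumprod, opt, T=300, device="cuda"):
    """Sample a random timestep, noise the clean batch to it, and train the
    UNet to predict the injected noise."""
    x0 = x0.to(device)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
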
Sampling from the UNet

Once the U-Net is trained, we can sample from it in accordance with the algorithm below:

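A rough sketch of that sampling loop (again, T and the exact schedule variables are assumptions):

import torch

@torch.no_grad()
def ddpm_sample(unet, betas, alphas, alphas_cumprod, T=300,
                shape=(1, 1, 28, 28), device="cuda"):
    """Start from pure noise and apply the reverse update from t = T down to 1."""
    x = torch.randn(shape, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a, a_bar, b = alphas[t - 1], alphas_cumprod[t - 1], betas[t - 1]
        t_in = torch.full((shape[0],), t / T, device=device)
        eps = unet(x, t_in)
        # Posterior mean plus noise: the standard DDPM reverse step.
        x = (x - (b / (1 - a_bar).sqrt()) * eps) / a.sqrt() + b.sqrt() * z
    return x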

Below are some results from sampling the model at epochs 5 and 20, respectively.


Adding Class-Conditioning to UNet

To improve our results and allow for more control over image generation, we add class conditioning to the model. In this case, we will condition the U-Net on the classes 0 through 9 (one class per digit).

This requires adding two more fully connected blocks to our architecture to accommodate c, which can be either a single label or a vector of labels (one-hot encoded). We apply dropout to the c vector with a rate of 0.1, zeroing it out so that the model also learns an unconditional mode (needed later for classifier-free guidance).

The training algorithm is similar to the one seen in time-conditioning, except we take the labels into consideration.
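
A sketch of the class-conditioned training step, extending the time-conditioned one (the unet(x_t, t, c) signature and schedule handling are assumptions):

import torch
import torch.nn.functional as F

def class_cond_train_step(unet, x0, y, alphas_cumprod, opt,
                          T=300, p_uncond=0.1, device="cuda"):
    x0, y = x0.to(device), y.to(device)
    c = F.one_hot(y, num_classes=10).float()
    # Zero out the class vector with probability p_uncond so the model
    # also learns an unconditional mode (needed for CFG at sampling time).
    drop = (torch.rand(c.shape[0], 1, device=device) < p_uncond).float()
    c = c * (1 - drop)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T, c), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()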

We keep the same parameters for training. Below is the loss curve for the model across 20 epochs:


Sampling from the Class-Conditioned UNet

We now sample from the model, using classifier-free guidance with a gamma value of 0.5.
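
At each sampling step, the class-conditional and unconditional noise estimates are combined exactly as in Part A, with the zero class vector playing the role of the null prompt; a sketch (UNet signature assumed as above):

import torch

@torch.no_grad()
def class_cfg_eps(unet, x_t, t, c_onehot, gamma):
    eps_cond = unet(x_t, t, c_onehot)
    eps_uncond = unet(x_t, t, torch.zeros_like(c_onehot))
    # Classifier-free guidance combination.
    return eps_uncond + gamma * (eps_cond - eps_uncond)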

Pictured below is the algorithm used for sampling:

Pictured below are results from sampling at epochs 5 and 20 respectively.