by Ziran Zhou
In this project, I followed the spec's instructions and background introduction to explore how diffusion models can be used to operate on and generate various images, and I also implemented the models myself.
In this first part, I worked with the DeepFloyd IF diffusion model to implement diffusion sampling iteratively and used it for inpainting and optical-illusion generation.
To start, I set up the pretrained model (DeepFloyd) as a two-stage diffusion model, with the first stage producing 64x64-pixel images and the second stage 256x256, which accounts for the difference in image quality between the stages. The model can then be sampled with different numbers of inference steps, which determine how many denoising steps are taken: the higher the number of inference steps, the higher the image quality.
Below are some images generated by stage 1 and stage 2 at 20 inference steps. We can see the quality and size difference between the stages, but also that with a limited number of inference steps, the complexity of the results is low.
Generated Images with Stage 1 Inference 20:
Generated Images with Stage 2 Inference 20:
Below I generated images from the same prompts over both stages but with 100 inference steps, and we can clearly see they are much more detailed.
Generated Images with Stage 1 Inference 100:
Generated Images with Stage 2 Inference 100:
Here, I use the pretrained DeepFloyd denoisers. I take a clean image \(x_0\) and iteratively add noise to it, in amounts determined by the coefficients \(\bar{\alpha}_t\) with \(T=1000\), producing increasingly noisy images \(x_t\). A diffusion model then reverses this process by predicting the noise and removing it: given \(x_t\), we predict the noise and subtract it, and by iterating we hope to recover something resembling the original clean image \(x_0\).
We sample the noisy image from the distribution \(q(x_t|x_0) = N(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)I)\), which is equivalent to computing \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\) where \(\epsilon \sim N(0, I)\).
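As a concrete reference, here is a minimal sketch of this forward process in PyTorch, assuming `alphas_cumprod` is a precomputed tensor of the \(\bar{\alpha}_t\) values (variable names are mine, not from the starter code):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise.

    x0:             clean image tensor
    t:              integer timestep in [0, T-1]
    alphas_cumprod: cumulative products alpha-bar_t (assumed precomputed)
    """
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)                       # epsilon ~ N(0, I)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps
```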
Below are noisy images generated under various values of \(t\).
Original Image
Noisy Image when \(t\)=250
Noisy Image when \(t\)=500
Noisy Image when \(t\)=750
I applied a Gaussian blur filter (kernel size 7, sigma 2) to the noisy images generated in the previous part as the conventional denoising baseline; the results are shown below:
Noisy Image when \(t\)=250
Gaussian Blurred Image when \(t\)=250
Noisy Image when \(t\)=500
Gaussian Blurred Image when \(t\)=500
Noisy Image when \(t\)=750
Gaussian Blurred Image when \(t\)=750
The results look like the noise has merely been blended into a smoother version rather than removed, which is why the conventional Gaussian-blur approach is not very effective.
Per the instructions, I then used DeepFloyd's pretrained UNet denoiser to estimate the Gaussian noise in an image at a given timestep \(t\) and remove that estimate in a single step, hoping to arrive at an image resembling the original, though some difference is expected since the noise estimate is not perfectly accurate. The model is text-conditioned, so I pass in the prompt "a high quality photo".
Noisy Image when \(t\)=250
One-Step Denoised Image when \(t\)=250
Noisy Image when \(t\)=500
One-Step Denoised Image when \(t\)=500
Noisy Image when \(t\)=750
One-Step Denoised Image when \(t\)=750
The results are much better than the previous conventional Gaussian-blur denoising, though the more noise there is in the input image, the more the denoised result differs from the original.
Below is a lineup of all the results from before.
Original Image
Noisy Image when \(t\)=250
Gaussian Blurred Image when \(t\)=250
One-Step Denoised Image when \(t\)=250
Original Image
Noisy Image when \(t\)=500
Gaussian Blurred Image when \(t\)=500
One-Step Denoised Image when \(t\)=500
Original Image
Noisy Image when \(t\)=750
Gaussian Blurred Image when \(t\)=750
One-Step Denoised Image when \(t\)=750
To improve denoising with the diffusion model, I followed the instructions and denoised iteratively, one step at a time, rather than all at once. Running every timestep up to \(T=1000\) is too computationally costly, so I skipped steps and used strided timesteps, starting at \(t=990\) (or \(t=999\)) and taking strides of 30 timesteps per step until \(t=0\), where we arrive at the clean denoised result. The other parameters are defined as required on the spec website.
The next (less noisy) image \(x_{t'}\) at timestep \(t' < t\) is computed from the current image \(x_t\) via the following update formula, where \(x_0\) is the current clean-image estimate and \(v_{\sigma}\) is added random noise (a code sketch follows): \[x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + v_{\sigma}\]
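In code, one strided update might look like the sketch below; `eps_pred` is the UNet's noise estimate at timestep \(t\) and the schedule tensors are assumed precomputed (names are illustrative):

```python
def denoise_step(xt, eps_pred, t, t_prime, alphas_cumprod, v_sigma):
    """One strided update from timestep t to an earlier timestep t' < t."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp          # effective alpha over the stride
    beta = 1 - alpha
    # Estimate the clean image by inverting the forward process.
    x0_est = (xt - (1 - abar_t).sqrt() * eps_pred) / abar_t.sqrt()
    # Blend the clean estimate with the current image per the update formula.
    return (abar_tp.sqrt() * beta / (1 - abar_t)) * x0_est \
         + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * xt \
         + v_sigma
```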
Below are the results of a purely noisy image being iteratively denoised from timestep 990, followed by a comparison of the iteratively denoised result with the images from previous parts.
\(t=30\)
\(t=90\)
\(t=240\)
\(t=390\)
\(t=540\)
\(t=690\)
Original Image
Iteratively Denoised Image
One-Step Denoised Image
Gaussian Blurred Denoised Image
We see that the iteratively denoised image is much more detailed than the one-step denoised image and resembles the original more in both structure and pixel detail, indicating that the iterative denoising process restores more detail.
Unlike the previous part, here I start from pure random noise and let the model denoise it, generating images from scratch; the results are shown below:
The generated images are of rather poor quality, though they do resemble realistic images we might see in real life.
In this subpart I implemented classifier-free guidance (CFG) denoising. I compute a conditional noise estimate \(\epsilon_c\) (with the usual prompt, "a high quality photo") and an unconditional noise estimate \(\epsilon_u\) (with the empty prompt ""); the guided noise estimate is then \(\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\).
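A minimal sketch of how the two estimates might be combined (`unet` stands in for the DeepFloyd stage-1 denoiser; the real pipeline's call signature differs):

```python
def cfg_noise(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: push the unconditional estimate toward
    the conditional one by a guidance scale gamma > 1 (7 is a common value)."""
    eps_c = unet(xt, t, cond_emb)    # prompt: "a high quality photo"
    eps_u = unet(xt, t, uncond_emb)  # prompt: ""
    return eps_u + gamma * (eps_c - eps_u)
```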
The CFG-generated samples are shown below. They look more vibrant and seem to present more detail than those from the previous part, and they resemble real-life images more closely.
Here I apply the SDEdit algorithm: add noise to an image, then force it back onto the natural image manifold through iterative denoising, without any conditioning. The parameter i_start indicates how far along the strided denoising schedule the process begins: since we subtract noise from the noisy input (in this case, test_im) incrementally across the strided steps, i_start determines how far into the schedule the iterative denoising starts. Thus, smaller values of i_start mean starting closer to \(T=1000\), i.e., with more noise in the input, and vice versa. I set the i_start values to [1, 3, 5, 7, 10, 20] and produced the results shown below, after a short sketch of the procedure:
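In sketch form, SDEdit just noises the input up to the strided timestep indexed by `i_start` and hands it to the iterative denoiser (`forward` is the noising routine sketched earlier; `iterative_denoise` stands for the loop from the previous part):

```python
def sdedit(x_orig, i_start, strided_timesteps, alphas_cumprod, iterative_denoise):
    """Project a noised image back onto the natural image manifold.

    Small i_start -> start near T (mostly noise, more hallucination);
    large i_start -> start near the clean image (more faithful result).
    """
    t = strided_timesteps[i_start]
    xt, _ = forward(x_orig, t, alphas_cumprod)  # add noise up to timestep t
    return iterative_denoise(xt, i_start)       # denoise from that point onward
```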
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Campanile Clean Image
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
San Francisco Sunset Clean Image
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Deer Clean Image
We can see from these three examples that the SDEdit approach works well, generating various degrees of restoration from near-pure noise.
I also ran SDEdit on a web image (Shai-Hulud), and it apparently works well, especially when the result stays close to the original image, i.e., at large i_start values.
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Dune Shai-Hulud Clean Image
I also ran SDEdit on hand-drawn images of a waxing moon with clouds, and of several palm trees with the Hollywood sign, as shown below.
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Hand Drawn Moon Image
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Hand Drawn Hollywood Image
Inpainting starts with an original image and a boolean binary mask \(\textbf{m}\) marking the regions I want to generate over: a new image is generated over the region where m=1, while the parts where m=0 are kept as they are. The model runs the iterative diffusion loop to obtain \(x_t\), and after each step we force \(x_t\) to have the same pixels as \(x_{\text{orig}}\) wherever m=0, via the formula: \[x_t \leftarrow \textbf{m}\, x_t + (1-\textbf{m})\, \text{forward}(x_{\text{orig}}, t)\] This gives the results shown below (I intended to inpaint over the buildings in the background of the Campanile, as shown); a sketch of the masking step follows.
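A sketch of the masking step applied after each denoising update (`mask` is the boolean tensor \(\textbf{m}\), broadcast over channels; `forward` is the noising routine from earlier):

```python
def inpaint_step(xt, x_orig, mask, t, alphas_cumprod):
    """Keep generated pixels where m = 1 and re-impose (noised) original
    pixels where m = 0, so only the masked region is synthesized."""
    noised_orig, _ = forward(x_orig, t, alphas_cumprod)
    return mask * xt + (1 - mask) * noised_orig
```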
Clean Original Image
Mask
To Replace (Inpaint Over)
Inpaint Result
Clean Original Image
Mask
To Replace (Inpaint Over)
Inpaint Result
Clean Original Image
Mask
To Replace (Inpaint Over)
Inpaint Result
Now we can change the prompt from "a high quality photo" to another one from the prompt dictionary, guiding the projection down a path aligned with the features the text prompt describes. The results are below:
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Campanile Original Image
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Arthur Sword Original Image
SDEdit (i_start=1)
SDEdit (i_start=3)
SDEdit (i_start=5)
SDEdit (i_start=7)
SDEdit (i_start=10)
SDEdit (i_start=20)
Whale Original Image
To implement optical-illusion generation (visual anagrams), I denoise the image \(x_t\) with one prompt to compute \(\epsilon_1 = \text{UNet}(x_t, t, p_1)\), and I also flip \(x_t\) upside down and denoise it with another prompt, computing \(\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))\). Because the inner UNet estimate is made on the flipped image, the outer flip brings \(\epsilon_2\) back into alignment with \(\epsilon_1\), and the final noise estimate is their average, \(\frac{\epsilon_1 + \epsilon_2}{2}\).
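A minimal sketch of the combined noise estimate (CFG omitted for brevity; `unet` and the embedding arguments are stand-ins):

```python
import torch

def anagram_noise(unet, xt, t, emb1, emb2):
    """Average the upright estimate under prompt 1 with the flipped-back
    estimate of the upside-down image under prompt 2."""
    eps1 = unet(xt, t, emb1)
    flipped = torch.flip(xt, dims=[-2])                    # flip along height
    eps2 = torch.flip(unet(flipped, t, emb2), dims=[-2])   # flip estimate back
    return (eps1 + eps2) / 2
```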
The anagram results I generated are shown below:
Combination from an oil painting of an old man & an oil painting of people around a campfire
Combination from a photo of the amalfi cost & a lithograph of waterfalls
Combination from a lithograph of a skull & an oil painting of a snowy mountain village
Here I use diffusion models to hybridize images so that they look like one thing close up and another from far away (the far-away view is simulated by blurring with a Gaussian). This is achieved by computing noise estimates from two different prompts, similar to the previous part, and combining the low frequencies of one estimate with the high frequencies of the other, as sketched below.
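A sketch of the frequency split, using torchvision's Gaussian blur as the low-pass filter (the kernel size and sigma here are illustrative defaults, not necessarily the values I used):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, xt, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
    """Low frequencies from the 'far away' prompt's noise estimate,
    high frequencies from the 'close up' prompt's estimate."""
    eps_far = unet(xt, t, emb_far)
    eps_near = unet(xt, t, emb_near)
    low = TF.gaussian_blur(eps_far, kernel_size, sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size, sigma)
    return low + high
```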
Hybrids of Skull & Waterfall
Hybrids of Rocket & Pencil
Hybrids of Hipster Barista & Amalfi Coast
In this part of the project, I trained a diffusion model on the MNIST dataset to generate images of MNIST digits. This was achieved by first training a UNet to do single-step denoising, then training the UNet to iteratively denoise by adding time conditioning and class conditioning, reversing the effects of noise on the image data.
I start by building a simple one-step denoiser (a UNet) tasked with mapping noisy images back to their clean states, with downsampling and upsampling blocks plus skip connections to preserve crucial image features during training. The UNet's input is an image with some level of added noise, and its output is a prediction of what the denoised image should look like; the MNIST dataset consists of 28x28-pixel black-and-white images of digits. Below is a diagram illustrating the workflow of the UNet.
The UNet was trained using an L2 loss minimizing the difference between the denoised output and the original image, which prepares the model for more complex operations involving varied degrees of noise.
The denoiser \(D_{\theta}(z)\) aims to minimize the L2 loss between the denoised image and the clean original image, with loss function \(L = \mathbb{E}_{z,x} \Vert D_{\theta}(z) - x \Vert^2\), where \(\mathbb{E}\) is the expectation over the distribution of clean images \(x\) and their corresponding noisy versions \(z\).
Noisy training pairs \((z, x)\) were generated by adding a random noise vector \(\epsilon\), scaled by a predetermined noise level \(\sigma\), to each clean image \(x\): \(z = x + \sigma\epsilon\). This formula was fundamental for simulating various noise levels to enhance the UNet's robustness.
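In code this is a one-liner; a sketch:

```python
import torch

def add_noise(x, sigma):
    """Make a noisy training input z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

# e.g. z = add_noise(x, 0.5) for the sigma = 0.5 training setting below
```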
The UNet was then adapted to predict the noise \(\epsilon\) added to the clean original image, effectively learning the reverse of the initial noise-mapping process: \( L = \mathbb{E}_{\epsilon,z} \Vert \epsilon_{\theta}(z) - \epsilon \Vert^2 \)
This facilitated diffusion: with a schedule of noise levels introduced, the model learned to adjust its predicted noise based on the specific step in the diffusion process, improving on single-step denoising and allowing the model to handle a broader range of noise intensities effectively.
Below are the noisy images produced at various sigma values (\([0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\)) for all the digits:
I ran the model over several epochs of training, generating new noisy images in each epoch to ensure a diverse range of challenges for the denoiser. I denoised noisy MNIST digits \(z\) generated from \(x\) using the formula above with \(\sigma=0.5\), and fine-tuned the model with hidden dimension \(D=128\), a batch size of 256, 5 epochs, and an Adam optimizer using mean squared error as the loss function and a learning rate of \(1 \times 10^{-4}\). The resulting losses over the gradient-descent steps are recorded below, after a sketch of the training loop.
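A sketch of the training loop under those settings (the `UNet` model passed in is assumed to match the spec's unconditional architecture; device handling omitted):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(unet, sigma=0.5, epochs=5, batch_size=256, lr=1e-4):
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    loader = DataLoader(ds, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for x, _ in loader:                      # class labels unused here
            z = x + sigma * torch.randn_like(x)  # fresh noise every batch
            loss = ((unet(z) - x) ** 2).mean()   # L2 / MSE denoising loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```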
After specific training intervals (I chose the 1st and 5th epochs), the denoised images were visualized, as shown below. We can clearly see that the denoised images after epoch 5 are clearer and more in line with the original images, indicating that the denoising process improves with more epochs.
Denoised Images after Epoch 1
Denoised Images after Epoch 5
Afterwards, I performed denoising at sigma values the model had not been trained on, with the following results; we can see that the more noise there is, the harder it is to restore the image to the original (for now).
The UNet is now conditioned on the timestep \(t\), requiring me to modify the network to include timestep embeddings: fully connected blocks (FCBlocks) integrate the timestep information into the network, allowing the model to adjust its prediction based on diffusion progress. The modified architecture is shown below (from the project spec).
And based on the following pseudocode from the spec:
I defined the important components, including the FCBlock used to inject the timestep \(t\) into the UNet: timesteps are normalized and passed through these blocks to modulate features throughout the network layers. Training generates noisy images via the forward process \[x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\] and minimizes the noise-prediction error by gradient descent (\(\nabla_\theta \Vert \epsilon - \tilde{\epsilon} \Vert^2\)), so the network learns to predict the noise \(\epsilon\) mapped into the clean image; generating a well-restored image \(x\) from pure noise then becomes possible through the reverse, iterative and cumulative denoising process.
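A sketch of the conditioning block, assuming the spec's FCBlock is a small two-layer MLP (the layer choices here are illustrative):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Map the normalized timestep t/T to a conditioning vector that is
    broadcast over a feature map deeper in the UNet."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t has shape (batch, in_dim); the (batch, out_dim) output is reshaped
        # to (batch, out_dim, 1, 1) before being combined with feature maps.
        return self.net(t)
```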
I start from \(x_0\), and for each timestep \(t \in \{0, 1, \cdots, T\}\), noise is added on top of the previous image, eventually producing the noisy image \(x_t\).
With the modifications in place, I trained the new model with fine-tuned parameters: hidden dimension \(D=64\), a batch size of 128, 20 epochs, and an Adam optimizer using mean squared error as the loss function and an initial learning rate of \(1 \times 10^{-3}\), decayed on a schedule specified by \(\gamma = 0.1^{1/\text{number of epochs}} = 0.1^{\frac{1}{20}}\). The resulting losses over the gradient-descent steps are recorded below.
Then I sample from the new model, following the sampling algorithm shown below.
The process starts with a completely noised image \(z_T \sim N(0, I)\), which ensures it does not resemble the original data, and progresses from \(t=T\) down to \(t=0\): at each step the UNet \(\epsilon_{\theta}\) predicts the noise \(\epsilon\) to subtract from the current noisy image, denoising it through the new architecture until we retrieve a clean (best-restored) image resembling the original data. I also implemented the per-timestep noise variance \(\sigma(t)\), which determines how much noise the model expects to eliminate at each step; a sketch of the loop follows.
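A sketch of the sampling loop (the schedule tensors `betas`, `alphas`, and `alphas_cumprod` are assumed precomputed and indexed 0..T, with \(\bar{\alpha}_0 = 1\)):

```python
import torch

@torch.no_grad()
def sample(unet, T, betas, alphas, alphas_cumprod, shape=(16, 1, 28, 28)):
    """Start from pure noise z_T ~ N(0, I) and denoise step by step."""
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        t_norm = torch.full((shape[0], 1), t / T)     # normalized timestep
        eps = unet(x, t_norm)                         # predicted noise
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        # Clean-image estimate, then the DDPM posterior mean.
        x0_est = (x - (1 - abar).sqrt() * eps) / abar.sqrt()
        mean = (abar_prev.sqrt() * betas[t] * x0_est
                + alphas[t].sqrt() * (1 - abar_prev) * x) / (1 - abar)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise            # sigma(t) = sqrt(beta_t)
    return x
```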
Then, as in the previous part, I ran the model over 20 epochs of training and monitored the results at key epochs 1, 5, 10, 15, and 20 to assess improvements and refine the denoising ability. At each epoch, fresh batches of noisy MNIST digits were generated using the formula \(z = x + 0.5\epsilon\), giving the model slight difficulties to overcome through robust learning. The results are displayed below.
Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20
The UNet is now conditioned not only on the timestep \(t\) but also on the class labels of the images, which is especially useful for datasets with distinct categories like the MNIST digits \((0, \ldots, 9)\).
Two more FCBlocks are added to the UNet architecture to include class information alongside the timestep, using one-hot encoding to integrate class labels into the diffusion process and guide the denoising steps more accurately.
Dropout is also implemented to prevent the model from relying too heavily on the class information: the class embedding is set to 0 during training roughly 10% of the time, simulating scenarios where class information might be ambiguous.
The algorithm is similar to the one before but with the additional one-hot encoding included, as shown below:
The loss is calculated mostly as in the previous part, adapted to include class information alongside the time conditioning, which enhances the model's specificity in handling different categories of input data.
The new element in the loss function is a class-conditioning vector combined with the time conditioning; this addition allows the noise prediction to depend on both the diffusion timestep and the digit class, expressed as \(L = \mathbb{E}_{\epsilon, z, t, c} \Vert \epsilon_{\theta}(z, t, c) - \epsilon \Vert^2\), where \(c\) represents the class of the digit. With this implemented, below is the plot of loss versus gradient-descent steps.
Then I sample from the new model, following the sampling algorithm shown below.
The process is mostly the same as in the previous part, but with the added one-hot class-label vector and a CFG scale \(\gamma\), which adjusts the noise prediction by blending the unconditional and class-conditional predictions (\(\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\)), allowing the model to generate images aligned with specific digit classes; a sketch follows.
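A sketch of the guided prediction inside the sampling loop (a zeroed one-hot vector plays the role of the unconditional input; the \(\gamma = 5\) default is illustrative):

```python
import torch

def cfg_class_noise(unet, x, t_norm, c_onehot, gamma=5.0):
    """Blend unconditional and class-conditional noise predictions (CFG)."""
    eps_c = unet(x, t_norm, c_onehot)                    # with digit class
    eps_u = unet(x, t_norm, torch.zeros_like(c_onehot))  # class dropped out
    return eps_u + gamma * (eps_c - eps_u)
```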
Then, as in the previous part, I ran the model over 20 epochs of training and monitored the results at key epochs 1, 5, 10, 15, and 20 to assess improvements and refine the denoising ability; the results are displayed below. We see that because we condition on the class, the digits now appear in order, unlike in the previous part.
Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20
I learned many things about diffusion models. I really enjoyed drawing images to see whether the algorithm could recognize the objects I intended, and I liked how different prompts and their combinations create various interesting pictures. I also enjoyed implementing and filling in the model to see MNIST digits denoised back to resemble the original clean images.