Mike image + mask
|
Inpainted result for Mike
|
Blackwell Hall img + mask
|
Inpainted result for Blackwell Hall
|
Part 1.7.3: Text-Conditional Image-to-image Translation
The logic of this section is essentially the same as 1.7.2, except that we guide the projection with a text prompt by swapping in a different prompt embedding. As you can see below, the output looks more and more like the target image as the noise level increases.
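To make the change from 1.7.2 concrete, here is a minimal sketch of the loop, assuming the `forward`, `iterative_denoise`, and `strided_timesteps` helpers from the earlier parts; the names, noise levels, and calling conventions are placeholders, not the exact project code:

```python
import torch

def text_conditional_img2img(x_orig, prompt_embeds, forward, iterative_denoise,
                             strided_timesteps, i_start_levels=(1, 3, 5, 7, 10, 20)):
    """Project noised copies of x_orig back to the image manifold, guided by prompt_embeds."""
    results = []
    for i_start in i_start_levels:
        t = strided_timesteps[i_start]   # how much noise to add before projecting
        x_t = forward(x_orig, t)         # noise the original image (Part 1.1)
        # Same CFG denoising loop as Part 1.7, but conditioned on the new text
        # prompt instead of "a high quality photo".
        results.append(iterative_denoise(x_t, i_start=i_start, prompt_embeds=prompt_embeds))
    return results
```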
Campanile
|
Blackwell Hall
|
Salesforce Tower
|
Part 1.8: Visual Anagrams
This was the coolest section to implement because we’re creating an optical illusion: in one orientation the image looks like one thing, and when it is flipped it looks like something entirely different. Once again, we build on iterative denoising with CFG. Since we are creating two images in one, we need two pairs of conditional and unconditional prompt embeddings. For each pair, we run the Unet twice (once conditional, once unconditional) to compute that prompt's noise estimate, using the formula below. Most importantly, we must flip the image before running the Unet for the second prompt and then flip its output back. Once we have the two CFG noise estimates, we average them, and that average is the noise estimate used in the denoising update.
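A rough sketch of one step's noise estimate follows, assuming a `unet(x, t, prompt_embeds)` call and a CFG scale `gamma`; the names and values are illustrative placeholders rather than the exact project code:

```python
import torch

def visual_anagram_noise_estimate(unet, x_t, t, cond_1, uncond_1, cond_2, uncond_2, gamma=7.5):
    """Noise estimate for a visual anagram at one denoising step."""
    # First orientation: CFG noise estimate for prompt 1.
    eps_c = unet(x_t, t, cond_1)
    eps_u = unet(x_t, t, uncond_1)
    eps_1 = eps_u + gamma * (eps_c - eps_u)

    # Second orientation: flip the image, run the Unet with prompt 2, flip the output back.
    x_flip = torch.flip(x_t, dims=[-2])   # flip vertically (upside down)
    eps_c = unet(x_flip, t, cond_2)
    eps_u = unet(x_flip, t, uncond_2)
    eps_2 = torch.flip(eps_u + gamma * (eps_c - eps_u), dims=[-2])

    # Average the two CFG estimates; this drives the usual iterative denoising update.
    return (eps_1 + eps_2) / 2
```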
Part 1.8's Algorithm
|
"an oil painting of an old man"
|
"an oil painting of people around a campfire"
|
"a lithograph of a skull"
|
"an oil painting of a snowy mountain village"
|
"a photo of a dog"
|
"a photo of a hipster barista"
|
Part 1.9: Hybrid Images
In this section, our goal is to trick the viewer into seeing one image up close and a different image from far away. Once again, we use CFG noise estimates from two pairs of conditional and unconditional prompt embeddings. To pull off this trick, we apply a low-pass filter to the noise estimate for the image we want to see from far away, and a high-pass filter to the one we want to see up close. We sum the two filtered estimates to get a new noise estimate and then continue with the usual denoising loop. The formula is also below for reference.
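A rough sketch of the hybrid noise estimate, assuming the same `unet(x, t, prompt_embeds)` interface and using torchvision's Gaussian blur as the low-pass filter; the kernel size, sigma, and helper names are placeholder assumptions:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, cond_far, uncond_far, cond_near, uncond_near,
                          gamma=7.5, kernel_size=33, sigma=2.0):
    """Hybrid-image noise estimate: low-pass the 'far away' prompt, high-pass the 'up close' one."""
    # CFG estimate for the image we want to see from far away (low frequencies).
    e_c = unet(x_t, t, cond_far)
    e_u = unet(x_t, t, uncond_far)
    eps_far = e_u + gamma * (e_c - e_u)

    # CFG estimate for the image we want to see up close (high frequencies).
    e_c = unet(x_t, t, cond_near)
    e_u = unet(x_t, t, uncond_near)
    eps_near = e_u + gamma * (e_c - e_u)

    low = TF.gaussian_blur(eps_far, kernel_size, sigma)                # low-pass filter
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size, sigma)   # high-pass filter
    return low + high
```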
Part 1.9's Algorithm
|
"a lithograph of a skull" and "a lithograph of waterfalls"
|
"a rocket ship" and "a pencil"
|
"a lithograph of a skull" and "a photo of a dog"
|
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising UNet
In this part, our goal is to train a denoiser that takes in a noisy image and recovers the clean image by minimizing the L2 loss. The Unet architecture has several operations that we needed to implement, such as ConvBlock and UpBlock, and these had to be placed in a specific order with the appropriate numbers of input and output channels. Once that was completed, we trained the model for 5 epochs, and all the results are below.
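A minimal sketch of this training loop, with the training noise level, learning rate, and epoch count written as placeholder hyperparameters:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, dataloader, sigma=0.5, num_epochs=5, lr=1e-4, device="cuda"):
    """Train a single-step denoiser on (clean, noisy) pairs with an L2 objective."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    losses = []
    for epoch in range(num_epochs):
        for x, _ in dataloader:                  # digit images; labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # noising process: z = x + sigma * eps
            loss = F.mse_loss(unet(z), x)        # L2 loss between denoised output and clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```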
|
Training Loss Curve Plot
|
Results after epoch 0
|
Results after epoch 5
|
Below is the test digit at varying noise levels. The last image is the 1.0 case, shown separately because it was cut off in the screenshot.
Part 2: Training a Diffusion Model
First, we add time conditioning to the Unet, which involves adding two FCBlocks based on t. We then add their outputs to the appropriate blocks, following the figure in the spec. At this point, when we sample, we only use the unconditional noise estimate. In part 2.4, we add class conditioning to the Unet, which requires two more FCBlocks that take the class parameter c. We use a 10% dropout rate, where c is set to all zeros so the model also learns the unconditional case. We multiply the block's activations by the class FCBlock output before adding the time-conditioning result from the previous part.
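A rough sketch of how this conditioning could be wired into the forward pass, with hypothetical names for the FCBlocks and the decoder activations they modulate:

```python
import torch

def conditioned_features(unflatten_out, up1_out, t, c,
                         t_fc1, t_fc2, c_fc1, c_fc2, p_uncond=0.1):
    """Apply class (multiplicative) and time (additive) conditioning to two decoder features.
    Names and shapes are illustrative; FCBlock outputs are reshaped for broadcasting."""
    # Class-conditioning dropout: zero out c ~10% of the time so the model
    # also learns the unconditional noise estimate (needed for CFG at sampling time).
    if torch.rand(1).item() < p_uncond:
        c = torch.zeros_like(c)

    def reshape(v, like):
        return v.view(-1, like.shape[1], 1, 1)   # (B, C) -> (B, C, 1, 1)

    # Scale by the class embedding, then shift by the time embedding.
    h1 = reshape(c_fc1(c), unflatten_out) * unflatten_out + reshape(t_fc1(t), unflatten_out)
    h2 = reshape(c_fc2(c), up1_out) * up1_out + reshape(t_fc2(t), up1_out)
    return h1, h2
```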
Training Loss Curve Plot
|
Results after epoch 0 (top) & epoch 5 (bottom)
|
The results after implementing class conditioning are below.
Training Loss Curve Plot
|
Results after epoch 0 (top 4 rows) & epoch 5 (bottom 4 rows)
|