Preserving Image Properties Through Initializations in Diffusion Models
WACV 2024
- 1 Revery AI Inc.
- 2 University of Illinois at Urbana-Champaign
Abstract
Retail photography imposes specific requirements on images. For instance, images may need uniform background colors, consistent model poses, centered products, and consistent lighting. Minor deviations from these standards impact a site’s aesthetic appeal, making the images unsuitable for use. We show that Stable Diffusion methods, as currently applied, do not respect these requirements. The usual practice of training the denoiser with a very noisy image and starting inference with a sample of pure noise leads to inconsistent generated images during inference. This inconsistency occurs because it is easy to tell the difference between samples of the training and inference distributions. As a result, a network trained with centered retail product images with uniform backgrounds generates images with erratic backgrounds. The problem is easily fixed by initializing inference with samples from an approximation of noisy images. However, in using such an approximation, the joint distribution of text and noisy image at inference time still slightly differs from that at training time. This discrepancy is corrected by training the network with samples from the approximate noisy image distribution. Extensive experiments on real application data show significant qualitative and quantitative improvements in performance from adopting these procedures. Finally, our procedure can interact well with other control-based methods to further enhance the controllability of diffusion-based methods.
Overview
Observation
Fashion images generated by diffusion models do not look like samples from the training distribution.
PCA-K Offset Training + Inference
Current state-of-the-art methods, such as DDIM, add noise to the image at every training step, so the network only ever sees noisy images during training. At inference, sampling instead starts from pure noise, on the assumption that the small value of αt makes the two cases nearly equivalent. They are not: the training and inference distributions remain distinguishable, and the generated images are inconsistent. To remove this discrepancy, we train with the same initialization procedure that inference uses at its first timestep, which leads to better text control.
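The initialization described above can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes the approximate image is drawn from a K-component PCA model fitted to flattened training images (or latents) and then noised with the standard forward-diffusion formula. The function names, the Gaussian model for the PCA coefficients, and the value of `alpha_bar_T` are all assumptions made for the example.

```python
import numpy as np

def fit_pca_k(samples, k):
    """Fit a k-component PCA model to flattened samples of shape (N, D).

    Returns the mean (D,), the top-k principal directions (k, D), and the
    standard deviation of the data along each direction (k,).
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are the principal directions of the centered data.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    std = singular_values[:k] / np.sqrt(max(len(samples) - 1, 1))
    return mean, vt[:k], std

def sample_offset_init(mean, components, std, alpha_bar_T, rng):
    """Draw an initial noisy sample that keeps a small amount of image signal.

    Assumption: the approximate clean image is the mean plus Gaussian
    coefficients on the top-k directions. With k = 0 this is the mean only.
    """
    coeffs = rng.standard_normal(len(std)) * std
    x0_approx = mean + coeffs @ components
    eps = rng.standard_normal(mean.shape)
    # Standard forward-diffusion noising at the final timestep.
    return np.sqrt(alpha_bar_T) * x0_approx + np.sqrt(1.0 - alpha_bar_T) * eps

# Illustrative usage on random stand-in data (not real images).
rng = np.random.default_rng(0)
train = rng.standard_normal((50, 16)) + 3.0   # hypothetical flattened training set
mean, components, std = fit_pca_k(train, k=4)
alpha_bar_T = 0.005                           # illustrative value, not from the paper
x_T = sample_offset_init(mean, components, std, alpha_bar_T, rng)
```

Under these assumptions, the same `sample_offset_init` draw would serve both as the starting point for sampling and as the network input at the corresponding training timestep, so the two distributions match by construction.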
Results
Qualitatively, Mean Offset Training and Mean Offset Inference provide better text control because the relationship between initialization and text is the same during training and inference.
Application to ControlNet
We apply our method to ControlNet for the task of virtual try-on using our dataset. We mask out the region of a garment from the person and adapt ControlNet to take the masked garment image as the condition.