ArXiv Preprint | Source Code (Github) | Trained Concept Sliders | Demo Colab: Sliders Inference | Huggingface Demo
Artists spend significant time crafting prompts and finding seeds to generate a desired image with text-to-image models. However, they also need more nuanced, fine-grained control over attribute strengths, such as eye size or lighting, in their generated images. Modifying the prompt to adjust an attribute disrupts the overall image structure. Artists require expressive control that maintains coherence.
To enable precise editing without changing structure, we present Concept Sliders: plug-and-play low-rank adaptors applied on top of pretrained models. Using simple text descriptions or a small set of paired images, we train Concept Sliders to represent the direction of a desired attribute. At generation time, these sliders can be used to control the strength of the concept in the image, enabling nuanced tweaking.
The ability to precisely modulate semantic concepts during image generation and editing unlocks new frontiers of creative expression for artists using text-to-image diffusion models. As evidenced by recent discourse within artistic communities, limitations in concept control hinder creators' capacity to fully realize their vision with these generative technologies. Artists also point out that these models sometimes generate blurry, distorted images.
Modifying prompts tends to drastically alter image structure, making fine-tuned tweaks to match artistic preferences difficult. For example, an artist may spend hours crafting a prompt to generate a compelling scene, yet lack the ability to softly adjust subtler concepts like a subject's precise age or a storm's ambience to realize their creative goals. More intuitive, fine-grained control over textual and visual attributes would empower artists to tweak generations for nuanced refinement. In contrast, our Concept Sliders enable nuanced, continuous editing of visual attributes by identifying interpretable latent directions tied to specific concepts. By simply tuning a slider, artists gain finer-grained control over the generative process and can better shape outputs to match their artistic intentions.
We propose two types of training: using text prompts alone, and using image pairs. For concepts that are hard to describe in text, or that are not understood by the model, we propose image-pair training. We first discuss training Textual Concept Sliders.
The idea is simple but powerful: the pretrained model Pθ*(x) has some pre-existing probability of generating a concept t, so our goal is to learn a low-rank update to the layers of the model, thereby forming a new model Pθ(x) that reshapes this distribution by reducing the probability of attribute c− and boosting the probability of attribute c+ in an image conditioned on t, relative to the original pretrained model:
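A sketch of this objective in LaTeX notation, writing c_t for the target concept t and η for the strength of the edit:

P_\theta(X \mid c_t) \;\leftarrow\; P_{\theta^*}(X \mid c_t) \left( \frac{P_{\theta^*}(c_+ \mid X)}{P_{\theta^*}(c_- \mid X)} \right)^{\eta}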
This is similar to the motivation behind compositional energy-based models. In diffusion models, it leads to a straightforward fine-tuning scheme that modifies the noise prediction by subtracting a component conditioned on c− and adding a component conditioned on c+ for the target concept:
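Concretely (a sketch, with ε_{θ*} denoting the frozen pretrained noise predictor and η the edit strength):

\epsilon_\theta(X, c_t, t) \;\leftarrow\; \epsilon_{\theta^*}(X, c_t, t) + \eta \left( \epsilon_{\theta^*}(X, c_+, t) - \epsilon_{\theta^*}(X, c_-, t) \right)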
We query the frozen pretrained model to predict the noise for the target prompt and for the control attribute prompts, then we train the edited model to shift its prediction away from the undesired attribute and toward the desired one, applying the ideas of classifier-free guidance at training time rather than at inference. We find that fine-tuning the slider weights with this objective is very effective, producing a plug-and-play adaptor that directly controls the attributes of the target concept.
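A minimal training-step sketch in PyTorch, assuming a diffusers-style UNet; the names frozen_unet, unet_lora, and the prompt embeddings are illustrative, not the released training code:

import torch
import torch.nn.functional as F

def slider_training_step(frozen_unet, unet_lora, noisy_latents, timestep,
                         emb_target, emb_positive, emb_negative, eta=1.0):
    # Frozen pretrained model predicts noise for the target prompt
    # and for the positive/negative attribute prompts.
    with torch.no_grad():
        eps_target = frozen_unet(noisy_latents, timestep,
                                 encoder_hidden_states=emb_target).sample
        eps_pos = frozen_unet(noisy_latents, timestep,
                              encoder_hidden_states=emb_positive).sample
        eps_neg = frozen_unet(noisy_latents, timestep,
                              encoder_hidden_states=emb_negative).sample

    # Guided target: push the target-prompt prediction along (positive - negative),
    # i.e. classifier-free-guidance-style guidance applied at training time.
    guided = eps_target + eta * (eps_pos - eps_neg)

    # Train the LoRA-adapted model to match the guided prediction.
    eps_edited = unet_lora(noisy_latents, timestep,
                           encoder_hidden_states=emb_target).sample
    return F.mse_loss(eps_edited, guided)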
In practice, we notice that concepts are entangled with each other. For instance, when we try to control the age attribute of a person, their race also changes during inference. To avoid such undesired interference, we propose using a small set of preservation prompts to find the direction. Instead of defining the attribute with a single pair of words, we define it using multiple text compositions, finding a direction that changes the target attribute while holding the attributes to preserve constant.
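A sketch of the disentangled objective, where P is the set of preservation prompts (for example, race terms when editing age) and (c±, p) denotes the attribute prompt composed with a preservation prompt:

\epsilon_\theta(X, c_t, t) \;\leftarrow\; \epsilon_{\theta^*}(X, c_t, t) + \eta \sum_{p \in \mathcal{P}} \left( \epsilon_{\theta^*}(X, (c_+, p), t) - \epsilon_{\theta^*}(X, (c_-, p), t) \right)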
To train sliders for concepts that cannot be described with text prompts alone, we propose image-pair-based training. In particular, the training signal comes from the difference between the two images in each pair: the sliders learn to capture the visual concept through the contrast between the image pairs (xA, xB). Our training process optimizes the LoRA applied in both the negative and positive directions. We write εθ+ for the model with the positive LoRA applied and εθ- for the negative case. Then we minimize the following loss:
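A sketch of this loss, where x_A^t and x_B^t denote x_A and x_B noised with ε at timestep t, so the negative LoRA learns to denoise toward x_A and the positive LoRA toward x_B:

\min_\theta \;\; \left\| \epsilon_{\theta^-}\!\left(x_A^t, t\right) - \epsilon \right\|^2 \;+\; \left\| \epsilon_{\theta^+}\!\left(x_B^t, t\right) - \epsilon \right\|^2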
We introduce low-rank constraints on our sliders for two main reasons: first, for efficiency in parameter count and computation; second, to precisely capture the edit direction with better generalization. The disentangled formulation helps isolate the edit from unwanted attributes. We show an ablation study to better understand the role of these two main components of our work.
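A minimal sketch of the low-rank parameterization applied to a linear layer; LoRALinear, rank, and scale are illustrative names, not the released code:

import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base                                            # frozen pretrained weight W
        self.down = nn.Linear(base.in_features, rank, bias=False)   # low-rank factor A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # low-rank factor B
        nn.init.zeros_(self.up.weight)                               # start as a zero (identity) edit
        self.scale = scale                                           # slider strength at inference

    def forward(self, x):
        # W x + scale * B A x : the low-rank term is the learned edit direction,
        # and scale acts as the slider knob (negative values reverse the edit).
        return self.base(x) + self.scale * self.up(self.down(x))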
One of the most interesting aspects of a large-scale generative model such as Stable Diffusion XL is that, although its image output can often suffer from distortions such as warped or blurry objects, the parameters of the model contain a latent capability to generate higher-quality output with fewer distortions than it produces by default. Concept Sliders can unlock these abilities by identifying low-rank parameter directions that repair common distortions.
We study Textual Concept Sliders; our paper includes further quantitative analysis comparing against previous image-editing and text-based prompt-editing methods.
Nuanced visual concepts can be controlled using our Visual Sliders; our paper shows comparisons with customization methods and quantitative evaluations.
StyleGAN latents, especially the stylespace latents, can be transferred to Stable Diffusion. We collect images from StyleGAN and train sliders on those images. We find that diffusion models can learn disentangled stylespace-neuron behavior, enabling artists to control the nuanced attributes that are present in StyleGAN.
A key advantage of our low-rank slider directions is composability: users can combine multiple sliders for nuanced control rather than being limited to one concept at a time. By downloading interesting slider sets, users can adjust multiple knobs simultaneously to steer complex generations.
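A hedged usage sketch of composing several trained sliders at generation time through the diffusers LoRA interface; the exact API depends on the diffusers version, and the checkpoint files and adapter names here are illustrative:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Each slider is a LoRA adaptor; its weight acts as the slider knob,
# and negative values steer the concept in the opposite direction.
pipe.load_lora_weights("age_slider.safetensors", adapter_name="age")
pipe.load_lora_weights("smile_slider.safetensors", adapter_name="smile")
pipe.set_adapters(["age", "smile"], adapter_weights=[-1.0, 0.6])

image = pipe("photo of a person, studio lighting").images[0]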
The preprint can be cited as follows.
Rohit Gandikota, Joanna Materzyńska, Tingrui Zhou, Antonio Torralba, David Bau. "Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models." arXiv preprint arXiv:2311.12092 (2023).
@article{gandikota2023sliders,
  title={Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models},
  author={Rohit Gandikota and Joanna Materzy\'nska and Tingrui Zhou and Antonio Torralba and David Bau},
  journal={arXiv preprint arXiv:2311.12092},
  year={2023}
}