Adam Lucek - FLUX Model for Open-Source AI Image Generation
https://www.youtube.com/watch?v=Drw6tnvtA5I

This detailed summary covers the key aspects of the video, including the goal, the AI models and techniques used, the tools and setup process, and the results.
Video Title: Training FLUX.1 (Black Forest Labs) LoRA Adapter on Faces | Low VRAM Image Generation

Main Goal: To demonstrate how to train AI image generation models on faces so they can reliably recreate pictures of a specific person (the speaker, Adam Lucek, in this case) in various scenarios.

Key Technologies & Concepts:
- FLUX.1 by Black Forest Labs: A 12-billion-parameter rectified flow transformer capable of generating high-quality images from text descriptions. Considered a leading open-source image generation model, it has attracted significant community interest and commercial applications (e.g., PhotoAI.com).
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method. It adds small extra weight matrices (the LoRA adapter) on top of a pre-trained foundation model (FLUX.1). Only these added weights are trained, making the process very efficient and requiring significantly less VRAM than full model fine-tuning. The trained LoRA adapter is a small file that can be plugged into or removed from the main FLUX.1 model at inference time (see the sketch after this list).
- DreamBooth Methodology (Inspired): A technique to personalize text-to-image diffusion models. It involves training on a limited number of input images of a specific subject (typically 3-5; Adam used 10). A unique token combination or "trigger phrase" (the `[trigger]` placeholder) is used during training to associate the subject's features with that phrase, letting the model "hyper-fixate" on activating specific neurons when the trigger is used and reliably recreate the subject.
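To make the low-rank idea concrete, here is a minimal, hypothetical PyTorch sketch of a LoRA layer (not ai-toolkit's actual implementation): the frozen base weight is augmented with a trainable low-rank product B·A, and a scale factor controls how much the adapter contributes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 32, strength: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the foundation weights
            p.requires_grad = False
        # Low-rank factors: only these small matrices are trained.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.strength = strength           # analogous to strength_model at inference

    def forward(self, x):
        # y = frozen base output + strength * low-rank adapter output
        return self.base(x) + self.strength * (x @ self.A.T @ self.B.T)
```

Setting `strength` below 1.0 at inference is exactly the knob used later in ComfyUI to dial the adapter back.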
Tools & Resources:
- ai-toolkit (GitHub repository): Primary Training Tool: github.com/ostris/ai-toolkit. Advantages: designed for efficient training, particularly in low-VRAM environments. Unlike other tools (e.g., Diffusers' DreamBooth example), which can require more than 40 GB of VRAM (Adam saw ~72 GB used on an H100 for basic FLUX training), ai-toolkit can perform LoRA training on FLUX.1 with only 24 GB of VRAM (e.g., on an RTX 4090), thanks to advanced quantization methods. Setup: clone the repository, update its submodules, create and activate a virtual environment, install PyTorch, then install the requirements (the exact commands are collected in the block after this list).
- ComfyUI: Primary Inference Tool: A powerful, modular diffusion-model GUI that uses a node-based workflow for advanced stable-diffusion pipelines and supports FLUX models. Setup: available as a standalone build (Windows) or installed manually (Linux/Windows). Adam provides a JSON workflow file (workflow_adamlora.json) on his Hugging Face model page to easily recreate his inference setup.
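The ai-toolkit setup commands as given in the video (the CUDA 12.1 PyTorch index URL is the one shown; adjust it for your CUDA version):

```bash
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive   # pull in bundled dependencies
python3 -m venv venv
source venv/bin/activate
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt
```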
Training Process (Step-by-Step):
1. Data Preparation: Gather 10 PNG images of the subject (Adam used selfies with varied angles and settings). For each image, create a corresponding .txt file with the exact same name (e.g., image_1.png and image_1.txt). Inside each .txt file, write a simple descriptive prompt for the image, including the placeholder [trigger] where the personalized token will be inserted (e.g., [trigger] in blue graduation clothes). Place all images and text files in a dedicated folder (e.g., lora_me) within the ai-toolkit directory; a sketch of the layout follows.
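A sketch of the expected layout, with illustrative file names and captions (the exact captions are whatever describes your images):

```text
ai-toolkit/
└── lora_me/
    ├── image_1.png
    ├── image_1.txt      # contains: [trigger] in blue graduation clothes
    ├── image_2.png
    ├── image_2.txt      # contains: [trigger] smiling outdoors (illustrative)
    ├── ...
    ├── image_10.png
    └── image_10.txt
```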
2. Configuration File Setup (YAML): Copy the example FLUX LoRA training config file ai-toolkit/config/examples/train_lora_flux_24gb.yaml, rename it (e.g., train_lora_flux_face.yaml), and move it to the root ai-toolkit directory. Edit the YAML file (an abridged sketch follows):
  - name: your desired model name (e.g., "flux_lora_face").
  - training_folder: where output (checkpoints, samples) will be saved (e.g., "output/flux_lora_face").
  - performance_log_every: 200 (view training stats every 200 steps).
  - device: "cuda:0" (the primary GPU).
  - trigger_word: your unique trigger phrase (e.g., "4d4m luc3k", Adam's leetspeak name). This replaces [trigger] in your image prompts.
  - network type: "lora"; linear: 32 (the LoRA rank; higher ranks capture more complexity, such as faces, but use more resources).
  - save_every: 200 (save a checkpoint every 200 steps); max_step_saves_to_keep: 4 (keep only the last 4 checkpoints to save disk space).
  - push_to_hub: false (Adam recommends manual upload).
  - folder_path: path to your image/text data folder (e.g., "./lora_me"); caption_ext: "txt".
  - shuffle_tokens: true (diversifies training with limited data).
  - resolution: 512, 768, 1024 (FLUX handles multiple resolutions).
  - steps: 2000 (total training steps); lr: 4e-4 (learning rate; Adam adjusted it from the default).
  - skip_first_sample: true (prevents generating very poor initial samples).
  - quantize: true (enables 8-bit mixed precision for lower VRAM).
  - model name_or_path: "black-forest-labs/FLUX.1-dev" (the base model).
  - Sample prompts: modify these to include your trigger_word so validation images are generated during training, letting you monitor progress.
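An abridged YAML sketch assembling the settings above. The shipped train_lora_flux_24gb.yaml nests these under a job/process structure, so treat the exact nesting and key names here as approximate and edit the shipped example rather than writing a file from scratch:

```yaml
# Abridged sketch; adapt config/examples/train_lora_flux_24gb.yaml instead of
# writing this by hand -- nesting and defaults follow the shipped example.
config:
  name: flux_lora_face
  process:
    - type: sd_trainer
      training_folder: output/flux_lora_face
      performance_log_every: 200
      device: cuda:0
      trigger_word: 4d4m luc3k
      network:
        type: lora
        linear: 32
      save:
        save_every: 200
        max_step_saves_to_keep: 4
      datasets:
        - folder_path: ./lora_me
          caption_ext: txt
          shuffle_tokens: true
          resolution: [512, 768, 1024]
      train:
        steps: 2000
        lr: 4e-4
        skip_first_sample: true
      model:
        name_or_path: black-forest-labs/FLUX.1-dev
        quantize: true
```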
3. Running the Training Script: Ensure you have been granted access to the FLUX.1 model on Hugging Face and have logged in via the huggingface-cli tool. Execute the training command: python run.py train_lora_flux_face.yaml (using your YAML filename). The script downloads the necessary components, preprocesses the images, and begins training. Training 2000 steps on an A100 (40 GB VRAM) takes approximately 1 hour and 5 minutes, using ~22 GB of VRAM. The commands are collected below.
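The two commands for this step (the token must belong to an account that has accepted the FLUX.1-dev license):

```bash
huggingface-cli login                      # paste your Hugging Face access token
python run.py train_lora_flux_face.yaml    # start LoRA training with your config
```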
Results & Observations:
- The ai-toolkit training process saves validation images at specified intervals, allowing visual monitoring of training progress.
- Initial generations may be poor, but quality improves over the steps.
- Over-training can occur: beyond a certain point, general image quality can degrade (e.g., blurry faces) even as the specific facial features become more accurate. The LoRA adapter starts to over-specialize and lose the base model's generalization capabilities, which is why reducing the adapter's strength at inference can help (see the sketch after this list).
- The final trained LoRA adapter is a .safetensors file (e.g., flux_lora_adam.safetensors) and is relatively small (Adam's was 343.8 MB).
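Conceptually, the inference-time strength knob simply rescales the adapter's contribution before it is added to the frozen base weights. A toy numerical illustration (hypothetical shapes, mirroring the LoRALinear sketch earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(64, 64))                            # frozen base weight
A, B = rng.normal(size=(32, 64)), rng.normal(size=(64, 32))   # trained LoRA factors

def effective_weight(strength: float) -> np.ndarray:
    """W_eff = W_base + strength * B @ A; strength < 1 blends back toward the base."""
    return W_base + strength * (B @ A)

full     = effective_weight(1.0)  # strongest likeness, risks over-specialized artifacts
balanced = effective_weight(0.7)  # the strength_model value Adam settles on
```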
Inference with ComfyUI:
- Load Workflow: Open ComfyUI and load the provided JSON workflow file (workflow_adamlora.json).
- Load Models: The workflow loads the base FLUX.1 model (e.g., flux-1-dev.safetensors), then the trained LoRA adapter (e.g., flux_lora_adam.safetensors).
- Adjust LoRA Strength: The Load LoRA node includes strength_model and strength_clip parameters. Crucial for quality: if the trained model produces degraded results, reducing these strengths (e.g., 0.70 for model, 1.00 for clip) lets the base FLUX.1 model contribute more to overall image quality while still retaining the personalized features from the LoRA adapter. This combats over-training.
- Input Prompt: Enter your desired prompt, including the specific trigger_word you trained with (e.g., 4d4m luc3k as a fancy chef in a fancy restaurant...).
- Generate: Click "Queue Prompt" to generate the image.
- Results: ComfyUI generates the personalized image. Adam shows examples where initial attempts had issues (e.g., hands, background), but after adjusting LoRA strength (e.g., to 70% for the model and 100% for the clip), the output is significantly improved and photo-realistic.
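For readers who prefer a scripted route over ComfyUI's GUI, Hugging Face Diffusers can load the same .safetensors adapter. A minimal sketch, assuming a recent diffusers release with FLUX support and the file names used above (for the Flux pipeline, the LoRA strength is passed via joint_attention_kwargs):

```python
import torch
from diffusers import FluxPipeline

# Load the base FLUX.1-dev model (requires an accepted license on Hugging Face).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Attach the trained LoRA adapter produced by ai-toolkit.
pipe.load_lora_weights("output/flux_lora_face", weight_name="flux_lora_adam.safetensors")

image = pipe(
    "4d4m luc3k as a fancy chef in a fancy restaurant",
    joint_attention_kwargs={"scale": 0.7},  # ~strength_model: blend back toward the base
).images[0]
image.save("chef.png")
```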
Overall Conclusion:
Training personalized AI image generation models with FLUX.1 and LoRA adapters using tools like ai-toolkit is efficient and yields impressive results even with limited hardware and training data. The ability to fine-tune LoRA adapter strength during inference in platforms like ComfyUI is key to optimizing output quality and preventing over-specialization artifacts.
The remainder of these notes is a second, more detailed pass over the same material. The video provides a comprehensive guide on training personalized AI image generation models using the FLUX.1 foundation model and LoRA adaptation, emphasizing efficiency for consumer hardware.
Key Concepts and Technologies:
- FLUX.1 (Black Forest Labs): A 12-billion-parameter rectified flow transformer renowned for high-quality text-to-image generation. It serves as the powerful base model for fine-tuning.
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method. Instead of training the entire large model, LoRA adds small, trainable weight matrices (the LoRA adapter) that sit on top of the base model. Only these small matrices are updated during training, making the process much faster and less VRAM-intensive.
- DreamBooth Methodology: The approach used for training, inspired by DreamBooth, involves providing a limited set of images of a subject along with a unique “trigger phrase” to personalize the model’s output for that specific subject.
Tools Used:
- ai-toolkit (GitHub repository): Primary Training Tool: Praised for its ability to train FLUX.1 LoRA adapters with significantly lower VRAM requirements (as low as 24 GB on an RTX 4090) than other packages like Hugging Face's Diffusers (which can require more than 40 GB for FLUX.1 training). This efficiency comes from the "fancy fun quantization methods" in its backend, and it ships scripts and configurations specifically tailored for FLUX models.
- ComfyUI: Primary Inference Tool: A powerful, node-based GUI for designing and executing advanced stable-diffusion pipelines. It supports FLUX models and is used to load and run the trained LoRA adapters.
Detailed Training Process:
1. Data Preparation (DreamBooth-Inspired): Images: gather a small dataset (Adam used 10) of the subject, showing them clearly in various angles, lighting, and poses. Text Files: for each image (image_1.png, image_2.png, etc.), create a corresponding text file with the exact same name (image_1.txt, image_2.txt, etc.). Prompts: inside each text file, write a simple description of the image, including a placeholder [trigger] where the unique trigger word will be inserted during training, e.g., [trigger] in blue graduation clothes. Store all image and text file pairs in a dedicated folder (e.g., lora_me) within your ai-toolkit directory.
2. ai-toolkit Setup & Configuration: Environment: Adam demonstrates the setup on a Linux cloud instance (JupyterLab with an NVIDIA A100 GPU). Clone & Install: follow the installation steps in the ai-toolkit README (cloning, submodules, creating a virtual environment, installing PyTorch and the requirements). YAML Configuration: copy the train_lora_flux_24gb.yaml example from ai-toolkit/config/examples to the root ai-toolkit folder and rename it (e.g., train_lora_flux_face.yaml). Key settings:
  - name: a descriptive model name (e.g., flux_lora_face).
  - training_folder: the output directory for checkpoints and samples (e.g., output/flux_lora_face).
  - performance_log_every: 200 steps, to monitor progress.
  - device: cuda:0 (assuming a single GPU).
  - trigger_word: your unique word/phrase (e.g., 4d4m luc3k); crucial for personalization.
  - network type: lora; linear: 32 (the LoRA rank; higher values such as 8, 16, or 32 capture complex features like faces more accurately).
  - save_every: 200 (save a checkpoint every 200 steps); max_step_saves_to_keep: 4 (keep only the last 4 checkpoints to save space).
  - push_to_hub: false (Adam recommends manual upload to Hugging Face).
  - folder_path: your prepared data folder (e.g., ./lora_me); caption_ext: txt.
  - shuffle_tokens: true (reorders prompt tokens to diversify training with small datasets).
  - resolution: 512, 768, 1024 (FLUX supports multiple resolutions for better generalization).
  - steps: 2000 (total training steps); lr: 4e-4 (Adam adjusted it from the default for better results with fewer steps).
  - skip_first_sample: true (avoids generating low-quality initial samples).
  - quantize: true (8-bit mixed precision for VRAM optimization).
  - model name_or_path: "black-forest-labs/FLUX.1-dev" (the base model).
  - Sample prompts: customize them to include your trigger_word so the validation images generated during training let you assess progress visually.
3. Executing Training: Log in to Hugging Face via the CLI: huggingface-cli login (paste your access token). Run the training script: python run.py train_lora_flux_face.yaml (using your config filename). The process loads components, applies quantization, and then trains; 2000 steps on an A100 take about 1 hour and 5 minutes. Outputs include generated validation images at each save_every interval and the .safetensors LoRA adapter file (e.g., flux_lora_adam.safetensors). A small dataset sanity-check script follows this list.
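Before launching a run, it can save an hour of GPU time to verify the dataset pairing described in step 1. A hypothetical helper (not part of ai-toolkit) that checks every image has a same-named caption file containing the [trigger] placeholder:

```python
from pathlib import Path

def check_dataset(folder: str = "./lora_me") -> None:
    """Verify each .png has a same-named .txt caption containing '[trigger]'."""
    root = Path(folder)
    problems = []
    for img in sorted(root.glob("*.png")):
        caption = img.with_suffix(".txt")
        if not caption.exists():
            problems.append(f"{img.name}: missing caption file {caption.name}")
        elif "[trigger]" not in caption.read_text():
            problems.append(f"{caption.name}: no [trigger] placeholder")
    print("\n".join(problems) if problems else "Dataset looks consistent.")

if __name__ == "__main__":
    check_dataset()
```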
Results and Optimization (Post-Training):
- Validation Photos: The generated images at each checkpoint allow you to observe how the model learns. Early checkpoints may show poor quality, which gradually improves as the facial likeness is captured.
- Overtraining: The video notes that continued training can sometimes lead to a “degeneration” of general image quality (e.g., blurrier features) even as the facial resemblance improves. This means the LoRA adapter becomes too specialized and loses the base model’s broader capabilities.
- Combating Overtraining (LoRA Strength): During inference, you can counteract overtraining by reducing the strength of the LoRA adapter. Adam suggests values like 0.70 for strength_model, letting the base model's quality shine through while retaining the learned facial features.
- The final .safetensors file is relatively small (Adam's was 343.8 MB), making it easy to share and load.
Inference using ComfyUI (Node-Based Workflow):
- Load Workflow: Adam provides a pre-configured ComfyUI workflow JSON file (workflow_adamlora.json) on his Hugging Face model page; load this file into your ComfyUI interface.
- Model Loading: The workflow includes nodes to load the base FLUX.1 diffusion model, the DualCLIPLoader (for the text encoders), the VAE (Variational AutoEncoder), and the LoRA adapter; this last, crucial node points to your saved .safetensors file.
- Adjust LoRA Strength: The "Load LoRA" node has adjustable strength_model and strength_clip parameters. Experimenting with these (e.g., 0.70 for model, 1.00 for clip) lets you balance facial likeness against overall image quality.
- Prompting: Enter your desired text prompt into the CLIP Text Encode (Positive Prompt) node, making sure to use the same trigger_word you trained with.
- Generate: Click "Queue Prompt" to start the image generation.
Conclusion:
The video successfully demonstrates a streamlined and VRAM-efficient process for training personalized image generation models. By leveraging LoRA with FLUX.1 and user-friendly tools like ai-toolkit and ComfyUI, individuals can create high-quality, custom images without requiring extremely expensive, cutting-edge hardware. The emphasis on careful data preparation and post-training adjustments of LoRA strength highlights best practices for achieving optimal results.