Training an Ideogram 4 Style LoRA with AI-Toolkit (Structured JSON Captions)

A complete, reproducible workflow for training a style LoRA on Ideogram 4 using Ostris’ AI-Toolkit, on a single 24 GB consumer GPU.

Jun 11, 2026

This guide focuses on the part most other tutorials skip: structured JSON captioning, which is what makes Ideogram 4 LoRAs actually behave.

The example throughout is a Baroque chiaroscuro / Old Master oil-painting style, modeled on a public-domain artist (think Caravaggio, d. 1610). The method itself is style-agnostic.

A note on choosing a style. I deliberately use a long-dead, public-domain master here instead of a living artist. It sidesteps the ethical and licensing mess around imitating contemporary creators, the source images are freely usable, and a high-contrast style like tenebrism is genuinely a great teaching subject for Ideogram 4. If you adapt this guide, please pick your subject responsibly.

Why Ideogram 4 is different

Ideogram 4 is not a generic character-LoRA target. It is unusually sensitive to prompt structure, text placement, and layout description. If you caption it like an SDXL or Flux dataset (a flat sentence of tags), you tend to get:

weak or mutated text/labels
drifting composition
a LoRA that only works on near-duplicates of your training images

The fix is to caption in a structured format that mirrors how Ideogram 4 internally reasons about an image: a high-level description, an explicit style block, and a compositional breakdown with bounding boxes. More on that below — it’s the heart of this guide.

1. Requirements

Hardware

NVIDIA GPU. This guide is tuned for 24 GB (RTX 4090 / 3090).
With the low_vram + qfloat8 settings below, VRAM usage during training sits at about 17 GB of the 24 GB. On a 4090, expect roughly 100% GPU load, ~380 W, and ~66 °C sustained.
12 GB is possible but requires layer offloading and/or script tweaks, and is much slower.

Software

Python ≥ 3.10 (3.12 recommended)
Git, a Python venv
An NVIDIA driver + CUDA toolchain compatible with PyTorch 2.9.x / CUDA 12.8

2. Install AI-Toolkit

Linux

git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python3 -m venv venv
source venv/bin/activate

# install torch first
pip3 install --no-cache-dir torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
  --index-url https://download.pytorch.org/whl/cu128
pip3 install -r requirements.txt

Windows

git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python -m venv venv
.\venv\Scripts\activate

pip install --no-cache-dir torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 ^
  --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

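If the Windows install gives you trouble, the community “AI-Toolkit Easy Install” script (Tavris1) handles the dependency mess for you. Note that the ^ line continuation above is cmd.exe syntax; in PowerShell, use a backtick (`) instead or put the command on one line.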

The base model is pulled automatically from the Hugging Face repo ideogram-ai/ideogram-4-fp8 when training starts (the config below references it directly), so you don’t need to download weights manually.

3. Build your dataset

Images

30–60 images is a good range for a style. For an Old Master, high-quality public-domain reproductions (museum open-access collections, Wikimedia Commons) are ideal.
Keep resolution at or above your training bucket. This config trains at 1024; multi-resolution buckets (512 / 768 / 1024) also work well and improve robustness to crop/scale.
Visual diversity matters more than count: vary the subject, framing (close-up portrait, half-length, full scene), and number of figures so the LoRA learns the style (lighting, palette, brushwork) rather than memorizing specific paintings.
For a chiaroscuro style specifically, make sure your set isn’t all one composition type — include single figures, multi-figure scenes, and a few still-life details so the dramatic lighting generalizes.

Folder layout

E:\ai-toolkit\datasets\baroque_chiaroscuro\
├── 001.png
├── 001.json
├── 002.png
├── 002.json
└── ...

Each image has a caption file with the same basename. For Ideogram 4 we use .json, not .txt.
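
Before training, it can help to confirm that every image really has a caption with the same basename. The helper below is a minimal sketch using only the standard library; the function name find_unpaired and the set of image extensions are illustrative choices, not part of AI-Toolkit.

```python
from pathlib import Path

# Image extensions to look for; an assumption, adjust to your dataset.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def find_unpaired(folder):
    """Return (image basenames with no .json, .json basenames with no image)."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem for p in files if p.suffix.lower() == ".json"}
    return sorted(images - captions), sorted(captions - images)
```

Calling find_unpaired on the dataset folder should return two empty lists when every image and caption is paired.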

4. The key part — structured JSON captions

This is what makes Ideogram 4 LoRAs work. Instead of a flat caption, each image gets a JSON file describing it in three layers: what it is, how it looks (style), and how it’s laid out (composition).

In the config we set:

caption_ext: "json"
default_caption: ""
caption_dropout_rate: 0.05

Caption schema

Here is an example caption file (001.json) for one painting in the set. Use it as a template:

{
  "high_level_description": "A dramatic half-length scene: an older bearded man in dark robes leans over a rough wooden table, his face and hands struck by a single shaft of light from the upper left while the rest of the room falls into deep shadow.",
  "style_description": {
    "aesthetics": "Italian Baroque, tenebrist, intensely dramatic and naturalistic.",
    "lighting": "Single hard light source from the upper left; extreme chiaroscuro with near-black shadows and sharp highlights on skin and fabric.",
    "photo": "Oil on canvas, smooth blended modeling of flesh, warm translucent glazes over a dark ground.",
    "medium": "Oil painting.",
    "color_palette": [
      "#13100C", "#3B2A1E", "#6E4A2E", "#C8956A",
      "#8B1A1A", "#D9B26A", "#E8D5B0"
    ]
  },
  "compositional_deconstruction": {
    "background": "near-black, undefined interior; no visible architecture, the figure emerges from darkness",
    "elements": [
      {
        "type": "obj",
        "bbox": [180, 90, 760, 720],
        "color_palette": ["#3B2A1E", "#6E4A2E", "#C8956A", "#8B1A1A", "#E8D5B0"],
        "desc": "Main figure: an old bearded man in a deep crimson cloak over a white shirt, leaning forward; the light catches his forehead, cheek, and the back of his right hand."
      },
      {
        "type": "obj",
        "bbox": [120, 600, 520, 880],
        "color_palette": ["#3B2A1E", "#6E4A2E", "#D9B26A", "#E8D5B0"],
        "desc": "Foreground still life on the table: an open book with worn pages, a brass bowl, and a single guttering candle, all lit from the same upper-left source."
      },
      {
        "type": "obj",
        "bbox": [620, 200, 900, 760],
        "color_palette": ["#13100C", "#3B2A1E", "#C8956A"],
        "desc": "Right-side shadow: a second figure's hand and shoulder barely emerging from the darkness, mostly lost in shadow."
      }
    ]
  }
}

What each part does, and why

high_level_description — one sentence describing the whole image, the way you’d actually prompt it. This anchors the global composition.
style_description — the block that carries your style signal. Keep aesthetics, lighting, photo/render type, and medium consistent across the entire dataset; this consistency is what the trigger word ends up bound to. For a chiaroscuro style, the lighting field is doing the heaviest lifting — describe the single-source, high-contrast lighting the same way every time. The color_palette (hex list) gives Ideogram 4 concrete color anchors (here: dark grounds, warm flesh, crimson, muted gold).
compositional_deconstruction — the part that prevents layout drift and label mutation:
- background describes the canvas/setting (here, the signature undefined black ground).
- each element has a bbox = [x1, y1, x2, y2] in pixel coordinates (top-left corner, then bottom-right corner, measured from the image’s top-left), a local color_palette, and a desc.
- If the image contains any visible text (an inscription, a signature, lettering on a book), transcribe it exactly. Ideogram 4’s whole strength is clean text; if your captions ignore on-image text, the LoRA learns to garble it.
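
Because a single malformed caption can quietly weaken the dataset, a small structural check is worth running over every file. The following is a sketch under stated assumptions: it only checks the three top-level keys, #RRGGBB hex strings, and that each bbox lies inside the image, as described above. The name validate_caption is hypothetical, and this is not an official Ideogram or AI-Toolkit validator.

```python
import re

HEX_RE = re.compile(r"^#[0-9A-Fa-f]{6}$")
REQUIRED_KEYS = (
    "high_level_description",
    "style_description",
    "compositional_deconstruction",
)


def _is_hex(value):
    """True for strings of the form #RRGGBB."""
    return isinstance(value, str) and bool(HEX_RE.match(value))


def validate_caption(caption, width, height):
    """Return a list of problems in one parsed caption dict (empty list = OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in caption:
            problems.append(f"missing key: {key}")

    style = caption.get("style_description", {})
    for color in style.get("color_palette", []):
        if not _is_hex(color):
            problems.append(f"style palette: bad hex color {color!r}")

    comp = caption.get("compositional_deconstruction", {})
    for i, element in enumerate(comp.get("elements", [])):
        bbox = element.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            problems.append(f"element {i}: bbox must be [x1, y1, x2, y2]")
        else:
            x1, y1, x2, y2 = bbox
            # Corners must be ordered and stay inside the image.
            if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
                problems.append(f"element {i}: bbox {bbox} inverted or outside image")
        for color in element.get("color_palette", []):
            if not _is_hex(color):
                problems.append(f"element {i}: bad hex color {color!r}")
        if not element.get("desc"):
            problems.append(f"element {i}: empty desc")
    return problems
```

To use it, load each caption with json.load, pass the image's actual width and height, and review any file that returns a non-empty list.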

Practical captioning tips

Be consistent in vocabulary. The same style words in every file = a stronger, cleaner concept.
Don’t bake the trigger word into the caption. The trigger (Baroque Chiaroscuro Style) is injected by the config — keep captions purely descriptive.
bboxes don’t need to be pixel-perfect, but they should reflect the real arrangement (which third of the canvas, relative size). The point is teaching spatial relationships and where the light falls.
Generating these JSONs by hand is slow — a vision model (GPT-4o, a local VLM, etc.) prompted with this exact schema will produce them in batch. Always spot-check the lighting description and any text transcription.
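
Two of the tips above (consistent vocabulary, no trigger word in captions) are easy to audit mechanically, which is especially useful after batch generation with a vision model. This is a minimal sketch: audit_captions is a hypothetical helper, and the list of style fields it compares is taken from the example schema in this guide.

```python
import json
from collections import Counter


def audit_captions(captions, trigger_word,
                   fields=("aesthetics", "lighting", "photo", "medium")):
    """Audit a dict of {filename: parsed caption dict}.

    Returns the files whose caption text contains the trigger word, plus a
    Counter of distinct values per style field (many distinct values for a
    field suggests inconsistent vocabulary).
    """
    needle = trigger_word.lower()
    leaks = [
        name for name, cap in sorted(captions.items())
        if needle in json.dumps(cap).lower()
    ]
    field_values = {
        field: Counter(
            cap.get("style_description", {}).get(field, "")
            for cap in captions.values()
        )
        for field in fields
    }
    return {"trigger_leaks": leaks, "field_values": field_values}
```

A field such as medium ideally shows a single value across the dataset; lighting will vary more, but the core wording should still repeat.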

5. The training config

Save this as ideogram4_chiaroscuro.yaml. Paths, trigger word, and dataset folder are the parts you change.

job: "extension"
config:
  name: "Baroque Chiaroscuro Ideogram"
  process:
    - type: "diffusion_trainer"
      training_folder: "E:\\ai-toolkit\\output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: "Baroque Chiaroscuro Style"
      performance_log_every: 10
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 12
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "E:\\ai-toolkit\\datasets\\baroque_chiaroscuro"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "json"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          flip_x: false
          flip_y: false
          num_repeats: 1
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 2500
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "linear"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: true
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: "ideogram-ai/ideogram-4-fp8"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "ideogram4"
        low_vram: true
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 768
        height: 768
        samples: []
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 30
        num_frames: 1
        fps: 1
meta:
  name: "[name]"
  version: "1.0"

The settings that matter

Network

linear: 16 / linear_alpha: 16 — rank 16 / alpha 16 is a solid, light style LoRA. For a more complex or higher-fidelity style, rank/alpha 32 is the common step up (at higher VRAM cost).
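
To make the rank/alpha trade-off concrete: in the standard LoRA formulation, the low-rank update is scaled by alpha / rank, and each adapted linear layer gains rank × (d_in + d_out) trainable parameters. The sketch below just does that arithmetic. The 3072-wide layer in the usage note is a made-up size for illustration, not Ideogram 4's actual dimension, and AI-Toolkit's internals may differ in detail.

```python
def lora_scale(rank, alpha):
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_params_per_layer(rank, d_in, d_out):
    """Trainable parameters added to one linear layer.

    Matrix A is (rank x d_in) and matrix B is (d_out x rank).
    """
    return rank * (d_in + d_out)
```

With rank 16 / alpha 16 the scale is 1.0, and moving to rank 32 / alpha 32 keeps the scale at 1.0 while doubling the parameter count per layer (for a hypothetical 3072 × 3072 layer, from 98,304 to 196,608).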

Model — leave these alone on your first run

arch: "ideogram4" and name_or_path: "ideogram-ai/ideogram-4-fp8" — required for this model.
quantize: true + qtype: "qfloat8", plus the same for the text encoder (quantize_te / qtype_te), and low_vram: true. This trio is what lets Ideogram 4 train on 24 GB. Trying to “improve quality” by disabling quantization or low-VRAM mode is the #1 way to OOM.
layer_offloading: false here because 24 GB is enough. On 12–16 GB, set it to true and tune the two layer_offloading_*_percent values until training fits.

Training

steps: 2500, lr: 0.0001, adamw8bit, noise_scheduler: flowmatch, timestep_type: linear: this combination is a safe Ideogram 4 baseline.
content_or_style: "balanced" — appropriate for a style LoRA that still needs to respect composition. Push toward style only if specific subjects/content are bleeding through too strongly.
ema_config.use_ema: true, ema_decay: 0.99 — EMA smooths the weights and gives a more stable final LoRA. Worth keeping.
train_text_encoder: false — keep the TE frozen; you’re teaching a visual style, and freezing it protects Ideogram 4’s text-rendering ability.
cache_latents_to_disk: true + cache_text_embeddings: true — caches the heavy preprocessing so subsequent runs start faster.
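
For intuition on what ema_decay: 0.99 means, here is the textbook exponential moving average update on a plain list of numbers. It is a sketch of the general technique only; AI-Toolkit's actual EMA code may add warmup or other details not shown here, and both function names are hypothetical.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """One EMA step: keep `decay` of the old average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def ema_horizon(decay):
    """Rough number of recent steps that dominate the average: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)
```

With a decay of 0.99 the horizon is about 100 steps, so each saved EMA checkpoint reflects roughly the last hundred optimizer steps rather than a single noisy one.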

Saving

save_every: 250, save_format: "diffusers", keeping up to 12 checkpoints (so all 10 saves from a 2500-step run are retained). Saving every 250 steps lets you pick the best checkpoint later instead of trusting the final step. That matters because the sweet spot for style is often before the end.
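
The checkpoint schedule and the number of passes over the dataset follow from simple arithmetic. The sketch below assumes saves land on exact multiples of save_every and that each step consumes batch_size images; the 40-image dataset in the usage note is only an example within the 30–60 range suggested earlier.

```python
def checkpoint_steps(total_steps, save_every):
    """Steps at which a checkpoint is written, assuming exact multiples."""
    return list(range(save_every, total_steps + 1, save_every))


def passes_over_dataset(total_steps, num_images, batch_size=1):
    """How many times, on average, each image is seen during training."""
    return total_steps * batch_size / num_images
```

For this config, checkpoint_steps(2500, 250) yields ten checkpoints (250, 500, and so on up to 2500), and passes_over_dataset(2500, 40) gives 62.5 passes over a 40-image set.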

6. Run it

From the AI-Toolkit Web UI: create a new job, paste/import the config, and start. Or from CLI:

python run.py ideogram4_chiaroscuro.yaml

Expected behavior on a 4090 (24 GB):

VRAM: ~17 GB used
GPU load: ~100%, power ~380 W, temp ~60–70 °C
Time: roughly 1.5–2 hours per 1000 steps, depending on resolution and caching; by linear extrapolation, budget about 3.75–5 hours for the full 2500 steps.
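
If you change the step count, the same per-1000-step figure gives a quick planning estimate. This is a plain linear extrapolation from the range quoted above, not a benchmark; estimate_hours is a hypothetical helper, and real runs will vary with resolution, caching, and hardware.

```python
def estimate_hours(steps, hours_per_1000=(1.5, 2.0)):
    """Return a (low, high) estimate of training time in hours."""
    low, high = hours_per_1000
    return (steps / 1000 * low, steps / 1000 * high)
```

For example, estimate_hours(2500) returns (3.75, 5.0).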

7. Sampling / testing

The config above has disable_sampling: true to save VRAM and time during the run. To watch progress instead, set disable_sampling: false and add validation prompts under sample.samples.

Validation prompts: don’t only test things that look like your dataset. A LoRA that only renders near-duplicates isn’t ready. Test:

close-up portrait, half-length, and full multi-figure compositions
subjects your dataset never contained (e.g. a modern object or a landscape) rendered in the style — this is the real test of generalization
a prompt with on-image text if text matters to your use case (this is where Ideogram 4 LoRAs commonly fail)
simple prompts and complex multi-element ones
3–5 seeds per prompt
multiple LoRA strengths (0.5 / 0.75 / 1.0)
the same prompts with the LoRA off, as a baseline
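
The checklist above is effectively a grid of prompts × seeds × LoRA strengths. A small generator keeps that grid systematic so no combination gets skipped. This sketch only builds the list of test cases; it does not call any inference API, and it treats a strength of 0.0 as the "LoRA off" baseline, which is an assumption about how your inference tool handles strength.

```python
from itertools import product


def build_test_matrix(prompts, seeds, strengths=(0.0, 0.5, 0.75, 1.0)):
    """Return one dict per (prompt, seed, strength) combination."""
    return [
        {"prompt": prompt, "seed": seed, "lora_strength": strength}
        for prompt, seed, strength in product(prompts, seeds, strengths)
    ]
```

Two prompts with three seeds and the four default strengths produce 24 test cases; rendering them in a fixed order makes side-by-side comparison across checkpoints much easier.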

8. Common pitfalls

Garbled / mutating text → your captions ignored on-image text, or you overtrained. Transcribe visible text in the desc fields and pick an earlier checkpoint.
Layout drift → weak compositional deconstruction. Add/clean up the bbox elements.
Lighting collapses to flat/even → your lighting descriptions weren’t consistent across the dataset, or content_or_style is too far toward content. Tighten the captions and lean toward style.
Overtraining → the LoRA starts damaging Ideogram 4’s native strengths (clean text, spatial obedience, prompt following). If late checkpoints look “burned,” go back to a 1000–1750-step save.
OOM → you disabled quantization or low-VRAM, raised rank too high, or pushed resolution. Re-enable quantize/quantize_te/low_vram, or turn on layer_offloading.
LoRA only works on dataset look-alikes → not enough visual diversity, too many steps, or captions that don’t generalize.

9. TL;DR

Install AI-Toolkit (torch 2.9.1 / cu128).
Collect a diverse, consistent style dataset (public-domain sources keep you safe).
Caption each image as structured JSON — high-level + style block + compositional breakdown with bboxes, transcribing any visible text.
Use the config above: rank 16, ideogram4 arch, fp8 quantization, low VRAM, EMA, 2500 steps @ 1e-4.
Save every 250 steps and pick the best checkpoint, not the last one.
Validate on varied compositions and on subjects outside your dataset.

That structured-caption step is the difference between an Ideogram 4 LoRA that holds layout, lighting, and text, and one that falls apart. Don’t skip it.

Luca Cris
