Training an Ideogram 4 Style LoRA with AI-Toolkit (Structured JSON Captions)
A complete, reproducible workflow for training a style LoRA on Ideogram 4 using Ostris’ AI-Toolkit, on a single 24 GB consumer GPU.
This guide focuses on the part most other tutorials skip: structured JSON captioning, which is what makes Ideogram 4 LoRAs actually behave.
The example throughout is a Baroque chiaroscuro / Old Master oil-painting style, modeled on a public-domain artist (think Caravaggio, d. 1610). The method itself is style-agnostic.
A note on choosing a style. I deliberately use a long-dead, public-domain master here instead of a living artist. It sidesteps the ethical and licensing mess around imitating contemporary creators, the source images are freely usable, and a high-contrast style like tenebrism is genuinely a great teaching subject for Ideogram 4. If you adapt this guide, please pick your subject responsibly.
Why Ideogram 4 is different
Ideogram 4 is not a generic character-LoRA target. It is unusually sensitive to prompt structure, text placement, and layout description. If you caption it like an SDXL or Flux dataset (a flat sentence of tags), you tend to get:
weak or mutated text/labels
drifting composition
a LoRA that only works on near-duplicates of your training images
The fix is to caption in a structured format that mirrors how Ideogram 4 internally reasons about an image: a high-level description, an explicit style block, and a compositional breakdown with bounding boxes. More on that below — it’s the heart of this guide.
1. Requirements
Hardware
NVIDIA GPU. This guide is tuned for 24 GB (RTX 4090 / 3090).
With the low_vram + qfloat8 settings below, real usage during training sits around ~17 GB / 24 GB. On a 4090 expect roughly 100% GPU load, ~380 W, ~66 °C sustained.
12 GB is possible but requires layer offloading and/or script tweaks, and is much slower.
Software
Python ≥ 3.10 (3.12 recommended)
Git, a Python venv
An NVIDIA driver + CUDA toolchain compatible with PyTorch 2.9.x / CUDA 12.8
2. Install AI-Toolkit
Linux
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python3 -m venv venv
source venv/bin/activate
# install torch first
pip3 install --no-cache-dir torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
--index-url https://download.pytorch.org/whl/cu128
pip3 install -r requirements.txt
Windows
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python -m venv venv
.\venv\Scripts\activate
pip install --no-cache-dir torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 ^
--index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
If Windows install gives you trouble, the community “AI-Toolkit Easy Install” script (Tavris1) handles the dependency mess for you.
The base model is pulled automatically from the Hugging Face repo ideogram-ai/ideogram-4-fp8 when training starts (the config below references it directly), so you don’t need to download weights manually.
3. Build your dataset
Images
30–60 images is a good range for a style. For an Old Master, high-quality public-domain reproductions (museum open-access collections, Wikimedia Commons) are ideal.
Keep resolution at or above your training bucket. This config trains at 1024; multi-resolution buckets (512 / 768 / 1024) also work well and improve robustness to crop/scale.
Visual diversity matters more than count: vary the subject, framing (close-up portrait, half-length, full scene), and number of figures so the LoRA learns the style (lighting, palette, brushwork) rather than memorizing specific paintings.
For a chiaroscuro style specifically, make sure your set isn’t all one composition type — include single figures, multi-figure scenes, and a few still-life details so the dramatic lighting generalizes.
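As a rough aid for the resolution advice above, the sketch below reads each PNG's dimensions straight from its header and lists files that fall below the training resolution. It uses only the standard library. The function names and the "shorter side must be at least 1024" rule are my own assumptions about what "at or above your training bucket" means, not something AI-Toolkit enforces.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type, 16-23: width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def undersized_pngs(folder, min_side=1024):
    """List (name, width, height) for PNGs whose shorter side is below min_side.

    Assumption: comparing the shorter side against the training resolution.
    """
    small = []
    for path in sorted(Path(folder).glob("*.png")):
        width, height = png_size(path)
        if min(width, height) < min_side:
            small.append((path.name, width, height))
    return small
```

Calling undersized_pngs on the dataset folder returns the files worth replacing with a higher-resolution reproduction before training.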
Folder layout
E:\ai-toolkit\datasets\baroque_chiaroscuro\
├── 001.png
├── 001.json
├── 002.png
├── 002.json
└── ...
Each image has a caption file with the same basename. For Ideogram 4 we use .json, not .txt.
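Because every image needs a caption with the same basename, a quick pairing check before training can save a wasted run. This is a minimal sketch using only the standard library; the function name and the list of image extensions are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your dataset uses.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def find_unpaired(folder, caption_ext=".json"):
    """Return (images_without_caption, captions_without_image) as sorted file-name lists."""
    folder = Path(folder)
    images = {p.stem: p.name for p in folder.iterdir() if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem: p.name for p in folder.iterdir() if p.suffix.lower() == caption_ext}
    missing_captions = sorted(images[stem] for stem in images.keys() - captions.keys())
    orphan_captions = sorted(captions[stem] for stem in captions.keys() - images.keys())
    return missing_captions, orphan_captions
```

Both returned lists should be empty for a clean dataset folder.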
4. The key part — structured JSON captions
This is what makes Ideogram 4 LoRAs work. Instead of a flat caption, each image gets a JSON file describing it in three layers: what it is, how it looks (style), and how it’s laid out (composition).
In the config we set:
caption_ext: "json"
default_caption: ""
caption_dropout_rate: 0.05
Caption schema
Here is an example caption file (001.json) for one painting in the set. Use it as a template:
{
  "high_level_description": "A dramatic half-length scene: an older bearded man in dark robes leans over a rough wooden table, his face and hands struck by a single shaft of light from the upper left while the rest of the room falls into deep shadow.",
  "style_description": {
    "aesthetics": "Italian Baroque, tenebrist, intensely dramatic and naturalistic.",
    "lighting": "Single hard light source from the upper left; extreme chiaroscuro with near-black shadows and sharp highlights on skin and fabric.",
    "photo": "Oil on canvas, smooth blended modeling of flesh, warm translucent glazes over a dark ground.",
    "medium": "Oil painting.",
    "color_palette": [
      "#13100C", "#3B2A1E", "#6E4A2E", "#C8956A",
      "#8B1A1A", "#D9B26A", "#E8D5B0"
    ]
  },
  "compositional_deconstruction": {
    "background": "near-black, undefined interior; no visible architecture, the figure emerges from darkness",
    "elements": [
      {
        "type": "obj",
        "bbox": [180, 90, 760, 720],
        "color_palette": ["#3B2A1E", "#6E4A2E", "#C8956A", "#8B1A1A", "#E8D5B0"],
        "desc": "Main figure: an old bearded man in a deep crimson cloak over a white shirt, leaning forward; the light catches his forehead, cheek, and the back of his right hand."
      },
      {
        "type": "obj",
        "bbox": [120, 600, 520, 880],
        "color_palette": ["#3B2A1E", "#6E4A2E", "#D9B26A", "#E8D5B0"],
        "desc": "Foreground still life on the table: an open book with worn pages, a brass bowl, and a single guttering candle, all lit from the same upper-left source."
      },
      {
        "type": "obj",
        "bbox": [620, 200, 900, 760],
        "color_palette": ["#13100C", "#3B2A1E", "#C8956A"],
        "desc": "Right-side shadow: a second figure's hand and shoulder barely emerging from the darkness, mostly lost in shadow."
      }
    ]
  }
}
What each part does, and why
high_level_description: one sentence describing the whole image, the way you’d actually prompt it. This anchors the global composition.
style_description: the block that carries your style signal. Keep aesthetics, lighting, photo/render type, and medium consistent across the entire dataset; this consistency is what the trigger word ends up bound to. For a chiaroscuro style, the lighting field is doing the heaviest lifting, so describe the single-source, high-contrast lighting the same way every time. The color_palette (hex list) gives Ideogram 4 concrete color anchors (here: dark grounds, warm flesh, crimson, muted gold).
compositional_deconstruction: the part that prevents layout drift and label mutation:
background describes the canvas/setting (here, the signature undefined black ground).
Each element has a bbox = [x1, y1, x2, y2] in pixel coordinates, a local color_palette, and a desc.
If the image contains any visible text (an inscription, a signature, lettering on a book), transcribe it exactly. Ideogram 4’s whole strength is clean text; if your captions ignore on-image text, the LoRA learns to garble it.
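The schema described above lends itself to a mechanical sanity check before training. The sketch below validates one caption dict against the three layers, the hex palettes, and the pixel-coordinate bounding boxes. It is an illustration under my own assumptions: the function name is hypothetical, and the rules (six-digit hex colors, boxes strictly inside the image, non-empty desc) are a strict reading of this guide rather than anything Ideogram 4 or AI-Toolkit is known to enforce.

```python
import re

HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")
TOP_KEYS = ("high_level_description", "style_description", "compositional_deconstruction")
STYLE_KEYS = ("aesthetics", "lighting", "photo", "medium", "color_palette")


def validate_caption(caption, width, height):
    """Return a list of problems found in one caption dict (an empty list means it passed)."""
    problems = []
    for key in TOP_KEYS:
        if key not in caption:
            problems.append(f"missing top-level key: {key}")

    style = caption.get("style_description", {})
    for key in STYLE_KEYS:
        if key not in style:
            problems.append(f"style_description missing: {key}")
    for color in style.get("color_palette", []):
        if not HEX_COLOR.match(color):
            problems.append(f"bad hex color in style palette: {color}")

    comp = caption.get("compositional_deconstruction", {})
    for i, element in enumerate(comp.get("elements", [])):
        for color in element.get("color_palette", []):
            if not HEX_COLOR.match(color):
                problems.append(f"element {i}: bad hex color: {color}")
        if not element.get("desc"):
            problems.append(f"element {i}: empty desc")
        bbox = element.get("bbox", [])
        if len(bbox) != 4:
            problems.append(f"element {i}: bbox must have 4 values")
            continue
        x1, y1, x2, y2 = bbox
        # [x1, y1] is the top-left corner, [x2, y2] the bottom-right, in pixels.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(f"element {i}: bbox {bbox} is inverted or outside the image")
    return problems
```

Load each caption with json.load and pass the image's pixel size as width and height; any non-empty result points at a file to fix by hand.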
Practical captioning tips
Be consistent in vocabulary. The same style words in every file = a stronger, cleaner concept.
Don’t bake the trigger word into the caption. The trigger (Baroque Chiaroscuro Style) is injected by the config, so keep captions purely descriptive.
Bboxes don’t need to be pixel-perfect, but they should reflect the real arrangement (which third of the canvas, relative size). The point is teaching spatial relationships and where the light falls.
Generating these JSONs by hand is slow — a vision model (GPT-4o, a local VLM, etc.) prompted with this exact schema will produce them in batch. Always spot-check the lighting description and any text transcription.
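The consistency tip is easy to check by machine, which helps most when a vision model wrote the captions. This sketch counts the distinct wordings used for each style field across a folder of caption files. The function names are hypothetical, and exact string matching is a deliberately strict assumption: two captions that differ only in punctuation are reported as different.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path

FIELDS = ("aesthetics", "lighting", "photo", "medium")


def style_field_variants(folder):
    """Map each style field to a Counter of the distinct strings used across all caption files."""
    variants = defaultdict(Counter)
    for path in sorted(Path(folder).glob("*.json")):
        caption = json.loads(path.read_text(encoding="utf-8"))
        style = caption.get("style_description", {})
        for field in FIELDS:
            variants[field][style.get(field, "<missing>")] += 1
    return variants


def inconsistent_fields(folder):
    """Return the style fields that appear with more than one distinct wording."""
    return sorted(
        field for field, counts in style_field_variants(folder).items() if len(counts) > 1
    )
```

If inconsistent_fields reports lighting, inspect style_field_variants(folder)["lighting"] to see which wordings are outliers and how many files use each.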
5. The training config
Save this as ideogram4_chiaroscuro.yaml. Paths, trigger word, and dataset folder are the parts you change.
job: "extension"
config:
  name: "Baroque Chiaroscuro Ideogram"
  process:
    - type: "diffusion_trainer"
      training_folder: "E:\\ai-toolkit\\output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: "Baroque Chiaroscuro Style"
      performance_log_every: 10
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 12
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "E:\\ai-toolkit\\datasets/baroque_chiaroscuro"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "json"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          flip_x: false
          flip_y: false
          num_repeats: 1
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 2500
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "linear"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: true
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: "ideogram-ai/ideogram-4-fp8"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "ideogram4"
        low_vram: true
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 768
        height: 768
        samples: []
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 30
        num_frames: 1
        fps: 1
meta:
  name: "[name]"
  version: "1.0"
The settings that matter
Network
linear: 16 / linear_alpha: 16 (rank 16 / alpha 16) is a solid, light style LoRA. For a more complex or higher-fidelity style, rank/alpha 32 is the common step up (at higher VRAM cost).
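To make the rank/alpha trade-off concrete, here is the arithmetic from the standard LoRA formulation: the low-rank update is scaled by alpha / rank, and each adapted linear layer gains rank × (in + out) trainable parameters. The 3072 × 3072 layer size used in the note below is a made-up example, not a known Ideogram 4 dimension.

```python
def lora_scale(rank, alpha):
    """Scaling factor applied to the low-rank update in the standard LoRA formulation."""
    return alpha / rank


def lora_params_per_linear(in_features, out_features, rank):
    """Trainable parameters LoRA adds to one linear layer.

    The down-projection A has shape (rank, in_features) and the
    up-projection B has shape (out_features, rank).
    """
    return rank * (in_features + out_features)
```

With rank equal to alpha, as in this config, lora_scale is 1.0 at both 16 and 32. For the hypothetical 3072 × 3072 layer, moving from rank 16 to rank 32 doubles the added parameters from 98,304 to 196,608, which is where the extra VRAM goes.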
Model — leave these alone on your first run
arch: "ideogram4" and name_or_path: "ideogram-ai/ideogram-4-fp8" are required for this model.
quantize: true + qtype: "qfloat8", plus the same for the text encoder (quantize_te / qtype_te), and low_vram: true. This trio is what lets Ideogram 4 train on 24 GB. Trying to “improve quality” by disabling quantization or low-VRAM mode is the #1 way to OOM.
layer_offloading: false here because 24 GB is enough. On 12–16 GB, set it to true and adjust the offloading percentages to fit.
Training
steps: 2500, lr: 0.0001, adamw8bit, noise_scheduler: flowmatch, timestep_type: linear. These are the safe Ideogram 4 baseline.
content_or_style: "balanced" is appropriate for a style LoRA that still needs to respect composition. Push toward style only if specific subjects/content are bleeding through too strongly.
ema_config.use_ema: true, ema_decay: 0.99. EMA smooths the weights and gives a more stable final LoRA. Worth keeping.
train_text_encoder: false keeps the TE frozen; you’re teaching a visual style, and freezing it protects Ideogram 4’s text-rendering ability.
cache_latents_to_disk: true + cache_text_embeddings: true cache the heavy preprocessing so subsequent runs start faster.
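For readers unfamiliar with EMA, the idea behind ema_decay: 0.99 fits in one line: after every optimizer step, the averaged weights move 1% of the way toward the current weights. The sketch below shows that textbook update on plain Python lists. It is not AI-Toolkit's actual implementation, which may add warmup or other details.

```python
def ema_update(ema_weights, new_weights, decay=0.99):
    """Blend the latest weights into the running average.

    A higher decay makes the average change more slowly and smoothly.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, new_weights)]
```

With a decay of 0.99, the average effectively reflects about the last 100 steps (1 / (1 - 0.99)), which is why a single noisy step has little effect on the saved LoRA.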
Saving
save_every: 250, save_format: "diffusers", keeping up to 12 checkpoints (a 2500-step run produces 10, so none are pruned). Saving every 250 steps lets you pick the best checkpoint later instead of trusting the final step, which matters because the sweet spot for style is often before the end.
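If you change steps, save_every, or max_step_saves_to_keep, check which checkpoints will still be on disk when the run ends. This sketch assumes a save lands on every multiple of save_every and that the oldest saves are pruned first. Both are assumptions about the trainer's behaviour, and the function name is hypothetical.

```python
def retained_checkpoints(steps, save_every, max_keep):
    """Steps whose checkpoints remain at the end of the run, assuming oldest-first pruning."""
    if max_keep <= 0:
        return []
    saved = list(range(save_every, steps + 1, save_every))
    return saved[-max_keep:]
```

With this guide's values (2500, 250, 12) all ten saves survive. With max_keep lowered to 4, only steps 1750 through 2500 would remain, and most of the earlier checkpoints you might want to fall back to would be gone.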
6. Run it
From the AI-Toolkit Web UI: create a new job, paste/import the config, and start. Or from CLI:
python run.py ideogram4_chiaroscuro.yaml
Expected behavior on a 4090 (24 GB):
VRAM: ~17 GB used
GPU load: ~100%, power ~380 W, temp ~60–70 °C
Time: roughly 1.5–2 hours per 1000 steps depending on resolution and caching, so budget about 3.75–5 hours for the full 2500 steps.
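The timing figure scales linearly with step count, so it is easy to budget a run of a different length. The sketch below does that arithmetic and also estimates how many times each image is seen during training. Both function names are hypothetical, and the 1.5–2 hours per 1000 steps is this guide's 4090 observation, not a guarantee for other setups.

```python
def estimate_hours(total_steps, hours_per_1000_low=1.5, hours_per_1000_high=2.0):
    """Return a (low, high) wall-clock estimate in hours, scaled from a per-1000-step rate."""
    scale = total_steps / 1000
    return hours_per_1000_low * scale, hours_per_1000_high * scale


def dataset_passes(total_steps, num_images, batch_size=1, num_repeats=1):
    """Approximate number of times each image is seen over the whole run."""
    return total_steps * batch_size / (num_images * num_repeats)
```

For 2500 steps this gives 3.75 to 5 hours. With 50 images at batch size 1, each image is seen about 50 times, a useful number to keep in mind when late checkpoints start to look overtrained.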
7. Sampling / testing
The config above has disable_sampling: true to save VRAM and time during the run. To watch progress instead, set disable_sampling: false and add validation prompts under sample.samples.
Validation prompts: don’t only test things that look like your dataset. A LoRA that only renders near-duplicates isn’t ready. Test:
close-up portrait, half-length, and full multi-figure compositions
subjects your dataset never contained (e.g. a modern object or a landscape) rendered in the style — this is the real test of generalization
a prompt with on-image text if text matters to your use case (this is where Ideogram 4 LoRAs commonly fail)
simple prompts and complex multi-element ones
3–5 seeds per prompt
multiple LoRA strengths (0.5 / 0.75 / 1.0)
the same prompts with the LoRA off, as a baseline
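The checklist above amounts to a grid of prompts × seeds × LoRA strengths, plus a LoRA-off baseline. Generating the grid in code keeps the comparison systematic. This is a minimal sketch: the function name and dict keys are hypothetical, and it only builds the list of runs. How you feed each entry to your inference tool is up to you.

```python
from itertools import product


def build_test_matrix(prompts, seeds, strengths, include_baseline=True):
    """Every prompt x seed x LoRA strength combination.

    Strength 0.0 is added as the LoRA-off baseline when include_baseline is True.
    """
    all_strengths = ([0.0] if include_baseline else []) + list(strengths)
    return [
        {"prompt": prompt, "seed": seed, "lora_strength": strength}
        for prompt, seed, strength in product(prompts, seeds, all_strengths)
    ]
```

Two prompts, three seeds, and strengths 0.5 / 0.75 / 1.0 plus the baseline give 24 renders, which is usually enough to see where a checkpoint generalizes and where it breaks.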
8. Common pitfalls
Garbled / mutating text → your captions ignored on-image text, or you overtrained. Transcribe visible text in the desc fields and pick an earlier checkpoint.
Layout drift → weak compositional deconstruction. Add/clean up the bbox elements.
Lighting collapses to flat/even → your lighting descriptions weren’t consistent across the dataset, or content_or_style is too far toward content. Tighten the captions and lean toward style.
Overtraining → the LoRA starts damaging Ideogram 4’s native strengths (clean text, spatial obedience, prompt following). If late checkpoints look “burned,” go back to a 1000–1750-step save.
OOM → you disabled quantization or low-VRAM, raised rank too high, or pushed resolution. Re-enable quantize / quantize_te / low_vram, or turn on layer_offloading.
LoRA only works on dataset look-alikes → not enough visual diversity, too many steps, or captions that don’t generalize.
9. TL;DR
Install AI-Toolkit (torch 2.9.1 / cu128).
Collect a diverse, consistent style dataset (public-domain sources keep you safe).
Caption each image as structured JSON — high-level + style block + compositional breakdown with bboxes, transcribing any visible text.
Use the config above: rank 16, ideogram4 arch, fp8 quantization, low VRAM, EMA, 2500 steps @ 1e-4.
Save every 250 steps and pick the best checkpoint, not the last one.
Validate on varied compositions and on subjects outside your dataset.
That structured-caption step is the difference between an Ideogram 4 LoRA that holds layout, lighting, and text, and one that falls apart. Don’t skip it.

Appreciate it. Should I tweak the config a little bit if I train on rtx pro 6000 96gb?