# Local Image Generation with FLUX.2 + Vulkan

**No cloud. No discrete GPU. No ComfyUI.**

This workflow runs on a **Ryzen AI 9 HX 370 mini PC** (AMD Radeon 890M integrated GPU, 121GB unified RAM) using `stable-diffusion.cpp` with Vulkan acceleration. Generation time: ~1–1.5 min per image at 8–12 steps.

---

## Hardware

| Component | Spec |
|-----------|------|
| CPU/APU | AMD Ryzen AI 9 HX 370 |
| iGPU | AMD Radeon 890M (gfx1150) |
| RAM | 121GB unified (shared with iGPU) |
| Storage | 1.9TB NVMe |
| OS | Ubuntu 24.04 |

The iGPU shares system RAM — no VRAM limit to worry about. All models fit comfortably.
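
A quick way to see how much memory the driver will actually let the iGPU address (a sketch for the `amdgpu` driver; the card index may differ on your system):

```bash
# GTT = system RAM the iGPU can map; VRAM = the small BIOS carve-out
awk '{printf "GTT:  %.1f GiB\n", $1/2^30}' /sys/class/drm/card0/device/mem_info_gtt_total
awk '{printf "VRAM: %.1f GiB\n", $1/2^30}' /sys/class/drm/card0/device/mem_info_vram_total
```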

---

## Models

Three components are required. All are open weights.

### Diffusion Model
- **FLUX.2-klein-9B Q4_0** (~5.3GB)
- Source: [leejet/FLUX.2-klein-9B-GGUF](https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF)
- File: `flux-2-klein-9b-Q4_0.gguf`

### Text Encoder (LLM)
- **Qwen3-8B Q4_K_M** (~4.7GB)
- Source: [unsloth/Qwen3-8B-GGUF](https://huggingface.co/unsloth/Qwen3-8B-GGUF)
- File: `Qwen3-8B-Q4_K_M.gguf`

### VAE
- **flux2-vae.safetensors** (~321MB)
- Source: [Comfy-Org/flux2-klein-4B](https://huggingface.co/Comfy-Org/flux2-klein-4B) (split_files/vae/)

A 4B variant also works (`FLUX.2-klein-4B` + `Qwen3-4B`) — faster, slightly lower quality.
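
To fetch everything, one option is `huggingface-cli` from the `huggingface_hub` package (a sketch; the target directory matches the layout at the end of this page):

```bash
pip install -U huggingface_hub
huggingface-cli download leejet/FLUX.2-klein-9B-GGUF flux-2-klein-9b-Q4_0.gguf \
  --local-dir ~/models/flux2-klein-9b
huggingface-cli download unsloth/Qwen3-8B-GGUF Qwen3-8B-Q4_K_M.gguf \
  --local-dir ~/models/flux2-klein-9b
# the VAE sits under split_files/vae/ in its repo; move it up after download
huggingface-cli download Comfy-Org/flux2-klein-4B split_files/vae/flux2-vae.safetensors \
  --local-dir ~/models/flux2-klein-9b
mv ~/models/flux2-klein-9b/split_files/vae/flux2-vae.safetensors ~/models/flux2-klein-9b/
```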

---

## Build: stable-diffusion.cpp with Vulkan

### Prerequisites

```bash
# vulkan-tools provides vulkaninfo for the check below
sudo apt install cmake build-essential libvulkan-dev glslc ninja-build vulkan-tools
```

Verify Vulkan sees your GPU:

```bash
vulkaninfo | grep deviceName
# Should show: AMD Radeon Graphics (RADV GFX1150) or similar
```

### Clone and Build

```bash
git clone --depth 1 https://github.com/leejet/stable-diffusion.cpp ~/code/stable-diffusion.cpp
cd ~/code/stable-diffusion.cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DSD_VULKAN=ON
make -j$(nproc) sd-cli sd-server
```

Binaries land at `build/bin/sd-cli` and `build/bin/sd-server`.
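
A quick check that the Vulkan backend was actually linked in:

```bash
ldd ~/code/stable-diffusion.cpp/build/bin/sd-cli | grep -i vulkan
# expect something like: libvulkan.so.1 => /lib/x86_64-linux-gnu/libvulkan.so.1
```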

> **Note:** The ROCm/HIP build path has linker issues on some AMD iGPUs (PIE relocation errors). Vulkan is the reliable path for integrated AMD graphics.

---

## Running: Server Mode (recommended)

`sd-server` loads all three models once at startup and keeps them hot in memory, so subsequent generations go straight to compute with no per-request cold start. It exposes an **A1111-compatible HTTP API**.

### systemd service

```ini
[Unit]
Description=sd-server - FLUX.2 image generation server
After=network.target

[Service]
Type=simple
User=youruser
ExecStart=/path/to/stable-diffusion.cpp/build/bin/sd-server \
  --listen-ip 0.0.0.0 \
  --listen-port 8189 \
  --diffusion-model /path/to/models/flux2-klein-9b/flux-2-klein-9b-Q4_0.gguf \
  --vae /path/to/models/flux2-klein-9b/flux2-vae.safetensors \
  --llm /path/to/models/flux2-klein-9b/Qwen3-8B-Q4_K_M.gguf \
  --diffusion-fa \
  -v
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
# save the unit as /etc/systemd/system/sd-server.service, then:
sudo systemctl daemon-reload
sudo systemctl enable --now sd-server
```

Model load on startup takes ~3s. After that, the server is ready.
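
If it doesn't come up, the journal shows model-load progress and any Vulkan errors:

```bash
systemctl status sd-server
journalctl -u sd-server -f
```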

### API usage

**Generate an image** (`POST /sdapi/v1/txt2img`):

```bash
curl -s -X POST http://localhost:8189/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "your prompt here",
    "cfg_scale": 1.0,
    "steps": 12,
    "seed": 42,
    "width": 768,
    "height": 512
  }' | python3 -c "
import sys, json, base64
data = json.load(sys.stdin)
open('output.png', 'wb').write(base64.b64decode(data['images'][0]))
print('Saved output.png')
"
```

Response is JSON with `images` as a list of base64-encoded PNGs.
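
For day-to-day use it's worth wrapping the curl-plus-Python dance in a script. A minimal sketch (the name `gen.sh` is made up here; assumes `jq` is installed):

```bash
#!/usr/bin/env bash
# gen.sh <prompt> [output.png] - one image from the local sd-server
set -euo pipefail
PROMPT="$1"
OUT="${2:-output.png}"
curl -s -X POST http://localhost:8189/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" \
        '{prompt: $p, cfg_scale: 1.0, steps: 12, width: 768, height: 512}')" \
  | jq -r '.images[0]' | base64 -d > "$OUT"
echo "Saved $OUT"
```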

**Reference image / kontext editing** (`POST /sdapi/v1/img2img`):

```bash
curl -s -X POST http://localhost:8189/sdapi/v1/img2img \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "photorealistic live action version, same composition, cinematic lighting",
    "init_images": ["'$(base64 -w0 input.png)'"],
    "cfg_scale": 1.0,
    "steps": 12,
    "seed": 42,
    "width": 768,
    "height": 512
  }' | python3 -c "
import sys, json, base64
data = json.load(sys.stdin)
open('output.png', 'wb').write(base64.b64decode(data['images'][0]))
"
```

### Key parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| `cfg_scale` | `1.0` | FLUX.2-klein is distilled — use 1.0, not 7.0 |
| `steps` | `8–12` | 4 works, 12 is the sweet spot, 20 gives diminishing returns |
| `seed` | any int | Fixes output for reproducibility; omit for random |
| `width`/`height` | 512–768 | Portrait: 512×768, landscape: 768×512 |

---

## Running: CLI Mode (one-shot)

For scripting or single generations without a running server:

```bash
~/code/stable-diffusion.cpp/build/bin/sd-cli \
  --diffusion-model ~/models/flux2-klein-9b/flux-2-klein-9b-Q4_0.gguf \
  --vae ~/models/flux2-klein-9b/flux2-vae.safetensors \
  --llm ~/models/flux2-klein-9b/Qwen3-8B-Q4_K_M.gguf \
  --prompt "your prompt here" \
  --cfg-scale 1.0 \
  --steps 12 \
  --seed 42 \
  --width 768 --height 512 \
  --diffusion-fa \
  --output output.png
```

Note: CLI mode reloads models from disk on every invocation (~3s overhead). For iterative work, the server is strongly preferred.

### Reference image (CLI)

Pass a reference image with `-r`, appended to the invocation above:

```bash
  -r input.png \
  --prompt "photorealistic live action version, same composition, cinematic lighting"
```

This works best for editing photos. Style transfer from cartoons or animation also works, but the reference image's composition dominates and the text prompt carries less weight.

---

## Prompt Strategy

FLUX.2-klein responds well to detailed, descriptive prompts. Key lessons:

**Be explicit about physical attributes** — the model won't infer ethnicity, hair color, or eye color. State them directly.

```
# Weak
"a woman with dark hair"

# Strong  
"a tall athletic African-American woman with long glossy black hair, pale green eyes"
```

**Seed hunting for hands** — finger geometry is highly seed-dependent. The face locks in early; hands vary. If you get a great face with bad hands, just reseed.
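
A throwaway sweep makes reseeding painless (same server as above; assumes `jq`):

```bash
for seed in $(seq 1 8); do
  curl -s -X POST http://localhost:8189/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"your prompt here\", \"cfg_scale\": 1.0, \"steps\": 8, \"seed\": $seed, \"width\": 768, \"height\": 512}" \
    | jq -r '.images[0]' | base64 -d > "seed-$seed.png"
done
```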

**Character + environment in one prompt** — describe lighting, background, and composition alongside characters. The model handles all of it simultaneously.

**Cinematic language works** — phrases like "3-point lighting," "golden ratio composition," "natural depth of field," "cinematic color grade" genuinely influence the output.

---

## Performance

| Steps | Time (768×512, 9B model) |
|-------|--------------------------|
| 4 | ~35s |
| 8 | ~60s |
| 12 | ~84s |
| 20 | ~140s |

The 4-step distilled mode is usable for fast iteration, 12 steps is the quality sweet spot, and going past 20 yields minimal improvement.
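
To reproduce these numbers on your own hardware, a rough timing loop against the running server (a sketch; discards the images):

```bash
for steps in 4 8 12 20; do
  start=$(date +%s)
  curl -s -o /dev/null -X POST http://localhost:8189/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"benchmark\", \"cfg_scale\": 1.0, \"steps\": $steps, \"seed\": 42, \"width\": 768, \"height\": 512}"
  echo "$steps steps: $(( $(date +%s) - start ))s"
done
```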

---

## Directory Layout

```
~/models/
├── flux2-klein-9b/
│   ├── flux-2-klein-9b-Q4_0.gguf     # 5.3GB
│   ├── Qwen3-8B-Q4_K_M.gguf          # 4.7GB
│   └── flux2-vae.safetensors          # 321MB
└── flux2-klein/
    ├── flux-2-klein-4b-Q4_0.gguf     # 2.3GB
    ├── Qwen3-4B-Q4_K_M.gguf          # 2.4GB
    └── flux2-vae.safetensors          # 321MB (or symlink)
```
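
The VAE is identical for both variants, so the 4B directory can symlink it:

```bash
ln -s ~/models/flux2-klein-9b/flux2-vae.safetensors ~/models/flux2-klein/flux2-vae.safetensors
```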

---

## vs. Cloud / Discrete GPU

This setup intentionally avoids:
- Cloud APIs (no per-image cost, no data leaving the machine)
- ComfyUI (no dependency overhead, scriptable CLI)
- NVIDIA/CUDA (pure AMD + Vulkan open stack)

Quality is competitive. Speed is limited by the iGPU (~1 min/image) vs a discrete card (~5–10s), but for non-realtime use it's entirely practical.
