gekro
GitHub LinkedIn
AI

LoRA / QLoRA Memory Calculator

Pick a base model + training mode → see peak fine-tuning VRAM and which GPUs can fit the job.

Base model

Training mode

LoRA: base model frozen, train small adapter matrices. Saves optimizer + gradient memory.

LoRA settings

Rank controls adapter expressiveness. 8-32 standard; 64+ for harder tasks. "All linear" = matches PEFT default.

Batch & sequence

Peak VRAM

— GB

Model weights
Trainable params
Optimizer states
Gradients
Activations
Overhead (10%)

GPU fit

Estimates are approximate (±15%). Real-world VRAM also depends on framework (HF Transformers, Axolotl, Unsloth all differ), kernel choices, and DeepSpeed/FSDP sharding strategy.

As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.

© 2026 Rohit Burani · MIT · Built at gekro.com · View source ↗

Guide

What It Does

Pre-flight memory check before you launch a fine-tune. Pick:

  • Base model (or input custom dimensions)
  • Training mode: Full / LoRA / QLoRA
  • LoRA rank + target modules
  • Batch size × sequence length
  • Gradient checkpointing on/off
  • Flash Attention 2 on/off

Get back: total peak VRAM, per-component breakdown (weights / optimizer / gradients / activations), and a GPU compatibility table.

How The Math Works

ComponentFormula
Model weightsparams × bytes_per_param — bf16/fp16 = 2 bytes, NF4 (QLoRA) = ~0.5
LoRA adapter weights2 × rank × hidden × modules × layers × 2 bytes (bf16)
Optimizer (AdamW)8 bytes × trainable_params (m + v in fp32)
Gradients4 bytes × trainable_params (fp32)
Activationsbatch × seq × hidden × layers × act_factor
act_factor ≈ 34 → ×0.4 with checkpointing → ×0.5 with FA2
Framework overhead+10% on subtotal

LoRA freezes the base model — only adapter params (typically 0.1-2% of total) are trainable, so optimizer and gradient memory shrink ~100×. QLoRA additionally quantises the frozen base to NF4, shrinking weight memory ~4×.

Worked Example

Llama 3.3 70B with QLoRA, rank 16, batch 4 × seq 2048, gradient checkpointing + FA2:

  • Weights: 70B × 0.5 bytes ≈ 35 GB
  • Trainable params: ~167M (0.2% of total)
  • Optimizer: 167M × 8 ≈ 1.3 GB
  • Gradients: 167M × 4 ≈ 0.7 GB
  • Activations (with FA2 + checkpoint): ~4 GB
  • Peak ≈ 41-45 GB — fits on A100 80GB / L40S 48GB, tight on RTX 5090 32GB.

By contrast, full fine-tune of the same model:

  • Weights: 140 GB
  • Optimizer: 564 GB
  • Gradients: 282 GB
  • Peak ≈ 1+ TB — needs DeepSpeed ZeRO-3 sharding across 8+ A100s.

What’s NOT Modelled

  • DeepSpeed / FSDP sharding — multi-GPU training reduces per-GPU memory
  • Unsloth optimizations — Unsloth can be 30-50% lower than these estimates
  • Inference VRAM — see GPU VRAM Calculator
  • Mixed-precision edge cases — fp8 training, MX-FP4 etc.

Estimates carry ±15% error. Always reserve 10-20% headroom.

For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.