Your GPU Is 97% Utilized But Your Training Is 3x Slower Than Expected

Source: DEV Community
TL;DR: Your GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU isn't computing — it's waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Ingero traces the full host-to-GPU pipeline to show you exactly where the bubble is.

The Problem

You've got an H100 costing $3.50/hour. PyTorch Lightning reports 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware. You open nvidia-smi:

+------------------+----------+---------------+
| GPU Name         | GPU-Util | Memory-Usage  |
+==================+==========+===============+
| 0  H100 SXM      | 97%      | 62000MiB/80GB |
+------------------+----------+---------------+

97% utilization. The GPU must be working hard, right? Wrong. That number means "the GPU had at least one kernel running 97% of the time." It doesn't distinguish between a GPU saturated with useful compute and one idling between tiny kernels while it waits for the next batch.
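One quick way to check whether the bubble is in data loading is to time the two halves of each training iteration separately: the wait on the dataloader versus the actual step. Below is a minimal, framework-free sketch of that pattern; `timed_epoch`, `loader`, and `train_step` are hypothetical names standing in for a real DataLoader and training step, and in a real PyTorch loop you would also call `torch.cuda.synchronize()` before each timestamp, since CUDA kernels launch asynchronously.

```python
import time

def timed_epoch(loader, train_step):
    """Split each iteration's wall time into 'waiting for data'
    vs. 'computing', to expose dataloader-induced GPU bubbles.
    loader: any iterable of batches; train_step: callable(batch)."""
    data_time = 0.0
    compute_time = 0.0
    t0 = time.perf_counter()
    for batch in loader:            # time spent blocked here is data waiting
        t1 = time.perf_counter()
        data_time += t1 - t0
        train_step(batch)           # time spent here is (mostly) compute
        t0 = time.perf_counter()
        compute_time += t0 - t1
    return data_time, compute_time
```

If `data_time` dominates `compute_time`, the GPU is starved regardless of what nvidia-smi reports, and the fix lives on the host side (more workers, faster storage, prefetching), not in the model.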