A certain company’s node has 8 × A100 GPUs, yet everyone is crammed onto a single 500 GB SATA drive, which holds even less than the GPUs’ combined VRAM. As a result, someone was constantly on edge while running experiments, watching Avail drop at unpredictable moments and dreading “No space left on device”.
Finally, one night, someone couldn’t stand it any longer and reached for Docker:
docker run -it --rm --privileged --pid=host quay.io/fedora/fedora:41 nsenter -t $$ -a
With root access, they discovered that the node actually had two unmounted 3 TB disks. This person, who ekes out a precarious living doing “alchemy” (training models), had been running LLMs on 500 GB all along, and the thought brought on a surge of frustration. Seeing that the disks were not even partitioned (the company’s LVM volumes use XFS), they partitioned and mounted one of them, setting a “small goal” of 1 TB.
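The post does not record the exact commands, so what follows is only a rough sketch of how partitioning and mounting one of the disks might look. The device name /dev/sdX, the mount point /mnt/scratch, and the DRY_RUN switch are all hypothetical placeholders, not details from the story. By default the script only prints what it would do, since running these commands for real against the wrong disk destroys data.

```shell
#!/usr/bin/env bash
# Sketch only: DISK, MOUNT_POINT and DRY_RUN are hypothetical placeholders.
set -euo pipefail

DISK="${DISK:-/dev/sdX}"                  # placeholder, replace with the real device
MOUNT_POINT="${MOUNT_POINT:-/mnt/scratch}" # placeholder mount point
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT           # confirm which disk is unused
run parted -s "$DISK" mklabel gpt                  # new GPT partition table
run parted -s "$DISK" mkpart primary xfs 1MiB 1TB  # one partition ending at 1 TB
run mkfs.xfs "${DISK}1"                            # first partition (NVMe devices use a p1 suffix)
run mkdir -p "$MOUNT_POINT"
run mount "${DISK}1" "$MOUNT_POINT"
run df -h "$MOUNT_POINT"                           # check Avail
```

Setting DRY_RUN=0 and a real DISK would execute the steps as root. The mount would not survive a reboot without an /etc/fstab entry, which this sketch leaves out.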
Seeing Avail at 928 GB, they felt much more at ease.
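As for why a 1 TB target shows up as roughly 928 rather than 1000, here is a quick back-of-the-envelope check. It assumes the partition was created as a decimal 1 TB and that the Avail figure came from df -h, which reports in powers of 1024; the post states neither.

```shell
# 1 TB in the decimal sense is 10^12 bytes; df -h counts in GiB (2^30 bytes).
tb_bytes=$(( 1000 ** 4 ))
gib_bytes=$(( 1024 ** 3 ))
echo "$(( tb_bytes / gib_bytes )) GiB"   # integer division, prints "931 GiB"
```

Under those assumptions, the remaining few GiB between 931 and 928 would be in line with filesystem metadata overhead.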