NVIDIA's flagship GPUs, the GeForce RTX 5090 and the RTX PRO 6000, appear to have hit a new bug that leaves them unresponsive under virtualization.
NVIDIA's Flagship Blackwell GPUs Are Becoming 'Unresponsive' After Extensive VM Usage
CloudRift, a GPU cloud for developers, was the first to report crashing issues with NVIDIA's high-end GPUs. According to the company, the affected cards became completely unresponsive after a 'few days' of VM usage and could not be accessed again until the node was rebooted. The problem is claimed to be specific to the RTX 5090 and the RTX PRO 6000; models such as the RTX 4090, the Hopper H100, and the Blackwell-based B200 aren't affected for now.
The problem specifically occurs when the GPU is assi