Analysis Nvidia on Tuesday unveiled the Rubin CPX, a GPU designed specifically to accelerate extremely long-context AI workflows like those seen in code assistants such as Microsoft's GitHub Copilot, while simultaneously cutting back on pricey and power-hungry high-bandwidth memory (HBM).
The first indication that Nvidia might be moving in this direction came when CEO Jensen Huang unveiled Dynamo during his GTC keynote in spring. The framework brought mainstream attention to the idea of disaggregated inference.
As you may already be aware, inference on large language models (LLMs) can be broken into two phases: a compute-intensive prefill phase and a memory-bandwidth-bound decode phase.
Traditionally, both prefill and decode have taken place on the same GPU. Disaggregated inference splits the two phases across separate pools of hardware, so each can run on accelerators tuned to its particular bottleneck.
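To make the distinction concrete, here is a minimal toy sketch (not Nvidia's or Dynamo's implementation) of why the two phases stress different resources. The dimensions, the weight matrix W, and the simplified attention-score step are illustrative assumptions, not real model internals.

```python
import numpy as np

D = 1024           # hidden dimension, assumed for illustration
PROMPT_LEN = 4096  # a long-context prompt, the kind of workload Rubin CPX targets

# Stand-in for model weights (hypothetical, single projection for brevity)
W = np.random.randn(D, D).astype(np.float32)

def prefill(prompt_tokens: np.ndarray) -> np.ndarray:
    # Prefill: all prompt tokens are processed together in one large
    # matrix multiply. Arithmetic intensity is high, so the GPU's
    # compute units, not memory bandwidth, are the bottleneck.
    return prompt_tokens @ W  # (PROMPT_LEN, D) @ (D, D)

def decode_step(kv_cache: np.ndarray, new_token: np.ndarray) -> np.ndarray:
    # Decode: one token at a time, but the entire cached context must be
    # streamed from memory on every step, so throughput is limited by
    # memory bandwidth rather than raw FLOPS.
    return kv_cache @ new_token  # reads the whole cache per generated token

prompt = np.random.randn(PROMPT_LEN, D).astype(np.float32)
cache = prefill(prompt)                              # compute-heavy phase
scores = decode_step(cache, np.random.randn(D).astype(np.float32))  # bandwidth-heavy phase
```

The asymmetry is the point: run prefill on compute-rich parts and decode on bandwidth-rich parts, and neither sits idle waiting on the resource it doesn't need.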