PyTorch 2.12 Release Blog
We are excited to announce the release of PyTorch® 2.12 (release notes)! The PyTorch 2.12 release features the following changes:

- Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection
- New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends
- torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models
- Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation
- torch.cond control flow can now be captured and replayed inside CUDA Graphs
- ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining

This release is composed of 2,926 commits from 457 contributors since PyTorch 2.11. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.12. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Have questions? Join us on Wednesday, May 20 at 10 am PST for a live Q&A with panelists Joe Spisak, Andrey Talman, and Alban Desmaison, and moderator Chris Gottbrath. We will provide a brief overview of the release and answer your questions live. Register now.

Throughout the 2.x series, PyTorch has been evolving from a research-first framework into a unified, hardware-agnostic platform for production training and inference at scale. PyTorch 2.10 laid the groundwork with cross-backend performance primitives and the formal deprecation of TorchScript. PyTorch 2.11 expanded that foundation with differentiable collectives for distributed training, FlashAttention-4 on next-generation GPUs, and broader export coverage. PyTorch 2.12 continues this direction: a new device-agnostic torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends; batched eigenvalue decomposition is up to 100x faster; and torch.export now supports Microscaling quantization formats for deploying aggressively compressed models. Across these releases, PyTorch is becoming faster across backends and usable on a wider variety of platforms as it continues to enable AI innovation.

Performance Features

Up to 100x faster batched eigendecomposition on CUDA (linalg.eigh)

The backend selection for linalg.eigh on CUDA has been overhauled. The legacy MAGMA backend was deprecated in favor of cuSolver (PR #174619 by Grayson Derossi), and the cuSolver dispatch heuristics were updated to use syevj_batched unconditionally (PR #175403 by Johannes Z). For batched symmetric/Hermitian eigenvalue problems, this yields up to 100x speedups over the previous release, resolving longstanding performance gaps with CuPy.

Workloads that previously took minutes (because PyTorch was inefficiently dispatching each matrix solve individually) now run in seconds by using cuSolver's syevj_batched kernel, which is designed to process many small and medium matrices as a single GPU operation. These gains are especially relevant for scientific computing and machine learning workloads that rely on eigendecompositions of batched matrices (see the usage sketch at the end of this section).

Fused Adagrad optimizer

The Adagrad optimizer now supports fused=True, performing the entire optimizer step in a single CUDA kernel rather than launching separate kernels for each operation. This reduces kernel launch…
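As a quick illustration of the fused Adagrad path, here is a minimal sketch. The model, batch size, and learning rate are illustrative assumptions rather than benchmark settings, and the fused=True flag is assumed to follow the same pattern as the existing fused option on Adam, AdamW, and SGD:

```python
import torch

# Illustrative toy model and batch (sizes are arbitrary, not from the release notes).
model = torch.nn.Linear(512, 512, device="cuda")
inputs = torch.randn(64, 512, device="cuda")

# fused=True performs the whole Adagrad update in a single CUDA kernel,
# mirroring the fused option already available on Adam, AdamW, and SGD.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

loss = model(inputs).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

No other code changes are needed; the fused path is a drop-in replacement for the default per-operation implementation.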
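And here is the usage sketch for the batched linalg.eigh speedup described above. The batch and matrix sizes are illustrative assumptions, chosen to sit in the many-small-matrices regime that the batched cuSolver kernel targets:

```python
import torch

# Illustrative sizes: a large batch of small symmetric matrices,
# the regime the batched cuSolver kernel (syevj_batched) is designed for.
batch, n = 4096, 32

a = torch.randn(batch, n, n, device="cuda")
a = a + a.mT  # symmetrize so eigh's symmetric/Hermitian assumption holds

# A single call decomposes the entire batch in one GPU operation,
# rather than dispatching each matrix solve individually.
eigenvalues, eigenvectors = torch.linalg.eigh(a)
```

The call itself is unchanged from previous releases; the speedup comes entirely from the new backend dispatch.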