PyTorch Blog·Infra·8d ago·by Ruilin Chen, Yuzhen Huang, Hang Qi, Mingming Ding, Damian Reeves, Boris Sarana, Kevin Tang, Satendra Gera, Gagan Jain, Sahil Shah, Oguz Ulgen, Mayank Garg, Meet Vadakkanchery, James March, Sophie Lin, Wei Sun·~1 min read

Optimizing Effective Training Time for Meta’s Internal Recommendation/Ranking Workloads

Motivation and Introduction

Across the industry, teams training and serving large AI models face aggressive ROI targets under tight compute capacity. As workloads scale, improving infrastructure effectiveness gets harder because end-to-end runtime increasingly includes overheads beyond “real training” (initialization, orchestration, checkpointing, retries, failures, and recovery). Meta uses Effective Training Time (ETT%) to quantify efficiency, defining it as the percentage of total end-to-end (E2E) wall time spent on productive training. The metric points directly to where time is wasted, which makes it easier to prioritize efficiency improvements. Although this work stream is grounded in Meta’s production experience using PyTorch for model training, we aim to share broadly useful lessons: some improvements have been implemented in open source, e.g., TorchRec sharding plan improvements and PyTorch 2 (PT2) compilation optimizations that reduce compile time and recompilation, while others (like checkpointing and model publishing) are more…
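
To make the ETT% definition concrete, here is a minimal sketch that computes it from a per-phase wall-time breakdown. The function name, the phase names, and all the numbers are hypothetical and chosen only for illustration. They are not Meta's actual accounting, which the excerpt does not describe.

```python
def effective_training_time_pct(productive_seconds: float,
                                e2e_wall_seconds: float) -> float:
    """Return ETT% = productive training time / total E2E wall time * 100.

    Hypothetical helper: the excerpt only defines ETT% in words.
    """
    if e2e_wall_seconds <= 0:
        raise ValueError("end-to-end wall time must be positive")
    if not 0 <= productive_seconds <= e2e_wall_seconds:
        raise ValueError("productive time must lie within the E2E wall time")
    return 100.0 * productive_seconds / e2e_wall_seconds


# Made-up breakdown of one job's wall time, in seconds (illustrative only).
phases = {
    "initialization": 600,
    "compilation": 900,
    "training": 7200,       # the only phase counted as productive here
    "checkpointing": 300,
}

e2e = sum(phases.values())
ett = effective_training_time_pct(phases["training"], e2e)
print(f"ETT% = {ett:.1f}")

# Rank the non-productive phases by their share of wall time, largest first,
# to see which overhead would be worth attacking first.
overheads = sorted(
    ((name, 100.0 * secs / e2e) for name, secs in phases.items()
     if name != "training"),
    key=lambda item: item[1],
    reverse=True,
)
for name, share in overheads:
    print(f"{name}: {share:.1f}% of E2E wall time")
```

Ranking the overhead phases by their share of wall time is one simple way to read the metric as a prioritization tool, as the paragraph above suggests.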

#inference #training
