ByteDance's Batch-Stream Unified Deep Learning Training Practice

Mao Hongyue

Chinese Session 2023-08-18 16:15 GMT+8  #ai

As the company's business has grown, algorithm complexity has kept increasing, and more and more algorithm models are exploring real-time training on top of offline updates to improve model quality. To enable flexible scheduling and free switching between complex offline and real-time training, and to schedule offline computing resources over a wider pool, machine learning model training at ByteDance has gradually moved toward batch-stream unification. Batch training data at ByteDance is based mainly on Apache Iceberg and Apache HDFS, while streaming data comes mainly from Apache Kafka. Against this background, we built and open-sourced a batch-stream unified machine learning training framework that combines flexible scheduling of massive, multi-stage, multi-source data with efficient training. It supports more than 10,000 jobs per day at ByteDance, CPU tasks totaling 5 million cores, GPU tasks using more than 10,000 cards, and an average of 500 TB of training data per task.

This talk covers the architectural evolution of ByteDance's machine learning training scheduling framework, batch-stream unified practice, and heterogeneous elastic training, and shares practical experience from the MFTC (batch-stream collaborative training) scenario with multi-stage, multi-source hybrid orchestration, global shuffle of streaming samples, a full-link native stack, and training data insight. The new scheduling architecture makes more efficient use of the machine resource pool, unifies the resource scheduling entry point, supports more flexible multi-role scheduling and elastic scaling, and improves resource utilization. The batch-stream hybrid training capability sustains higher data consumption throughput, allows flexible mixing and orchestration of offline and real-time data, and provides data priority guarantees and data visualization.
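To make the multi-stage, multi-source orchestration idea concrete, here is a minimal Python sketch of a training data plan that backfills historical samples from an Iceberg table before switching to a real-time Kafka topic. The `DataStage` and `TrainingDataPlan` names, fields, and priority rule are illustrative assumptions, not the actual Primus API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Hypothetical illustration of a multi-stage training data plan:
# historical samples are backfilled from Iceberg in batch, then the
# job switches to consuming real-time samples from Kafka. These class
# names are illustrative only and are not the Primus API.

@dataclass
class DataStage:
    name: str
    source_type: str          # "iceberg" or "kafka"
    location: str             # table name or topic
    priority: int = 0         # higher priority is consumed first

@dataclass
class TrainingDataPlan:
    stages: List[DataStage] = field(default_factory=list)

    def ordered_stages(self) -> Iterator[DataStage]:
        # Batch backfill stages run before the streaming stage,
        # honoring the per-stage priority.
        return iter(sorted(self.stages, key=lambda s: -s.priority))

plan = TrainingDataPlan(stages=[
    DataStage("backfill_7d", "iceberg", "warehouse.training_samples", priority=10),
    DataStage("realtime", "kafka", "training-sample-topic", priority=0),
])

for stage in plan.ordered_stages():
    print(f"stage={stage.name} source={stage.source_type} location={stage.location}")
```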

Speech outline:

  • Current status and background
    • Overview of the batch-stream training architecture: why Iceberg and Kafka were chosen
    • Evolution of ByteDance's scheduling framework
    • Introduction to the Primus open-source project
  • Batch-stream training practice
    • Business background, problems, and challenges of batch-stream integration
    • ByteDance's practical experience
      • Multi-stage, multi-source orchestration of Iceberg and Kafka
      • DataLoading: technology evolution and performance optimization
      • Scheduling: All2All shuffle and batch/stream priorities (a shuffle sketch follows this outline)
      • Insight: training data insight
  • Primus Flow practice: combining with Spark to support training with preprocessing (a PySpark sketch follows this outline)
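For the All2All shuffle item above, the following sketch shows the basic idea of globally shuffling streaming samples: each reader scatters its samples across all trainer workers so that no trainer consumes only one Kafka partition's data. The routing function, worker count, and in-memory data structures are assumptions for illustration; a real implementation would route samples over the network rather than build local lists.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List

# Hedged sketch of an All2All-style global shuffle for streaming samples:
# each reader scatters its samples across all trainer workers so that no
# worker sees only samples from a single Kafka partition.

def all2all_shuffle(readers: Dict[str, Iterable[dict]],
                    num_workers: int,
                    seed: int = 0) -> Dict[int, List[dict]]:
    rng = random.Random(seed)
    per_worker: Dict[int, List[dict]] = defaultdict(list)
    for partition, samples in readers.items():
        for sample in samples:
            # Route each sample to a randomly chosen worker; in a real
            # system this would be a network send, not an in-memory append.
            worker_id = rng.randrange(num_workers)
            per_worker[worker_id].append(sample)
    return per_worker

# Toy usage: two Kafka partitions scattered across four trainers.
readers = {
    "topic-p0": [{"uid": i, "src": "p0"} for i in range(4)],
    "topic-p1": [{"uid": i, "src": "p1"} for i in range(4)],
}
print({w: len(s) for w, s in all2all_shuffle(readers, num_workers=4).items()})
```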
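For the Primus Flow item, below is a minimal PySpark sketch of the preprocess-then-train pattern: raw samples are read from an Iceberg table, lightly cleaned in Spark, and handed to a per-partition training consumer. The table name, derived feature, and `consume_batch` hook are hypothetical, and reading the `iceberg` format assumes the Iceberg Spark runtime is configured.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of the "preprocess with Spark, then train" pattern.
# Table names and the consume_batch() hook are illustrative assumptions,
# not part of Primus or any specific ByteDance pipeline.

spark = (SparkSession.builder
         .appName("preprocess-then-train-sketch")
         .getOrCreate())

# Read raw samples from an (assumed) Iceberg table.
raw = spark.read.format("iceberg").load("warehouse.raw_samples")

# Lightweight feature cleaning in Spark before training.
features = (raw
            .filter(F.col("label").isNotNull())
            .withColumn("ctr_bucket", F.floor(F.col("ctr") * 10)))

def consume_batch(rows):
    # Placeholder for pushing a partition of preprocessed rows
    # into the training data loader.
    for row in rows:
        pass

features.foreachPartition(consume_batch)
spark.stop()
```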

Speakers:


Mao Hongyue: Infrastructure engineer at ByteDance. Joined ByteDance in 2022 and works on machine learning training, with primary responsibility for the large-scale, cloud-native, batch-stream unified AI model training engine, which supports businesses including Douyin video recommendation, Toutiao recommendation, ChuanShanJia advertising, and Qianchuan graphic advertising.