ByteDance Spark's Practice in Supporting Ten-Thousand-GPU Model Inference

Liu Chang, Zhang Yongqiang

Chinese Session 2023-08-18 14:00 GMT+8  #ai

As cloud native technology has matured, Kubernetes, with its strong ecosystem and influence, has drawn more and more types of workloads, including big data and AI, to migrate onto it. Inside ByteDance, the team has been exploring the migration of Spark from Hadoop to Kubernetes so that jobs run in a cloud-native way. At the same time, the Search business has a large number of offline batch processing tasks with heavy demand for GPUs. As task volume grew, a series of problems emerged: a large gap remained in GPU computing power supply (card-hours), the scale of a single data center's resource pool could not keep up with the growth in the computing volume of business tasks, computing power in the online resource pool was being wasted, and there was no unified platform entry point. Spark and AML (Applied Machine Learning) worked together, through GPU sharing technology, co-located (mixed-deployment) GPU scheduling, Spark engine enhancements, and improvements to the platform and surrounding ecosystem, to support offline model inference on co-located GPUs at the scale of tens of thousands of cards. This supported model-scoring data cleaning of 8 billion multi-modal training samples on 7k co-located GPU cards in 7.5 hours, with significant improvements in resource efficiency and stability.

Speakers:


Liu Chang: ByteDance, Infrastructure Engineer. Joined ByteDance in 2020 and has worked on the Infrastructure batch computing team, mainly responsible for research and development in cloud-native Spark, Spark on Kubernetes, and related areas.


Zhang Yongqiang: ByteDance, Machine Learning Systems Engineer. Joined ByteDance in 2022 as part of the AML Machine Learning Systems team, where he has been involved in building a large-scale machine learning platform.