Distributed caching for generative AI: Optimizing the LLM data pipeline on the cloud

Fu Zhengjia (傅正佳)

Chinese Session 2023-08-18 13:30 GMT+8  #ai

Large language model (LLM) training is a resource-intensive process that demands large amounts of storage, CPU, and GPU capacity, along with frequent I/O over vast numbers of small files. As LLMs grow more complex, the need for high-performance, scalable data processing increases, especially for distributed training on the cloud. Traditional data platform architectures struggle to sustain the required I/O throughput, leaving GPUs underutilized and resources poorly used. Against this backdrop, Alluxio’s latest distributed cache architecture is designed to optimize the LLM data pipeline on the cloud.

Alluxio and Spark are sister projects that originated in the AMPLab at the University of California, Berkeley. Combining Spark with Alluxio provides high-performance, scalable data processing and analytics for AI scenarios: it accelerates large-scale data processing and machine learning tasks, offers fast data access and sharing, and optimizes data pipelines while preserving data consistency. This lets AI workloads process and analyze large datasets more efficiently, speeding up model training, inference, and decision making.
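As one way to picture this combination, the sketch below shows a Spark job that reads a training corpus through Alluxio instead of going directly to the underlying object store, so repeated scans can be served from the distributed cache. The master host name, port, and dataset paths are placeholder assumptions, and the Alluxio client jar is assumed to be on Spark's classpath; this is a minimal illustration, not the reference setup presented in the talk.

```python
# Minimal sketch: preprocessing an LLM training corpus with PySpark,
# reading and writing through an alluxio:// path so that hot data is
# served from Alluxio's distributed cache. Host, port, and paths below
# are placeholders; 19998 is Alluxio's default master RPC port.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("llm-data-prep-via-alluxio")
    .getOrCreate()
)

# Raw text files cached by Alluxio in front of cloud object storage.
corpus = spark.read.text("alluxio://alluxio-master:19998/datasets/corpus/")

# A trivial cleaning step, then write the result back through Alluxio
# so downstream training jobs can reuse the cached output.
cleaned = corpus.filter(corpus.value != "")
cleaned.write.mode("overwrite").parquet(
    "alluxio://alluxio-master:19998/datasets/corpus_cleaned/"
)
```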

  1. Design and implementation of the distributed cache system, and how it addresses the I/O challenges of LLM training and inference
  2. The distinctive data access patterns of LLM workloads, and best practices for optimizing data pipelines with distributed caching on the cloud
  3. Using Alluxio+Spark to improve efficiency and build a modern data platform
  4. Practical cases: Alluxio in production at Microsoft, Tencent, and Zhihu
  5. How to build scalable, efficient, and powerful data infrastructure for LLM training and inference (a minimal sketch follows this list)
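To make the training-side picture concrete, here is a hedged sketch of how a training job might consume cached data through an Alluxio POSIX (FUSE) mount point. The mount path, shard layout, and dataset class are illustrative assumptions rather than anything prescribed in the session.

```python
# Hedged sketch: an LLM training job reading pre-tokenized shards through
# an Alluxio FUSE mount, so frequently accessed shards come from the
# distributed cache instead of remote storage. The mount path and file
# layout are hypothetical; shards are assumed to be fixed-length token
# tensors saved with torch.save so the default collate can batch them.
import os
import torch
from torch.utils.data import Dataset, DataLoader

ALLUXIO_FUSE_MOUNT = "/mnt/alluxio-fuse/datasets/tokenized"  # hypothetical path


class TokenizedShardDataset(Dataset):
    """Loads .pt token shards from a directory exposed via Alluxio FUSE."""

    def __init__(self, root: str):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".pt")
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Reads go through the local FUSE mount; Alluxio serves cache hits
        # locally and fetches misses from the underlying store.
        return torch.load(self.paths[idx])


loader = DataLoader(
    TokenizedShardDataset(ALLUXIO_FUSE_MOUNT),
    batch_size=8,
    num_workers=4,
)
```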

Speaker:

Fu Zhengjia, Open Source Evangelist at Alluxio. He received a bachelor’s degree in Electronics from Shanghai Jiao Tong University and a PhD in Information Engineering from the Chinese University of Hong Kong. After graduation, he joined the Advanced Digital Sciences Center in Singapore (the University of Illinois’ research institute in Singapore) as a researcher, publishing several papers at top international conferences in computer networking and distributed systems. Before joining Alluxio, he was Director of Machine Learning R&D at Bigo Technology, a Singapore-based technology company.