Big Data Format at Uber Data Infra

Xinli Shang, Pavi Subenderan

English Session 2021-08-08 15:30 GMT+8 (Room: A) #bigdata

Big data formats play an important role in data analytics, affecting storage efficiency, query performance, and security. In this talk, we will present our work at Uber on the big data file format Apache Parquet: reducing storage size by ~20% to save millions of dollars, and encrypting columns to provide fine-grained access control. We will show the work we have contributed to open source Parquet, including ZSTD compression, a high-throughput column pruner, a column size estimator, column encryption management, and more. On the query performance side, we will demonstrate our work related to Column Index, a new feature in Parquet 1.11.0.
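As a rough illustration of the storage-efficiency angle only (this is not Uber's pipeline; the file names, column names, and data below are hypothetical), the sketch writes the same table to Parquet with Snappy and with ZSTD and compares file sizes. Actual savings depend heavily on the dataset and encodings.

# Sketch: compare Parquet file sizes under Snappy vs. ZSTD compression.
# Hypothetical data and column names; real savings depend on the data.
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small, repetitive table so compression has something to work with.
n = 1_000_000
table = pa.table({
    "city_id": pa.array([i % 500 for i in range(n)], type=pa.int32()),
    "status": pa.array([["completed", "canceled", "in_progress"][i % 3] for i in range(n)]),
    "fare": pa.array([round((i % 1000) * 0.37, 2) for i in range(n)], type=pa.float64()),
})

pq.write_table(table, "trips_snappy.parquet", compression="snappy")
pq.write_table(table, "trips_zstd.parquet", compression="zstd")

snappy_size = os.path.getsize("trips_snappy.parquet")
zstd_size = os.path.getsize("trips_zstd.parquet")
print(f"snappy: {snappy_size:,} bytes")
print(f"zstd:   {zstd_size:,} bytes ({100 * (1 - zstd_size / snappy_size):.1f}% smaller)")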

Beyond the on-disk file format, the in-memory format Apache Arrow is another area we are looking into. Using Apache Arrow as a cache layer would vastly improve performance. We will share our vision at Uber for big data formats.
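As a minimal sketch of the Arrow-as-cache idea (hypothetical file and class names; not Uber's actual cache design), a file can be decoded from Parquet once into an in-memory Arrow table, and later reads can serve zero-copy projections and slices from memory instead of touching storage again:

# Sketch: keep a Parquet file decoded as an Arrow table in memory and serve
# zero-copy slices from it, instead of re-reading Parquet on every query.
import pyarrow as pa
import pyarrow.parquet as pq


class ArrowTableCache:
    """Very small read-through cache mapping file path -> in-memory Arrow table."""

    def __init__(self):
        self._tables = {}  # path -> pa.Table

    def get(self, path: str) -> pa.Table:
        if path not in self._tables:
            # One-time Parquet decode; subsequent accesses stay in memory.
            self._tables[path] = pq.read_table(path)
        return self._tables[path]


cache = ArrowTableCache()
trips = cache.get("trips_zstd.parquet")       # decoded from Parquet once
subset = trips.select(["city_id", "fare"])    # zero-copy column projection
recent = subset.slice(0, 10_000)              # zero-copy row slice
print(recent.num_rows, recent.schema.names)

Because Arrow tables are columnar and already decoded, such a cache can also be shared across processes or services (for example via Arrow IPC), which is typically where a cache layer like this pays off.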

Speakers:

Xinli Shang: Xinli Shang is a tech lead on the Uber Data Infra team and Apache Parquet PMC Chair. He is passionate about big data file formats for efficiency, performance, and security, and about tuning large-scale services for performance, throughput, and reliability. He is an active contributor to Apache Parquet. He also has many years of experience developing large-scale distributed systems such as S3 Index and the Windows operating system.

Pavi Subenderan: Pavi is a Software Engineer on Uber’s Data Infra team. His focus is on data security, privacy, and open source big data technologies. He has been working on Parquet column encryption for 1.5 years and, more recently, on data masking.

Jianchun Xu: Jianchun is a Senior Software Engineer on Uber’s Data Infra team. He has mainly worked on big data infra and Parquet since joining the data team about a year ago. He also has extensive experience with service deployment platforms, developer tools, and web/JavaScript engines.