Apache Pinot (incubating) is a distributed columnar OLAP engine that can ingest data in real-time and serve analytical queries at low latency and high throughput. Pinot has evolved and matured over the last couple of years since it entered Apache Incubation. LinkedIn and Uber host the largest footprints of production Pinot installations and we leverage Pinot as the de-facto solution for high-speed analytical queries on both offline (batch) and real-time data. In this joint presentation, we will deep-dive into a few of the major features that have been contributed by LinkedIn and Uber. Specifically, we will go over the following features, briefly discuss the design and implementation, challenges encountered, and how they are being used at a huge scale within LinkedIn and Uber.
- Support for text indexes in Pinot for efficient queries on arbitrary text data.
- Solving audience reach estimation problem using theta sketch based approximate distinct count
- Upsert allows the update on the existing record with the primary key
- Geospatial support of the performant geo-related queries with the H3-based geo indexing
- Deep-store bypass to support 7x24 real-time ingestion even when the 3rd party Cloud storage is unavailable
Siddharth Teotia: Siddharth works at LinkedIn in the Pinot team part of Systems and Infrastructure group Prior to LinkedIn, he worked at Oracle for 3.5 years in the Database kernel group on storage, indexing and in-memory columnar query processing. Prior to Oracle, Siddharth worked at Dremio for 2 years as one of the early engineers building out the distributed data lake query engine. He is also a PMC member for Apache Pinot and Apache Arrow.
Yupeng Fu: Yupeng is a Staff Engineer at Uber and he leads Uber’s Real-time Platform and Infrastructure. including multiple mission-critical services powered by several open-source technologies like Kafka/Flink/Pinot. Yupeng is an Apache Pinot committer.