Faster Bigdata Analytics by maneuvering Apache CarbonData’s Indexes

Akash R Nilugal, Kunal Kapoor

English Session 2021-08-06 16:10 GMT+8  (ROOM : B) #bigdata

Data in the 21st century is like oil in the 18th century: an immensely valuable yet largely untapped asset, if processed intelligently. Storing and analyzing big data can be challenging and expensive in terms of both cost and time, and analytical solutions must keep adapting to the exponential growth of data. Apache CarbonData is a unified storage solution and file format that aims to optimize query performance and thereby reduce analytical cost. It has been adopted by 100+ open source users. In databases, one of the major features is the index, which lets a query find its results without scanning every row. Taking inspiration from the same concept, Apache CarbonData supports custom indexes such as min/max, bloom, Lucene, Secondary Index, and Materialized Views for faster row-level updates, deletes, OLAP queries, and point queries. This presentation focuses on CarbonData's custom index architecture and the Distributed Index Cache Server, which together help deliver faster query results, and on the future challenges and scope.
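As a rough illustration of how a min/max index lets a query avoid scanning every row, here is a minimal Python sketch. It is not CarbonData's implementation: the names Block, build_blocks, and point_query are hypothetical, and the data is a toy list of integers. Each block stores the minimum and maximum of its values, so a point query can skip any block whose range cannot contain the target.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of rows plus the min/max of its values (hypothetical structure)."""
    rows: List[int]
    min_value: int
    max_value: int


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split values into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def point_query(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching rows and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Prune: the target cannot be in a block whose [min, max] range excludes it.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.rows if v == target)
    return matches, scanned


# Toy data: 1000 sorted values in 10 blocks of 100 rows each.
blocks = build_blocks(list(range(1000)), block_size=100)
matches, scanned = point_query(blocks, 742)
print(matches, scanned)
```

With this sorted toy data, only one of the ten blocks needs to be read. On unsorted data the min/max ranges overlap more and fewer blocks are pruned, which is one reason other index types such as bloom filters are useful alongside min/max.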

Speakers:

Akash R Nilugal: I am an Apache CarbonData PMC member and committer, working as a Senior Technical Lead on the Cloud and AI/data platform team at the Bangalore Research Center, Huawei. I have been working on big data, mainly Apache CarbonData, for 5 years now. I have worked on and am interested in areas such as index support on big data, Materialized Views, CDC on big data, Spark SQL query optimizations, Spark Structured Streaming, data lake and data warehouse functionality, and Trino. Currently I am working on CarbonData CDC.

Kunal Kapoor: I am an Apache CarbonData PMC member and committer, working as a System Architect on the Cloud and AI/data platform team at the Bangalore Research Center, Huawei. I have been working on big data technologies such as Apache CarbonData, Apache Spark, and Apache Hive for 5 years now. Some of the major features I have worked on include the distributed index cache server, Hive + CarbonData integration, pre-aggregation support, S3 support for CarbonData, secondary index on CarbonData, and Spark SQL query optimization in CarbonData.