Scaling Impala - Common Mistakes and Best Practices

Manish Maheshwari

English Session 2021-08-06 13:30 GMT+8  (ROOM : A) #bigdata

Apache Impala is a complex engine, and using it to its full potential requires a thorough technical understanding. Without proper configuration and usage, Impala’s performance becomes unpredictable and the end-user experience suffers.

For many users and administrators, the right configuration of Impala is still a mystery. While working with some of the largest clusters in the world, we have found a set of common configuration and usage mistakes that lead to a lot of frustration.

In this talk, we will discuss ingestion best practices that keep an Impala deployment scalable, and admission control configuration that provides a consistent experience to end users. We will also take a high-level look at Impala’s query profile, which is the first stop in any performance troubleshooting. In addition, we will discuss common mistakes that users and BI tools make when interacting with Impala. Lastly, we will go over an ideal setup to show all of this in practice.


Manish Maheshwari has 15+ years of experience building extremely large data warehouses and analytical solutions. He has worked extensively on Apache Hadoop, DI and BI tools, data mining and forecasting, data modeling, master and metadata management, and dashboard tools. He is proficient in Hadoop, SAS, R, Informatica, Teradata, and QlikView.