Advanced real-time and batch Analytics using Apache Druid

Tijo Thomas

English Session 2021-08-08 13:30 GMT+8  #streaming

Modern Streaming analytical technology excels in crunching live data in a real-time or near-real-time manner. In parallel, other big data technologies can query historical data very well. But these technologies often provide sub-optimal results when the queries involve real-time as well as historical data.

In the real world, this path often resulted in workarounds by splitting the query into real-time and historical. The former gets fired to a system handling real-time data whereas the latter is fired to historical data. The results are derived from subsequently merged output. This is suitable for scenarios where both the queries and the logic for merging the results are pre-defined. But this is not a scalable approach where the queries are ad hoc and analytical in nature and the aggregation logic spreads across both real-time and historical data.

Apache Druid overcomes the limitation of both batch and real-time systems. It will return the result on recent data combined with historical data with guaranteed correctness. This takes away the need for a merging logic. Thus Druid enables querying of real-time data in fast, flexible and with sub-second latency.

In this talk, we will review the limitations of existing analytic infrastructures built on Apache Kafka, Apache Flink, Apache Spark and other cloud-based streaming platforms for querying real-time and batch data. Finally, we go on to explain how Apache Druid helps to bridge this gap.


Tijo Thomas: Tijo Thomas is a Sr. Solutions Architect at Imply and an experienced Data Engineer. He has over 18 years of experience in software development, mostly in big data and streaming technologies. He has been helping customers in setting up their stream processing infrastructures using Apache Druid over the last couple of years. During this time he has collected best practices, patterns and anti-patterns applied in production environments.