Does Spark Need a Superpowered Team-up?

Gian Merlino

English Session 0001-01-01 00:00 GMT+8  #streaming-bk

In just over a decade, Spark has grown from a university lab project to one of the most active projects at the Apache Software Foundation, with committers and users worldwide. As a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters, Apache Spark™ has proven its value.

But what about the things Spark doesn’t do well? Spark is great for stream processing, but what if you need to combine streams with historical batch data? Spark is fast for sequential reads, but what if you just want to retrieve a single record (or a small group of records) from the data set? Big, heavyweight queries are a core use of Spark, but what if you need a large number of concurrent queries? To complement Spark, we need an Anti-Spark that acts as an analytics ally.

Gian Merlino, Apache Druid® committer and co-founder of Imply, will show how real-time analytics databases complement the capabilities of Spark, demonstrating how the two technologies work together to empower high-performance systems. Form of a reliable machine learning data workflow! Shape of a database for interactive data conversations, with high concurrency and low latency, combining stream and batch data!

Speakers:


Gian Merlino: Co-Founder and Chief Technology Officer, Imply. Gian is a co-founder and CTO of Imply and one of the main committers of Apache Druid. Previously, he led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a B.S. in Computer Science from Caltech.