Wednesday 17:10 UTC
Visual Flow, open-source ETL tool powered by Apache Spark
Dmitri Pavlov, Alexander Shevtsov
We want to present Visual Flow, a cloud-native, open-source ETL tool based on the Apache Spark unified analytics engine, with a graphical user interface. It combines the best features of Spark, Kubernetes, and Argo into a single system with the following features:
- Portability, flexibility and multi-cloud compatibility
- Increased developer productivity
- High availability, performance and fault tolerance
- Open source
- Visual Flow does not require a database; all objects can be created as native Kubernetes resources.
- Visual Flow leverages Kubernetes authorization to manage users and their roles within a project.
- Visual Flow provides the ability to create parameters (e.g., connection info) and reuse them in Jobs.
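The last two points can be sketched in code. The following is a hypothetical, stdlib-only Python illustration of the general pattern of storing reusable connection parameters as a native Kubernetes resource (a ConfigMap manifest) and referencing them from a Job; the field names and resource shapes follow standard Kubernetes conventions, not Visual Flow's actual schema.

```python
# Hypothetical sketch: reusable ETL connection parameters stored as a native
# Kubernetes resource (a ConfigMap) instead of in a database, then reused by
# a Job via envFrom. Names and values are illustrative only.

def connection_params_manifest(name, params):
    """Build a ConfigMap manifest holding reusable connection parameters."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "labels": {"app": "etl", "type": "params"}},
        "data": dict(params),
    }

def job_with_params(job_name, image, param_ref):
    """Build a Job manifest that reuses the shared parameters via envFrom."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "spark-etl",
                        "image": image,
                        # Reuse the shared connection parameters in this Job.
                        "envFrom": [{"configMapRef": {"name": param_ref}}],
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

params = connection_params_manifest(
    "warehouse-conn", {"JDBC_URL": "jdbc:postgresql://db:5432/dw", "USER": "etl"})
job = job_with_params("nightly-load", "spark:3.3", "warehouse-conn")
```

Because both objects are ordinary Kubernetes resources, they can be listed, versioned, and access-controlled with the cluster's own tooling, which is what makes a separate database unnecessary.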
Dmitri Pavlov is a data integration architect who has worked in BI and Big Data analytics for the last 10 years. ETL tools such as IBM DataStage have been instrumental in his work, allowing for faster development cycles, more efficient deployments, workload monitoring, and configuration management. Visual Flow is an attempt by his department to bring some of the time-tested features of traditional ETL tools into the Big Data world.
Alexander Shevtsov is a Data Architect at IBA Group. He has worked for about 10 years on various enterprise data-oriented projects and has solid experience in building ETL solutions. His first experience was with command-line tools for traditional relational databases, which eventually led to complex and challenging data integration projects in Big Data.
Under-The-Hood of Druid without Zookeeper on Kubernetes
Most distributed systems depend upon leader election, service/node discovery, and node failure detectors in order to provide a resilient service that continues to work in the presence of a subset of node failures.
Some implement their own hand-crafted algorithms; some implement Raft, Paxos, or other well-known distributed consensus algorithms. But these are notoriously hard to get right, so Zookeeper became a popular solution for delegating distributed system coordination.
Apache Druid initially depended on Zookeeper for most of its distributed coordination activities and having Zookeeper available was a hard requirement of operating a Druid cluster. In recent years, Druid made it possible to "plug" different extensions that could delegate distributed coordination to Zookeeper or something else.
Kubernetes is becoming a common choice for deploying and operating Druid clusters. Kubernetes (backed by etcd, which is not directly accessible on most managed Kubernetes services) provides the necessary building blocks and APIs to enable leader election, service/node discovery, and failure detection for applications deployed on it.
So, it was a logical next step to have a Druid extension that delegates Druid's distributed coordination needs to Kubernetes when Druid is deployed there.
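The leader-election building block mentioned above boils down to a lease that a candidate acquires and must keep renewing. The toy Python sketch below simulates that lease-based pattern (which the Kubernetes coordination.k8s.io Lease API enables); a dict stands in for the Lease object, whereas a real implementation relies on the API server for atomic updates. This is purely illustrative, not Druid's actual extension code.

```python
# Toy simulation of lease-based leader election. A dict plays the role of
# the Kubernetes Lease object; in reality the API server arbitrates
# concurrent updates atomically.

LEASE_SECONDS = 2

def try_acquire(lease, candidate, now):
    """Become leader if the lease is free, already ours, or expired."""
    holder = lease.get("holder")
    expired = now - lease.get("renewed", 0) > LEASE_SECONDS
    if holder is None or holder == candidate or expired:
        lease["holder"] = candidate   # take (or keep) leadership
        lease["renewed"] = now        # renewing resets the expiry clock
        return True
    return False

lease = {}
assert try_acquire(lease, "coordinator-a", now=0.0)      # a wins the lease
assert not try_acquire(lease, "coordinator-b", now=1.0)  # still held by a
assert try_acquire(lease, "coordinator-b", now=3.5)      # a stopped renewing
```

The same renew-or-expire mechanism doubles as a failure detector: a node that stops renewing its lease is, from the cluster's point of view, dead.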
This talk is an under-the-hood look at how Druid made distributed coordination pluggable and how it works inside the Kubernetes-based extension.
Himanshu is a long-term Apache Druid contributor and PMC member. He is also the author of the Druid Kubernetes Operator (https://github.com/druid-io/druid-operator). Currently, he is a Principal Engineer at Cloudera, bringing the enterprise data cloud to customers.
In the past, he worked at Yahoo! and Splunk, solving all kinds of challenges to make sense of large-scale data.
Building modern SQL query optimizers with Apache Calcite
Query optimization is one of the most challenging problems in database systems. For many years, creating a query optimizer was considered black art, available only to a limited number of companies and products.
Not any more. Apache Calcite is an open-source framework that allows you to build query engines, and query optimizers in particular, at a significantly lower engineering cost. In this talk, I will present query optimization capabilities of Apache Calcite, including cost-based and heuristic optimization drivers and an extensive library of optimization rules.
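To give a flavor of what a heuristic optimization rule looks like, here is a toy illustration in Python (not Calcite's Java API) of the kind of rewrite a heuristic planner such as Calcite's HepPlanner applies repeatedly until no rule fires: merging two stacked Filter nodes into one. The plan encoding and rule names are invented for this sketch.

```python
# Toy rule-based plan rewriting. Plans are nested tuples:
#   ("Filter", condition, input), ("Scan", table)

def merge_filters(plan):
    # Rule: Filter(c1, Filter(c2, x)) -> Filter("c1 AND c2", x)
    if plan[0] == "Filter" and plan[2][0] == "Filter":
        return ("Filter", f"{plan[1]} AND {plan[2][1]}", plan[2][2])
    return plan

def apply_rules(plan, rules):
    """Apply rules bottom-up until a fixpoint, as a heuristic planner would."""
    if len(plan) == 3:                                   # rewrite children first
        plan = (plan[0], plan[1], apply_rules(plan[2], rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(plan)
            if new != plan:
                plan, changed = new, True
    return plan

plan = ("Filter", "a > 1", ("Filter", "b < 5", ("Scan", "t")))
optimized = apply_rules(plan, [merge_filters])
assert optimized == ("Filter", "a > 1 AND b < 5", ("Scan", "t"))
```

A cost-based driver differs in that it keeps multiple equivalent plans and picks the cheapest by a cost model, rather than greedily applying every rule that matches.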
Vladimir Ozerov is a founder of Querify Labs, where he manages the research and development of innovative data management products for technology companies. Before that, Vladimir worked on the in-memory data platforms Apache Ignite and Hazelcast for more than eight years, focusing on distributed data processing. Vladimir is a contributor to Apache Calcite and a committer to Apache Ignite.
Wednesday 19:40 UTC
Improving interactive querying experience on Spark SQL
Ashish Singh, Sanchay Javeria
As a data-driven company, Pinterest relies on interactive querying over hundreds of petabytes of data as a common and important function. Interactive querying has different requirements and challenges from batch querying. In this talk, we will cover the architectural alternatives one can choose from to perform interactive querying with Spark SQL. Through a thorough discussion of the trade-offs of each architecture and the requirements of interactive querying, we will explain the reasoning behind our design choice. We will further share enhancements we made to open-source projects including Apache Spark, Apache Livy, and Dr. Elephant, along with in-house technologies we built to improve the interactive querying experience at Pinterest. These enhancements include DDL query speed-ups, Spark session caching, Spark session sharing, Apache YARN diagnostic message improvements, query failure handling, and query tuning recommendations. We will also discuss some challenges we faced along the way and future improvements we are working on.
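One of the enhancements mentioned above, session caching, is a generic pattern worth sketching: keep warm query sessions in a pool keyed by user so interactive queries skip cold-start cost. The Python below is an illustrative toy with invented names, not Pinterest's actual Spark/Livy implementation.

```python
import time

# Toy session pool: reuse a warm session per user until a TTL expires,
# so repeated interactive queries avoid paying session startup cost.

class SessionPool:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.sessions = {}      # user -> (session_id, last_used)
        self.created = 0        # counts cold starts

    def get(self, user, now=None):
        now = time.monotonic() if now is None else now
        entry = self.sessions.get(user)
        if entry and now - entry[1] < self.ttl:
            self.sessions[user] = (entry[0], now)   # refresh last-used time
            return entry[0]                         # cache hit: warm session
        self.created += 1                           # cache miss: cold start
        session_id = f"sess-{self.created}"
        self.sessions[user] = (session_id, now)
        return session_id

pool = SessionPool(ttl_seconds=60)
a = pool.get("alice", now=0)
b = pool.get("alice", now=10)    # within TTL: same warm session reused
c = pool.get("alice", now=120)   # idle too long: new session started
assert a == b and a != c and pool.created == 2
```

Session *sharing* extends the same idea across users, which adds isolation concerns (credentials, temp views) that a per-user pool avoids.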
Ashish Singh is a tech lead on the BigData Query Processing Platform team and focuses on making it easier, faster, and more reliable to express computational needs with SQL. As an open-source enthusiast and an Apache committer, he has contributed to multiple open-source big data projects over the past several years, including Presto, Apache Parquet, Kafka, Hive, and Sentry. Before joining Pinterest, Ashish worked at Cloudera. Ashish holds an M.S. in computer science from The Ohio State University.
Sanchay Javeria is a Software Engineer on the BigData Query Processing Platform team and focuses on making the in-house interactive SQL query pipelines more efficient, reliable, and feature-rich. Prior to joining Pinterest, Sanchay completed his B.S. in computer science at the University of Illinois at Urbana-Champaign.
Apache Hive Replication V3
Pravin Kumar Sinha
The ability to schedule and manage replication policies within Hive has been a gap in Hive's replication solution. Hive Replication V3 introduces many enhancements, including but not limited to replication policy management.
This talk will cover the new features introduced in Hive Replication V3, such as scheduling, check-pointing, acknowledgment, and support for Atlas and Ranger metadata replication, along with a few performance enhancements we made as part of it. On the performance side, it will discuss enhancements such as loading partitions in batches for managed tables and waiting only for the on-going transactions of the databases under replication.
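The "load partitions in batches" optimization mentioned above can be sketched minimally: instead of issuing one call per partition of a managed table, partitions are grouped into fixed-size batches so far fewer round trips are needed. This is a generic Python illustration, not Hive's actual replication code.

```python
# Generic batching sketch: 10 partitions loaded in batches of 4 means
# 3 calls instead of 10 per-partition calls.

def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of partitions."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

partitions = [f"ds={day:02d}" for day in range(1, 11)]   # 10 partitions
calls = list(batched(partitions, batch_size=4))
assert len(calls) == 3
assert calls[0] == ["ds=01", "ds=02", "ds=03", "ds=04"]
assert calls[-1] == ["ds=09", "ds=10"]
```

The batch size trades memory per call against the number of round trips; the right value depends on partition metadata size.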
Pravin is a committer on the Apache Hive project. He is currently working at Cloudera, focusing on the Hive Replication component. Pravin has a keen interest in distributed systems. With more than 13 years of experience in software development, he has worked on distributed backend systems such as an XMPP server and an indexing and search solution for an email server.
Thursday 15:50 UTC
Challenges of Spark Applications coexisting with NoSQL databases
Capital One is the first US bank to exit its on-premises data centers and move completely to the cloud. In the process of modernizing our applications in Capital One Card Rewards, we developed a ground-up, custom transaction-processing application on open-source technologies like Spark, MongoDB, Cassandra, etc. This application currently processes millions of customer transactions daily, awarding customers millions of miles, cash back, and points every day. While building the application, we came across many challenging issues in having Spark process data from MongoDB and Cassandra backends to serve customers. This talk focuses on a few of those issues, their impact, and how to mitigate them. Specifically, the talk will cover the following:
- How the Cassandra key sequence is important and how it impacts querying
- How Cassandra batching helps and works well with Spark partitions
- Importance of Cassandra Data Modeling and its implications after MVP/Deployment
- How to manage Mongo Connection (at JVM level)
- Implications of using MongoSpark connector on its Partitioner
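The Cassandra batching point above rests on a property worth sketching: unlogged batches are only cheap when every statement targets the same Cassandra partition, so rows within a Spark partition should be grouped by partition key before batching. The code below is an illustrative pure-Python sketch of that grouping, not the actual Spark-Cassandra connector internals.

```python
from collections import defaultdict

# Group rows by Cassandra partition key, then split each group into
# bounded batches, so every batch touches exactly one Cassandra partition.

def group_for_batches(rows, key_fn, max_batch=20):
    by_key = defaultdict(list)
    for row in rows:
        by_key[key_fn(row)].append(row)        # one bucket per partition key
    batches = []
    for key, group in by_key.items():
        for i in range(0, len(group), max_batch):
            batches.append((key, group[i:i + max_batch]))
    return batches

rows = [{"customer": c, "txn": t} for c in ("a", "b") for t in range(25)]
batches = group_for_batches(rows, key_fn=lambda r: r["customer"])
# Each batch targets a single partition key; oversized groups are split.
assert all(len({r["customer"] for r in grp}) == 1 for _, grp in batches)
assert len(batches) == 4   # 25 rows per key -> 2 batches per key
```

Aligning Spark partitions with Cassandra partition keys in this way keeps batches single-partition, which avoids the coordinator-side overhead of multi-partition batches.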
Gokul Prabagaren is an Engineering Manager in Capital One's Rewards org, specializing in distributed computing. He has developed distributed cloud-native applications based on Spark, Cassandra, and Mongo that currently serve millions of customers every day. Previously, he developed Java applications (since Java 1.2) on-premises and on VMs.
Thursday 17:10 UTC
Storage-Partitioned Join for Apache Spark
Chao Sun, Ryan Blue
Spark currently supports shuffle join (either shuffled hash join or sort-merge join), which relies on both sides being partitioned by Spark’s internal hash function. To avoid the potentially expensive shuffle phase, users can create bucketed tables and use bucket joins, which ensure data is pre-partitioned on disk when written, using the same hash function, and thus won’t be shuffled again in subsequent joins.
A storage partitioned join extends the idea beyond hash-based partitioning, and allows other types of partition transforms: two tables that are partitioned by hour could be joined hour-by-hour, or two tables partitioned by date and a bucket column could be joined using date/bucket partitions. It also paves the way for data sources to provide their own hash functions, for example, bucketed tables created by Hive.
In this talk we propose extensions for Spark that will enable storage partitioned joins for v2 data sources, using partitions produced by any transform or a combination of partition values. This also proposes an extension to Spark’s distribution model to avoid processing large partitions in a single task.
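The core idea can be modeled in a few lines: when both tables are already partitioned by the same transform (here, hour), matching partitions can be joined directly, with no shuffle. The following pure-Python toy illustrates the concept only; the data and schema are invented, and this is not Spark's implementation.

```python
# Toy model of a storage-partitioned join: both "tables" are dicts from
# hour -> rows, i.e. already partitioned by the same transform, so the
# join runs partition-by-partition without any repartitioning step.

def storage_partitioned_join(left, right):
    out = {}
    for hour in left.keys() & right.keys():          # only co-existing partitions
        out[hour] = [(l, r) for l in left[hour] for r in right[hour]
                     if l["id"] == r["id"]]          # local join within the hour
    return out

impressions = {0: [{"id": 1, "ad": "x"}], 1: [{"id": 2, "ad": "y"}]}
clicks      = {1: [{"id": 2, "url": "/a"}], 2: [{"id": 3, "url": "/b"}]}

joined = storage_partitioned_join(impressions, clicks)
assert list(joined) == [1]
assert joined[1] == [({"id": 2, "ad": "y"}, {"id": 2, "url": "/a"})]
```

The skew concern in the proposal shows up here too: a single huge hour partition would be joined by one "task", which is why the proposal extends Spark's distribution model to split large partitions.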
Chao is currently a software engineer at Apple, working on Apache Spark. Previously, he was involved in various Apache projects such as Apache Hive, Apache Hadoop, and Apache Arrow.
Ryan Blue is the co-creator and PMC chair of Apache Iceberg and works on open source data infrastructure. He is an ASF member, Avro and Parquet PMC member, and a Spark committer.
Casting the spell: Druid in practice
Itai Yaffe, Yakir Buskilla
At Nielsen Identity, we leverage Apache Druid to provide our customers with real-time analytics tools for various use-cases, including in-flight analytics, reporting and building target audiences.
The common challenge of these use-cases is counting distinct elements in real-time at scale.
We’ve been using Druid to solve these problems for the past 5 years, and gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years.
We will cover the following topics:
- Data modeling
- Retention and deletion
- Query optimization
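The common challenge named above, counting distinct elements in real time at scale, is usually solved in Druid with HLL or Theta sketches. To show the underlying idea, here is a simpler K-Minimum-Values sketch in pure Python: hash elements into [0, 1), keep the k smallest hashes, and estimate cardinality from how densely they pack near zero. This is an illustrative stand-in, not Druid's sketch implementation.

```python
import hashlib

# K-Minimum-Values (KMV) approximate distinct count. A production sketch
# keeps only the k smallest hashes while streaming; this toy keeps all
# distinct hashes for clarity.

def kmv_estimate(elements, k=1024):
    hashes = set()
    for e in elements:
        h = int.from_bytes(hashlib.md5(str(e).encode()).digest()[:8], "big")
        hashes.add(h / 2**64)              # normalize the hash into [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)               # small sets: the count is exact
    return (k - 1) / smallest[-1]          # classic KMV estimator

ids = [f"user-{i % 10000}" for i in range(50000)]   # 10k distinct, repeated
est = kmv_estimate(ids)
assert abs(est - 10000) / 10000 < 0.2      # within ~20% of the truth
```

Like the sketches Druid uses, this trades a small, tunable error (roughly 1/sqrt(k)) for constant memory and the ability to merge partial results across real-time and historical segments.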
Itai Yaffe is a Senior Solutions Architect at Databricks. Prior to Databricks, Itai was a Principal Solutions Architect at Imply (founded by the original creators of Apache Druid), and before that, a big data tech lead at Nielsen Identity, where he dealt with big data challenges using tools like Spark, Druid, Kafka, and others. He is also a part of the Israeli chapter's core team of Women in Big Data. Itai is keen about sharing his knowledge and has presented his real-life experience in various forums in the past.
Yakir Buskilla is co-founder and CEO of cocohub.ai, and previously the SVP R&D and GM Israel at Nielsen Identity. His fields of interest are Big Data solutions and large-scale machine learning.