Apache Arrow based DataFrame for Data Processing in Python

Supun Kamburugamuve

English Session 2021-08-08 16:50 GMT+8 (Room: A) #bigdata

Machine learning (ML) and deep learning (DL) have made remarkable progress in the past few years, and the resource requirements of modern ML/DL applications have outgrown the capability of a single node. However, these applications are only a small part of the overall data processing environment, which must also support a raft of data engineering tasks for pre- and post-processing of data, communication, and system integration. The data tools surrounding ML/DL applications need to integrate easily with existing ML/DL frameworks in a multitude of languages, which increases user productivity and efficiency. All this demands an efficient, highly distributed, and integrated approach to data processing, yet many of today’s popular data analytics tools are unable to satisfy all these requirements at the same time.

This presentation introduces Cylon DataFrame, an open-source, high-performance, distributed Python API similar to Pandas. It is built on a flexible C++ core on top of the Apache Arrow data format. The presentation discusses Cylon’s architecture and how it works at scale. Initial experiments show that Cylon outperforms popular tools such as Apache Spark and Dask on key operations, and that it has the potential to integrate with them. Finally, we show how Cylon can enable high-performance data pre-processing in popular AI tools such as PyTorch, TensorFlow, and Jupyter Notebook without taking away data scientists’ productivity.


Supun Kamburugamuve: Supun Kamburugamuve has a Ph.D. in computer science, specializing in high-performance data analytics. He is a principal software engineer at the Digital Science Center of Indiana University, where he leads the Twister2 and Cylon high-performance data analytics projects. Supun is an elected member of the Apache Software Foundation and has contributed to many open-source projects, including Apache Web Services projects and Apache Heron. Before joining Indiana University, Supun worked on middleware systems and was a key member of the WSO2 ESB project, an open-source enterprise integration solution widely used by enterprises. Supun has given many technical talks at research and technical conferences, including Strata NY, Big Data Conference, and ApacheCon.