Apache Big Data Track

Tuesday 16:15 UTC
Apache Hadoop YARN: Past, Now and Future
Szilard Nemeth, Sunil Govindan

Apache Hadoop YARN is an integral part of on-premises solutions and will be for the foreseeable future. We also believe it will have an important role in the cloud, allowing users to move their solutions to both private and public clouds as easily as possible. For this reason, the development of YARN gained new momentum in the last year. In this talk, we will cover the latest updates from the community which will be released in the Hadoop 3.3.x/3.4.x releases. For compute acceleration devices (GPU/FPGA, etc.), we will talk about better GPU support (including GPU hierarchy scheduling support, Nvidia-docker v2 support, etc.), YARN's device plugin framework that lets developers add new compute-acceleration devices more easily, FPGA support, and more. For scheduling-related improvements, we will talk about new improvements to global scheduling in CapacityScheduler, new efforts to bring dynamic queue creation and absolute resources into production, and recent work on fs2cs, which allows users to migrate from FairScheduler to CapacityScheduler. Apart from these, we will talk about new containerization improvements: runc support, improvements to log aggregation to better support cloud storage, etc. The audience will get the latest development progress of Apache Hadoop YARN, which will help them make decisions when upgrading to Hadoop 3.x.
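
As a rough illustration of the resource-type API behind the GPU work, here is a minimal, hypothetical Java sketch of an ApplicationMaster requesting a container with one GPU. It assumes a Hadoop 3.x cluster where the yarn.io/gpu resource type and a GPU plugin are already configured, and it omits AM registration and the allocation loop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceInformation;
    import org.apache.hadoop.yarn.client.api.AMRMClient;

    public class GpuRequestSketch {
      public static void main(String[] args) {
        // 4 GiB of memory, 2 vcores, plus 1 GPU expressed as an extended resource type.
        Resource capability = Resource.newInstance(4096, 2);
        capability.setResourceInformation(
            "yarn.io/gpu", ResourceInformation.newInstance("yarn.io/gpu", 1L));

        // Ask the ResourceManager for one container with this capability.
        AMRMClient.ContainerRequest request = new AMRMClient.ContainerRequest(
            capability, null /* nodes */, null /* racks */, Priority.newInstance(0));

        AMRMClient<AMRMClient.ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
        amrmClient.init(new Configuration());
        amrmClient.start();
        amrmClient.addContainerRequest(request);
        // ... register with the RM and run the allocate loop in a real AM.
      }
    }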

Szilard Nemeth:
Szilard Nemeth has many years of development experience, mainly in Java. He joined Cloudera's YARN team in late 2017 and has been key to transferring YARN knowledge from Palo Alto to Budapest, Hungary. He became a Hadoop committer in the summer of 2019. Throughout his career in YARN he has mostly focused on Custom Resource Types and GPU support, and recently he has been involved in making YARN more flexible and ready for the cloud. He lives in Budapest but, when he has the time, he loves traveling.
Sunil Govindan:
Engineering Manager @Cloudera. Contributing to the Apache Hadoop project since 2013 in various roles as a Hadoop contributor, Hadoop committer, and a member of the Project Management Committee (PMC). Mainly working on YARN scheduling improvements and multiple resource type support in YARN. He also serves as an Apache Submarine PMC member and an Apache YuniKorn (incubating) PMC member.

Tuesday 16:55 UTC
Hadoop Storage Reloaded: the 5 lessons Ozone learned from HDFS
Márton Elek

Apache (Hadoop) Ozone is a brand-new storage system for the Hadoop ecosystem. It provides Object Store semantics (like S3) and can handle billions of objects. Ozone doesn't depend on HDFS, but it is its "spiritual successor". The lessons learned during the 10+ years of HDFS helped design a more scalable object store. This presentation explains the key challenges of a storage system and shows how the specific problems can be addressed.

Marton Elek is a PMC member of the Apache Hadoop and Apache Ratis projects and works on Apache Hadoop Ozone at Cloudera. Ozone is a new Hadoop sub-project which provides an S3-compatible Object Store for Hadoop on top of a new generalized binary storage layer. He is also working on the containerization of Hadoop and creating different solutions to run Apache Big Data projects in Kubernetes and other cloud native environments.

Tuesday 17:35 UTC
Building efficient and reliable data lakes with Apache Iceberg
Anton Okolnychyi, Vishwanath Lakkundi

Apache Iceberg is a table format that allows data engineers and data scientists to build reliable and efficient data lakes with features that are normally present only in data warehouses. This talk will be a deep dive into key design principles of Apache Iceberg that enable the following features on top of data lakes:
- ACID compliance on top of any object store or distributed file system
- Flexible indexing capabilities which boost the performance of highly selective queries
- Implicit partitioning using partition transforms
- Reliable schema evolution
- Time travel and rollback
These advanced features let companies substantially simplify their current architectures as well as enable new use cases on top of data lakes.
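
To make the listed features more concrete, here is a small, hedged Java sketch (not taken from the talk) that creates an Iceberg table with a hidden day() partition transform through a HadoopCatalog and shows where time travel hooks in; the warehouse path, schema, and table names are invented.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.types.Types;

    public class IcebergSketch {
      public static void main(String[] args) {
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.required(2, "event_ts", Types.TimestampType.withZone()));

        // Implicit partitioning: queries filter on event_ts, Iceberg maps it to day partitions.
        PartitionSpec spec = PartitionSpec.builderFor(schema).day("event_ts").build();

        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "/tmp/warehouse");
        Table table = catalog.createTable(TableIdentifier.of("db", "events"), schema, spec);

        // Time travel: scan the table as of an earlier snapshot (requires at least
        // one committed write, otherwise currentSnapshot() is null).
        if (table.currentSnapshot() != null) {
          long snapshotId = table.currentSnapshot().snapshotId();
          table.newScan().useSnapshot(snapshotId).planFiles();
        }
      }
    }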

Anton is a committer and PMC member of Apache Iceberg as well as an Apache Spark contributor at Apple. At Apple, he is working on making data lakes efficient and reliable. Prior to joining Apple, he optimized and extended a proprietary Spark distribution at SAP. Anton holds a Master’s degree in Computer Science from RWTH Aachen University.
Vishwanath Lakkundi is the engineering lead for the team that focuses on Data Orchestration and Data Lake at Apple. This team is responsible for the development of an elastic, fully managed Apache Spark as a service, a Data Lake engine based on Apache Iceberg, and a data pipelines product based on Apache Airflow. He has been working at Apple for the last seven years, focusing on various analytics infrastructure and platform products.

Tuesday 18:15 UTC
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Panagiotis Garefalakis, Stamatis Zampetakis

Apache Hive is an open-source relational database system that is widely adopted by several organizations for big data analytic workloads. It combines traditional MPP (massively parallel processing) techniques with more recent cloud computing concepts to achieve the increased scalability and high performance needed by modern data intensive applications. Even though it was originally tailored towards long-running data warehousing queries, its architecture recently changed with the introduction of the LLAP (Live Long and Process) layer. Instead of regular containers, LLAP utilizes long-running executors to exploit data sharing and caching possibilities within and across queries. Executors eliminate unnecessary disk IO overhead and thus reduce the latency of interactive BI (business intelligence) queries by orders of magnitude. However, as container startup cost and IO overhead are now minimized, the need to effectively utilize memory and CPU resources across long-running executors in the cluster becomes increasingly essential. For instance, in a variety of production workloads, we noticed that the memory bandwidth cost of eagerly decoding all table columns for every row, even when that row is dropped later on, started to dominate single-query performance. In this talk, we focus on some of the optimizations we introduced in Hive 4.0 to increase CPU efficiency and save memory allocations. In particular, we describe the lazy decoding (or row-level filtering) and composite bloom-filter optimizations that greatly improve the performance of queries containing broadcast joins, reducing their runtime by up to 50%. Over several production and synthetic workloads, we show the benefit of the newly introduced optimizations as part of Cloudera's cloud-native Data Warehouse engine. At the same time, the community can directly benefit from the presented features, as they are 100% open-source!

Panagiotis Garefalakis:
Panagiotis Garefalakis is a Software Engineer at Cloudera where he is part of the Data Warehousing team. He holds a Ph.D. in Computer Science from Imperial College London, where he was affiliated with the Large-Scale Data & Systems (LSDS) group. His interests lie within the broad area of systems, including large-scale distributed systems, cluster resource management, and big data processing.
Stamatis Zampetakis:
Stamatis Zampetakis is a Software Engineer at Cloudera working on the Data Warehousing product. He holds a PhD in Big Data management on massively parallel systems.

Tuesday 19:35 UTC
Big Data File Format Cost Efficiency - Millions of Dollars Deal
Xinli Shang, Juncheng Ma

Reducing the size of data at rest and in transit is critical to many organizations, not only because it saves storage cost but also because it improves IO usage and traffic volume in the network. We will present how to translate hundreds of petabytes of data compressed with GZIP in the data lake to a more efficient compression method, ZSTD, which can reduce the data size by 10% and save millions of dollars. We will show the recent Apache Parquet format improvement that makes ZSTD compression easy to set up (PARQUET-1866). The tech talk will also demonstrate how we solved the challenge of compression translation speed by improving throughput by 5X (PARQUET-1872). As a result, the translation time of large scale data sets can be reduced from months to days, saving compute vCores correspondingly. The translation needs to be done safely to prevent data corruption and incompatibility; we will also show the techniques built into the compression translation tool to prevent these problems. Another significant storage size reduction (up to 80%) can be achieved by 1) reordering the columns in Parquet to make it more friendly to encoding and compression, 2) encoding with BYTE_STREAM_SPLIT (PARQUET-1622), which is more efficient for floating point data, 3) reducing geolocation data precision to make RLE more efficient, and 4) pruning unused columns (PARQUET-1800). We will walk through each of the above techniques, its effectiveness, and the reasoning behind it. We will also show tools like the Parquet column-size command (PARQUET-1821) that help users identify candidate tables for applying the above techniques.
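
As a rough sketch of the GZIP-to-ZSTD switch (this is not the speakers' translation tool), the following Java snippet writes a Parquet file with ZSTD using the example writer from parquet-hadoop; it assumes a parquet-mr version with ZSTD support, and the schema and output path are made up.

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class ZstdWriteSketch {
      public static void main(String[] args) throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message trip { required int64 id; required double fare; }");

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/trips.zstd.parquet"))
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.ZSTD)  // the GZIP -> ZSTD switch
                .build()) {
          Group row = new SimpleGroupFactory(schema).newGroup();
          row.add("id", 1L);
          row.add("fare", 12.5);
          writer.write(row);
        }
      }
    }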

Xinli Shang:
Xinli Shang is a tech lead on the Uber Data Infra team and an Apache Parquet Committer. He is passionate about big data file formats for efficiency, performance, and security, and about tuning large scale services for performance, throughput, and reliability. He is an active contributor to Apache Parquet. He also has many years of experience developing large scale distributed systems like S3 Index and the Windows operating system.
Juncheng Ma:
Software Developer at Uber

Wednesday 09:00 UTC
Integrate Apache Flink with Cloud Native Ecosystem
Yang Wang, Tao Yang

With the vigorous development of cloud-native and serverless computing, more and more big-data workloads, especially Apache Flink workloads, are running in Alibaba Cloud for better deployment and management, backed by Kubernetes. This presentation introduces the experience of integrating Flink with the cloud-native ecosystem, including the improvements in Flink to support elasticity and run natively on Kubernetes, the experience of managing dependent components like ZooKeeper and HDFS and leveraging Kubernetes service/network/storage extensions, and better support for big-data workloads to satisfy requirements for multi-tenant management, resource fairness, and resource elasticity while achieving high scheduling performance via Apache YuniKorn.

Yang Wang:
Technical Expert of Alibaba's realtime computing team, Apache Flink Contributor, focusing on the direction of resource scheduling in Flink.
Tao Yang:
Technical Expert of Alibaba's realtime computing team, Apache Hadoop Committer, Apache YuniKorn Committer, focusing on the direction of resource scheduling in YARN and Kubernetes.

Wednesday 16:15 UTC
Next Gen Data Lakes using Apache Hudi
Balaji Varadarajan, Sivabalan Narayanan

Data Lakes are one of the fastest growing trends in managing big data across various industries. Data Lakes offer massively scalable data processing over vast amounts of data. One of the main challenges that companies face in building a data lake is designing the right primitives for organizing their data. Apache Hudi helps implement uniform, best-of-breed data lake standards and primitives. With such primitives in place, the next generation data lake will be about efficiency and intelligence. Businesses expect their data lake installations to cater to their ever-changing needs while being cost efficient. In this talk, we will discuss new features in Apache Hudi that are geared towards building next-gen data lakes. We will start with basic Apache Hudi primitives such as upsert and delete, required to achieve acceptable ingestion latencies while at the same time providing high quality data by enforcing schematization on datasets. We will look into the novel "record level index" supported by Apache Hudi and how it supports efficient upserts. We will then dive into how Apache Hudi supports query optimization by leveraging its rich metadata. Efficient storage management is a key requirement for large data lake installations. We will look at how Apache Hudi supports intelligent and dynamic re-clustering of data for better storage management and faster query times. Finally, we will discuss how to easily onboard your existing dataset to the Apache Hudi format, so you can leverage Apache Hudi's efficiency without making any drastic changes to your existing data lake.
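
For readers unfamiliar with the upsert primitive, here is a minimal, hedged Java/Spark sketch of writing a DataFrame into a Hudi table in upsert mode; it assumes a SparkSession with the Hudi bundle on the classpath, and the field names, table name, and path are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class HudiUpsertSketch {
      // df is expected to carry uuid/ts/partitionpath columns in this example.
      public static void upsert(Dataset<Row> df) {
        df.write()
          .format("hudi")
          // record key and precombine field drive Hudi's upsert semantics
          .option("hoodie.datasource.write.recordkey.field", "uuid")
          .option("hoodie.datasource.write.precombine.field", "ts")
          .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
          .option("hoodie.datasource.write.operation", "upsert")
          .option("hoodie.table.name", "trips")
          .mode(SaveMode.Append)
          .save("/tmp/hudi/trips");
      }
    }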

Balaji Varadarajan:
Balaji Varadarajan is currently a Staff Engineer on Robinhood's data platform team. Previously, he was a tech lead in Uber's data platform, working on Apache Hudi and the Hadoop platform at large. Before that, he was one of the lead engineers on LinkedIn's Databus change capture system. Balaji's interests lie in large-scale distributed data systems.
Sivabalan Narayanan:
Sivabalan Narayanan is a senior software engineer at Uber overseeing data engineering broadly across the network performance monitoring domain. He is an active contributor to Apache Hudi and a big data enthusiast whose interest lies in building data lake technologies. Previously, he was one of the core engineers responsible for building LinkedIn's blob store.

Wednesday 16:55 UTC
A Production Quality Sketching Library for the Analysis of Big Data
Lee Rhodes

In the analysis of big data there are often problem queries that don't scale because they require huge compute resources to generate exact results, or don't parallelize well. Examples include count-distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis. Algorithms that can produce approximate answers with accuracy guarantees for these problem queries are a required toolkit for modern analysis systems that need to process massive amounts of data quickly. For interactive queries there may not be other viable alternatives, and in the case of real-time streams, these specialized algorithms, called stochastic, streaming, sublinear algorithms, or 'sketches', are the only known solution. This technology has helped Yahoo successfully reduce data processing times from days to hours or minutes on a number of its internal platforms and has enabled subsecond queries on real-time platforms that would have been infeasible without sketches. This talk provides an introduction to sketching and to Apache DataSketches, an open source library of such algorithms in C++, Java, and Python, designed for large production analysis systems.
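
As a small taste of the library, the hedged sketch below estimates a distinct count with two mergeable HLL sketches from the Apache DataSketches Java API; the id stream and the lgK choice are arbitrary.

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    public class DistinctCountSketch {
      public static void main(String[] args) {
        HllSketch partA = new HllSketch(12);   // lgK = 12 -> roughly 1.6% relative error
        HllSketch partB = new HllSketch(12);
        for (int i = 0; i < 1_000_000; i++) partA.update("user-" + i);
        for (int i = 500_000; i < 1_500_000; i++) partB.update("user-" + i);

        // Sketches are mergeable, which is what makes them work across partitions.
        Union union = new Union(12);
        union.update(partA);
        union.update(partB);
        System.out.println("estimated distinct users: " + union.getResult().getEstimate());
      }
    }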

Lee Rhodes is a Distinguished Architect at Yahoo (now Verizon Media). He created the DataSketches project in 2012 to address analysis problems in Yahoo's large data processing pipelines. DataSketches was open sourced in 2015 and is in incubation at the Apache Software Foundation. He is an author or coauthor of sketching work published in ICDT, IMC, and JCGS. He obtained his Master's degree in Electrical Engineering from Stanford University.

Wednesday 17:35 UTC
Cylon - Fast Scalable Data Engineering
Supun Kamburugamuve, Niranda Perera

Machine learning (ML) and deep learning (DL) have made amazing progress in the past few years. Modern ML/DL applications have resource requirements that outgrow a single node's capability. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of big data engineering for pre- and post-processing of data, communication, and system integration. The big data tools surrounding ML/DL applications need to integrate easily with existing ML/DL frameworks in a multitude of languages, which particularly increases user productivity and efficiency. All this demands an efficient and highly distributed integrated approach to data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. This presentation introduces Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of the Apache Arrow data format and exposes language bindings to C++, Java, and Python. The presentation discusses Cylon's architecture in detail and reveals how it can be imported as a library into existing applications or operate as a standalone framework. Initial experiments show that Cylon outperforms popular tools such as Apache Spark and Dask, with major performance improvements for key operations, while retaining the ability to integrate with them. Finally, we show how Cylon can enable high-performance data pre-processing in popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks without taking away data scientists' productivity.

Supun Kamburugamuve:
Supun Kamburugamuve has a Ph.D. in computer science, specializing in high-performance data analytics. He is a principal software engineer at the Digital Science Center of Indiana University, where he leads the Twister2 and Cylon high-performance data analytics projects. Supun is an elected member of the Apache Software Foundation and has contributed to many open-source projects, including Apache Web Services projects and Apache Heron. Before joining Indiana University, Supun worked on middleware systems and was a key member of the WSO2 ESB project, an open-source enterprise integration solution widely used by enterprises. Supun has given many talks at research and technical conferences, including Strata NY, Big Data Conference, and ApacheCon.
Niranda Perera:
Niranda Perera is a second-year grad student at Indiana University - Luddy School of Informatics, Computing, and Engineering (SICE). He is enrolled in the Intelligent Systems Engineering Department and advised by Prof. Geoffrey Fox. He is an active contributor to the Twister2 project and a founding member of the Cylon high-performance data analytics project. He completed his Bachelor's at the University of Moratuwa, Sri Lanka, and prior to joining IU, he was a contributor to the WSO2 Data Analytics Server, which was a part of the WSO2 middleware stack.

Wednesday 18:15 UTC
Snakes on a Plane: Interactive Data Exploration with PyFlink and Zeppelin Notebooks
Marta Paes

Stream processing has fundamentally changed the way we build and think about data pipelines — but the technologies that unlock the value of this powerful paradigm haven’t always been friendly to non-Java/Scala developers. Apache Flink has recently introduced PyFlink, allowing developers to tap into streaming data in real-time with the flexibility of Python and its wide ecosystem for data analytics and Machine Learning. In this talk, we will explore the basics of PyFlink and showcase how developers can make use of a simple tool like interactive notebooks (Apache Zeppelin) to unleash the full power of an advanced stream processor like Flink.

Marta is a Developer Advocate at Ververica (formerly data Artisans) and a contributor to Apache Flink. After finding her mojo in open source, she is committed to making sense of Data Engineering through the eyes of those using its by-products. Marta holds a Master’s in Biomedical Engineering, where she developed a particular taste for multi-dimensional data visualization, and previously worked as a Data Warehouse Engineer at Zalando and Accenture.

Wednesday 18:55 UTC
Unified Data Processing with Apache Spark and Apache Pulsar
Jia Zhai

The Lambda architecture is widely used in industry when people need to process both real-time and historical data to get a result. It is effective, and a good balance of speed and reliability. But there are still challenges to using Lambda in practice. The biggest drawback has been the need to maintain two distinct (and possibly complex) systems to generate both the batch and streaming layers. The operational cost of maintaining multiple clusters is nontrivial, and in some cases one piece of business logic has to be split into many segments across different places, which is a challenge to maintain as the business grows and also increases communication overhead. In this session, we'd like to present a unified data processing architecture with Apache Spark and Apache Pulsar, built around the core idea of "one data storage, one computing engine, and one API", to solve the problems of the Lambda architecture.

Jia Zhai is the co-founder of StreamNative and a PMC member of both Apache Pulsar and Apache BookKeeper, and he contributes to both projects continually.

Wednesday 19:35 UTC
Cluster Management in Apache Ecosystem & Kubernetes
Shekhar Prasad Rajak

Apache already has powerful cluster and resource managers, so do we really need to use Kubernetes for deployment while using Apache projects? Let's find out what types of cluster management systems Apache already has, how cluster management works in each of the cases below, when we don't need any other cluster management on top of it, and when we can leverage the power of both these Apache cluster modes and Kubernetes for resource and cluster management.
* Apache Spark Standalone: a simple cluster manager available as part of the Spark distribution.
* Apache Mesos: a general purpose, distributed, OS-level, push-based scheduler and resource manager.
* Apache Hadoop YARN: a distributed computing framework for monolithic job scheduling and cluster resource management for Hadoop clusters (Apache/CDH/HDP).
We will see some benchmarks and features that Kubernetes provides but that are not present (or not mature enough) in the Apache ecosystem, and how using one or both can still improve performance. We will dive deep into the fundamentals of Kubernetes and of the Apache distributions' resource and cluster management systems and job scheduling, to get a clear idea of both ecosystems and why each is best in particular cases like Big Data, Machine Learning, load balancing, and so on. Applications are containerized in Kubernetes Pods, a Kubernetes Service is used as a load balancer, Kubernetes achieves high availability through the distribution of Pods across worker nodes, and local storage, persistent volumes, networking, and many other features will be compared side by side with the Apache ecosystem. Likewise in Mesos, Application Groups model dependencies as a tree of groups and components are started in dependency order, Mesos-DNS works as a basic load balancer, applications are distributed among slave nodes, there is a two-level scheduling mechanism relying on modern kernel "cgroups" in Linux and "zones" in Solaris, and so on. Along with the comparison and benchmarks, the talk will provide a practical guide to using Apache projects with Kubernetes. The audience will understand the software system design and the generic problems of processing requests through cluster and resource managers, and why it is important to have a modular, microservice-based, loosely coupled software design, so that it can easily run under container- or OS-level cluster management systems. This talk is clearly not about showing who is winning, but about how you can win in your own situation.

Shekhar is passionate about open source software and active in various open source projects. During his college days he contributed to SymPy, a Python library for symbolic mathematics; data science related Ruby gems like daru, daru-view (author), and nyaplot, which are under the Ruby Science Foundation (SciRuby); Bundler, a gem to bundle gems; NumPy & SciPy, creating their interactive websites and documentation websites using Sphinx and the Hugo framework; CloudCV, migrating the AngularJS application to Angular 8; and a few others. He successfully completed Google Summer of Code 2016 & 2017 and mentored students afterwards in 2018 and 2019. Shekhar has also talked about the daru-view gem at RubyConf India 2018 and about SymPy & SymEngine at PyCon India 2017.

Thursday 16:15 UTC
Column encryption & Data Masking in Parquet - Protecting data at the lowest layer
Pavi Subenderan, Xinli Shang

In a typical big dataset, only a minority of columns are actually sensitive and need to be protected. Columnar file formats like Apache Parquet allow for column-level access control through encryption. This means the small number of sensitive columns in a dataset can be protected through encryption, while the non-sensitive columns can be open for access. Data masking features for encrypted columns bring further convenience and allow users to leverage encrypted columns even without access to them. The combination of column encryption and data masking maximizes accessibility to your data without compromising the security of sensitive data. In the first half, we will go over the column encryption design and features in Parquet. We will cover considerations when operating Parquet column encryption in production, such as the Key Management Service, performance tradeoffs, encryption algorithm choice, etc. In the second half, we will cover the new data masking features in Parquet. There will be a discussion of the motivation behind data masking, the security implications of masking, and its implementation. Finally, we will look at the tradeoffs between the different types of masks and the limitations of each type in terms of compression ratio, table joins, usability, and administration overhead.
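
To give a flavor of what column-level encryption looks like to a developer, here is a hedged sketch against the low-level properties API in parquet-mr 1.12+; the column name and hard-coded demo keys are invented, and a production setup would go through a Key Management Service as discussed in the talk.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.parquet.crypto.ColumnEncryptionProperties;
    import org.apache.parquet.crypto.FileEncryptionProperties;
    import org.apache.parquet.hadoop.metadata.ColumnPath;

    public class ColumnEncryptionSketch {
      public static FileEncryptionProperties buildProperties() {
        byte[] footerKey = new byte[16];   // 128-bit AES keys, demo only; use a KMS in practice
        byte[] ssnKey = new byte[16];

        ColumnEncryptionProperties ssnColumn =
            ColumnEncryptionProperties.builder("ssn").withKey(ssnKey).build();

        Map<ColumnPath, ColumnEncryptionProperties> encryptedColumns = new HashMap<>();
        encryptedColumns.put(ColumnPath.get("ssn"), ssnColumn);

        // Only "ssn" is encrypted; all other columns remain readable without keys.
        // The result would be passed to a ParquetWriter builder via withEncryption(...).
        return FileEncryptionProperties.builder(footerKey)
            .withEncryptedColumns(encryptedColumns)
            .build();
      }
    }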

Pavi Subenderan:
Pavi is a Software Engineer on Uber's Data Infra team. His focus is on data security, privacy and open source big data technologies. He has been working on Parquet column encryption for 1.5 years and more recently on data masking.
Xinli Shang:
Xinli Shang is a tech lead on the Uber Data Infra team and an Apache Parquet Committer. He is passionate about big data file formats for efficiency, performance, and security, and about tuning large scale services for performance, throughput, and reliability. He is an active contributor to Apache Parquet. He also has many years of experience developing large scale distributed systems like S3 Index and the Windows operating system.

Thursday 16:55 UTC
GDPR’s Right to be Forgotten in Apache Hadoop Ozone
Dinesh Chitlangia

Apache Hadoop Ozone is a robust, distributed key-value object store for Hadoop with a layered architecture and strong consistency. It isolates namespace management from the block and node management layer, which allows users to independently scale on both axes. Ozone is interoperable with the Hadoop ecosystem as it provides OzoneFS (a Hadoop-compatible file system API), data locality, and plug-n-play deployment with HDFS: it can be installed in an existing Hadoop cluster and can share storage disks with HDFS. Ozone solves the scalability challenges of HDFS by being size agnostic. Consequently, it allows users to store trillions of files in Ozone and access them as if they were on HDFS. Ozone plugs into existing Hadoop deployments seamlessly, and programs like YARN, MapReduce, Spark, and Hive work without any modifications. In an era of increasing need for data privacy and regulation, Ozone also provides built-in support for GDPR compliance with a strong focus on the Right to be Forgotten, i.e., data erasure. At the end of this presentation the audience will be able to understand:
1. HDFS scalability challenges
2. Ozone's architecture as a solution
3. Overview of GDPR
4. GDPR implementation in Ozone

Dinesh is a Software Engineer with ~10 years of strong expertise in Java and distributed systems. He has been involved with the Hadoop ecosystem for the last 4 years and is an Apache Hadoop Committer. Dinesh is currently working at Cloudera as a proactive support consultant for Premier customers and enjoys contributing to the open source community. Outside of technology, Dinesh has a serious hobby in landscape/cityscape photography.

Thursday 17:35 UTC
Global File System View Across all Hadoop Compatible File Systems with the LightWeight Client Side Mount Points.
Uma Maheswara Rao Gangumalla

The Apache Hadoop File System layer has integrations with many popular storage systems, including cloud storage such as S3 and Azure Data Lake Storage, along with the in-house Apache Hadoop Distributed File System. When users want to migrate data between file systems, it's very difficult for them to update their meta storages when they persist file system paths with schemes. For example, Apache Hive persists URI paths in its metastore. In Apache Hadoop, we came up with a solution (HDFS-15289) for this problem: the ViewFileSystem Overload Scheme with a configurable scheme and mount points. In this talk, we will cover in detail how users can enable it and how easily users can migrate data between file systems without modifying their metastore. It's completely transparent to users with respect to file paths. We will present one of the use cases with Apache Hive partitioning: the user can move some of their partition data to a remote file system and just add a mount point on the default file system (e.g., HDFS) they were working with. Hive queries will then work transparently from the user's point of view even though the data resides in a remote storage cluster, e.g., Apache Hadoop Ozone or S3. This is very useful when users want to move certain kinds of data, e.g., cold partitions or small files, to remote clusters from a primary HDFS cluster without affecting applications. The mount tables are maintained at a central server; all clients load the tables while initializing the file system and can also refresh them when mount points are modified, so that all initializing clients stay in sync. This makes it much easier and more flexible for users to migrate data between cloud and on-premises storage.
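
A hedged Java sketch of what such a client-side mount configuration might look like is shown below; the namespace name, Ozone target URI, and paths are placeholders, and the same keys could just as well live in core-site.xml.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountPointSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ns1");
        // Keep the hdfs:// scheme but route it through the client-side mount table.
        conf.set("fs.hdfs.impl", "org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme");
        conf.set("fs.viewfs.overload.scheme.target.hdfs.impl",
            "org.apache.hadoop.hdfs.DistributedFileSystem");
        // Mount table for cluster "ns1": /archive lives on a remote Ozone bucket,
        // everything else falls back to the original HDFS namespace.
        conf.set("fs.viewfs.mounttable.ns1.link./archive",
            "o3fs://bucket.volume.ozone-om:9862/archive");
        conf.set("fs.viewfs.mounttable.ns1.linkFallback", "hdfs://ns1/");

        FileSystem fs = FileSystem.get(new URI("hdfs://ns1"), conf);
        fs.listStatus(new Path("/archive"));  // served by Ozone; the path is unchanged for the user
      }
    }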

Uma Maheswara Rao Gangumalla is an Apache Software Foundation Member[1]. He is an Apache Hadoop, BookKeeper, and Incubator committer, a member of the Project Management Committee[2], and a long-term active contributor to the Apache Hadoop project. He is also mentoring several incubator projects at Apache. Uma holds a bachelor's degree in Electronics and Communications Engineering. He has more than 13 years of experience in the design and development of large scale distributed software platforms. Currently, Uma is working as a Principal Software Engineer at Cloudera, Inc., California, and primarily focuses on open source big data technologies. Prior to this, Uma worked as a Software Architect at Intel Corporation, California. [1] https://www.apache.org/foundation/members.html [2] http://people.apache.org/phonebook.html?uid=umamahesh

Thursday 18:15 UTC
Apache Hadoop YARN fs2cs: Converting Fair Scheduler to Capacity Scheduler
Benjamin Teke

Apache Hadoop YARN has two popular schedulers, Fair Scheduler and Capacity Scheduler. Although the two are based on different principles, convergent evolution pushed them to be similar both in functionality and in the feature set they offer. By now it seems to be a good idea to merge the two, or to choose one over the other, so the entire user base can enjoy one unified support effort and knowledge base. In this talk, we will present our approach, which is to offer users a way to migrate from Fair Scheduler to Capacity Scheduler by exploring migration paths and filling feature parity gaps. We will also talk about the challenges and the aspects of the migration that need some engineering effort in order to preserve the results of fine-tuning Fair Scheduler installations over many years. We will explain why Capacity Scheduler is our scheduler of choice, the way we analyzed differences between the two schedulers, and how we found migration paths, and finally we will present a tool (fs2cs) we developed to help users automate the process.

Benjamin is a senior software developer with many years of experience in the presentation of big data for the telecom industry (mainly Kafka and HBase). Since early 2020, he has been an integral part of the YARN team at Cloudera. He gained general knowledge of YARN, and recently he started to specialise in schedulers. He lives in Budapest and, besides his interest in photography and cars, he is passionate about automating his home via IoT.

Thursday 18:55 UTC
HDFS Migration from 2.7 to 3.3 and enabling Router Based Federation (RBF) in production
Akira Ajisaka

In a production HDFS cluster at Yahoo! JAPAN, the namespace has become too large and won't fit in a single NameNode in the near future. Therefore we want to split the big namespace into several smaller ones and use federation. There is an existing ViewFS solution, but clients need to add the mount table to their configs when using ViewFS. Our clusters have too many clients, so we want to minimize the change to client configs. RBF is a new solution: in contrast to ViewFS, the Router manages the mount table, and clients don't have to set the mount table explicitly. In this talk, I will introduce the internals of RBF and how we configured the mount tables to load-balance among namespaces in production. RBF has some limitations; for example, the rename (mv) operation is not allowed between different namespaces. I will talk about how we work around these limitations. In addition, the developers in the community are going to eliminate some of the limitations, and I'll introduce that progress as well. Next, I will introduce the improvements in recent HDFS. For example, multiple standby NameNodes, observer NameNodes, and DataNode maintenance mode are features that will greatly reduce the operation cost. I'm going to introduce them and how to enable these features. Upgrading from 2.7 to 3.3 is a big jump and we hit many incompatible changes. For administrators who are going to upgrade their HDFS clusters, I would like to introduce as many of the differences as possible.

Akira Ajisaka develops and validates new features of Apache Hadoop, such as HDFS Router Based Federation, for our use. He also troubleshoots and improves management/operations in our Hadoop clusters. He maintains Apache Hadoop to improve its quality as an Apache Hadoop/Yetus committer and PMC member.

Thursday 19:35 UTC
Apache Beam: using cross-language pipeline to execute Python code from Java SDK
Alexey Romanenko

There are many reasons why we would need to execute Python code in Java data processing pipelines (and vice versa), e.g. Machine Learning libraries, IO connectors, or the user's own Python code, and several different ways to do that. With Python 2 reaching its end of life this year, it's getting more challenging, since not all of the old solutions still work well with Python 3. One of the potential options for this is using a cross-language pipeline and the Portable Runner in Apache Beam. In this talk I'm going to cover what a cross-language pipeline in Beam is, how to create a mixed Java/Python pipeline, how to set it up and run it, and what kind of requirements and pitfalls we can expect in this case. Also, I'll show a demo of a use case where we need to execute custom user Python 3 code in the middle of a Java SDK pipeline and run it with the Portable Spark Runner.
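
As a taste of what a mixed pipeline can look like, here is a hedged Java sketch using the PythonExternalTransform convenience wrapper that recent Beam releases provide on top of the expansion service; the Python transform name and pipeline contents are invented, and the talk's own demo may wire up the expansion service more explicitly.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.python.PythonExternalTransform;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    public class CrossLanguageSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> input = p.apply(Create.of("a", "b", "c"));

        // Expand a user-defined Python transform inside a Java pipeline; the
        // expansion service and Python environment are resolved at expansion time.
        input.apply(
            PythonExternalTransform.<PCollection<String>, PCollection<String>>from(
                "my_pkg.my_module.MyPythonTransform"));

        p.run().waitUntilFinish();  // e.g. on a portable runner such as the Spark one
      }
    }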

Alexey Romanenko is a Principal Software Engineer at Talend France, with more than 18 years of experience in software development. During his career, he has worked on very different projects, like high-load web services, web search engines, and cloud storage. He is an Apache Beam PMC member and committer, and he has contributed to different Beam IO components and the Spark Runner.