Technical tips for secure Apache Hadoop cluster

Akira Ajisaka, Kei KORI

English Session 2021-08-08 13:30 GMT+8  (ROOM : A) #bigdata

It is well known that Apache Hadoop is not secure by default, and it is easy to impersonate a privileged user if Kerberos is not enabled. In this talk, we will introduce further technical details about security. We will talk about the current “open-source” software capability for some security features in a vendor-neutral manner. We mainly introduce SSL/TLS for Hadoop and its related products (Apache Spark, Apache Hive, Apache ZooKeeper, etc.) and introduce HDFS data encryption. Regarding HDFS data encryption, we implemented Hadoop Credential Provider API to integrate Hadoop KMS with our internal credential store to store the secrets for encrypting/decrypting data encryption key. We also introduce the internals of HDFS data encryption and how to create your custom credential providers.

In addition, we will introduce the automatic keystore reloading, which is the latest feature of Apache Hadoop, and it enables updating SSL certificates without stopping the server. The feature is useful for operation.


Akira Ajisaka: Akira Ajisaka is a senior software engineer at Yahoo Japan Corporation. He develops and validates some new features of Apache Hadoop for our use. Also, he troubleshoots and improves management/operations in our Hadoop clusters. He maintenances Apache Hadoop to improve its quality as an Apache Hadoop/Yetus committer and PMC member.

Kei KORI: Kei KORI is a data platform engineer on the Hadoop development team and Kubernetes team at Yahoo Japan Corporation, where he works to deliver secure Hadoop clusters and their user ecosystem. He has over years of experience in the administration of Hadoop clusters and Kubernetes clusters with expertise in automation and continuous delivery of distributed systems.