Schedule

Apache Avro to ORC Conversion

Jeff Evans from StreamSets

May 13th, 2019, 12:30 PM (duration: 20 minutes)

Location: Emporium Logan Square Arcade Bar

Apache Avro and Apache ORC are two widely used formats for efficiently storing large amounts of data, which is useful in big data applications. While they are often found "under the hood" in tools like Hive, an increasing number of use cases require handling these files in a standalone manner (for example, in a cloud storage environment). In this talk, I will discuss a motivating customer use case for converting Avro files to ORC. I will then walk through the standalone conversion code I wrote for our open source StreamSets Data Collector product to perform the conversion without passing through Hive (as many solutions do). The discussion will close with two short demos. In the first, I will demonstrate the conversion happening in a standalone command-line process. In the second, I will show how the conversion can be performed on very large input files via MapReduce jobs on a Hadoop cluster.

A Case Study (What We Did) presented as a Regular Talk in the Startups track.