The data lake is currently a popular big data storage and query solution. Based on the current mainstream data lake format such as DeltaLake, Hudi and Iceberg, it can realize the near real-time writing of large-scale data. It also supports Update and Delete operations, which can realize database bin-log, etc. CDC-type data sources enter the lake. Spark Structured Streaming is a stream processing framework based on the MinBatch execute mode, which can provide better throughput performance for near real-time writing for the data lake format. Through Spark Structured Streaming we can efficiently realize the near real-time write operation of CDC data. This article mainly want to talk about the application of Spark Structured Streaming in the scene of CDC data source entering the data lake, and the technical difficulties involved, including the performance optimization of real-time Merge, The multi-version problem of CDC data and the solution in the scenario of CDC schema change.
ZhiweiPeng: Currently working in the Alibaba EMR team, mainly engaged in the research and development of data lake formats such as Hudi, and Delta.