Building a streaming incremental data warehouse -- CDC based on data lake formats


Chinese Session 2022-07-30 14:50 GMT+8  #streaming

As data lake formats gain adoption, there is a continuing need to explore how they can integrate better with the existing big data ecosystem in real production environments and address the pain points of current big data and data warehouse architectures. This talk explores how to combine Apache Hudi and Apache Spark to implement a CDC (change data capture) solution and build a complete streaming incremental data warehouse for the classic data warehouse CDC scenario.


Yan Bi: Technical expert, open source big data platform, Computing Platform Business Division, Alibaba Cloud Intelligence. I work in the open source big data department of the Alibaba Cloud computing platform, focusing on Apache Spark, Apache Hudi, and other open source projects, as well as their integration with the Alibaba Cloud EMR and DLF products.