Adopting Apache Arrow at CloudQuery
We will walk through some of the architecture and design decisions that led us to adopt Apache Arrow as our in-memory type system.
Yevgeny Pats • Apr 26, 2023
This post is a collaboration with and cross-posted on the Arrow blog.
CloudQuery is an open source high performance ELT framework written in Go. We previously discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
Let’s quickly recap what type system is and why it is needed in an ELT framework. At a high level ELT framework extracts data from a source and moves it to a destination with a specific schema.
API ---> [Source Plugin] -----> [Destination Plugin] -----> [Destination Plugin] gRPC
Sources and destinations are decoupled and communicate via gRPC. This is crucial because this way we can add new destinations and update old destinations without updating source plugins code(which otherwise would introduce an unmaintainable architecture).
This is where the type system comes in. Source plugin extracts information from APIs in the most performant way, defines a schema and then transforms the result from the API (JSON or any other format) to a well-defined type system so the destination plugin will be able to easily create the schema for its database and transform the data to the destination type. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
This is where Arrow comes in. Apache arrow defines a cross language columnar format for flat and hierarchical data, and brings the following advantages:
- Performance: Arrow adoption is rising especially in columnar based databases (DuckDB, ClickHouse, BigQuery) and file formats (Parquet) which makes it easier to write CloudQuery destination or source plugins for databases that already support arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization.
- Rich Data Types: Arrow supports more than 35 types including composite types (i.e. lists, structs and maps of all the available types) and ability to extend the type system with custom types.
Adopting Apache Arrow as the CloudQuery in-memory type system enables us to gain better performance, data interoperability and developer experience. Some plugins that are going to gain an immediate boost of rich type systems are our db->db replication plugins such as PostgreSQL CDC source plugin (and all database destinations) that are going to get support for all available types including nested ones.
We are excited about this step and joining the growing Arrow community. We already contributed more than 30 upstream pull requests that were quickly reviewed by the Arrow maintainers, thank you!