File

Official plugin published by cloudquery.

Repository: github.com
Latest version: v4.0.3
Type: Destination
Date Published: Mar 12, 2024
Price: Free

Overview

File Destination Plugin

This destination plugin lets you sync data from a CloudQuery source to local files in various formats. It currently supports CSV, line-delimited JSON and Parquet.
This plugin is useful in local environments, but also in production environments where scalability, performance and cost matter. For example, it can be used as part of a system that syncs sources across multiple virtual machines, uploads Parquet files to remote storage (such as S3 or GCS), and finally loads them into data lakes such as BigQuery or Athena in batch mode. If this is your end goal, you may also want to look at the more specific cloud storage destination plugins, such as S3, GCS or Azure Blob Storage.
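For instance, a data-lake-oriented setup might write date-partitioned Parquet files that a separate process later uploads to object storage. A minimal sketch of such a destination spec (the output path and partitioning scheme here are illustrative assumptions, not defaults):

kind: destination
spec:
  name: "file"
  path: "cloudquery/file"
  registry: "cloudquery"
  version: "v4.0.3"
  write_mode: "append"
  spec:
    # Date-partitioned layout that batch loaders such as BigQuery or Athena can pick up
    path: "cq_output/{{TABLE}}/{{YEAR}}/{{MONTH}}/{{DAY}}/{{UUID}}.{{FORMAT}}"
    format: "parquet"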

Example

This example configures the file destination to create CSV files under path/to/files. You can also choose json or parquet as the output format.
kind: destination
spec:
  name: "file"
  path: "cloudquery/file"
  registry: "cloudquery"
  version: "v4.0.3"
  write_mode: "append"
  spec:
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "csv" # options: parquet, json, csv
    # Optional parameters
    # format_spec:
      # CSV-specific parameters:
      # delimiter: ","
      # skip_header: false
    # compression: "" # options: gzip
    # no_rotate: false
    # batch_size: 10000
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s
Note that the file plugin only supports append write_mode. The (top level) spec section is described in the Destination Spec Reference.
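To actually run a sync, pair this destination with a source spec whose destinations list references the destination by name. A minimal, hypothetical sketch (the aws source, the version placeholder and the table selection are illustrative, not part of this plugin's documentation):

kind: source
spec:
  name: "aws"                 # hypothetical; any CloudQuery source works here
  path: "cloudquery/aws"
  registry: "cloudquery"
  version: "vX.Y.Z"           # substitute the latest released source version
  tables: ["aws_s3_buckets"]  # hypothetical table selection
  destinations: ["file"]      # must match the destination name above

With both specs in place, running cloudquery sync against the spec file(s) writes the selected tables to local files.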

File Spec

This is the (nested) spec used by the file destination plugin.
  • path (string) (required)
    Path template string that determines where files will be written. The path supports the following placeholder variables:
    • {{TABLE}} will be replaced with the table name
    • {{FORMAT}} will be replaced with the file format, such as csv, json or parquet. If compression is enabled, the format will be csv.gz, json.gz etc.
    • {{UUID}} will be replaced with a random UUID to uniquely identify each file
    • {{YEAR}} will be replaced with the current year in YYYY format
    • {{MONTH}} will be replaced with the current month in MM format
    • {{DAY}} will be replaced with the current day in DD format
    • {{HOUR}} will be replaced with the current hour in HH format
    • {{MINUTE}} will be replaced with the current minute in mm format
    Note that timestamps are in UTC and will be the current time at the time the file is written, not when the sync started.
  • format (string) (required)
    Format of the output file. Supported values are csv, json and parquet.
  • format_spec (format_spec) (optional)
    Optional parameters to change the format of the file.
  • no_rotate (boolean) (optional) (default: false)
    If set to true, the plugin will write to one file per table. Otherwise, for every batch a new file will be created with a different .<UUID> suffix.
  • compression (string) (optional) (default: "")
    Compression algorithm to use. Supported values are "" and gzip. Not supported for parquet format.
  • batch_size (integer) (optional) (default: 10000)
    Number of records to write before starting a new file.
  • batch_size_bytes (integer) (optional) (default: 52428800 (50 MiB))
    Number of bytes (as Arrow buffer size) to write before starting a new file.
  • batch_timeout (duration) (optional) (default: 30s (30 seconds))
    Maximum interval between batch writes.
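Putting several of these options together, here is a sketch of a gzip-compressed, line-delimited JSON configuration with tuned batching (the path, sizes and timeout are illustrative choices, not recommendations):

kind: destination
spec:
  name: "file"
  path: "cloudquery/file"
  registry: "cloudquery"
  version: "v4.0.3"
  write_mode: "append"
  spec:
    # resolves to e.g. cq_output/my_table/2024/03/12/10/<uuid>.json.gz
    path: "cq_output/{{TABLE}}/{{YEAR}}/{{MONTH}}/{{DAY}}/{{HOUR}}/{{UUID}}.{{FORMAT}}"
    format: "json"
    compression: "gzip"          # {{FORMAT}} expands to json.gz
    batch_size: 50000            # rotate files after 50k records...
    batch_size_bytes: 104857600  # ...or 100 MiB of buffered data...
    batch_timeout: 60s           # ...or 60 seconds, whichever comes first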

format_spec

  • delimiter (string) (optional) (default: ,)
Character that will be used as the delimiter when the format is csv.
  • skip_header (boolean) (optional) (default: false)
If set to true, the header row will not be written as the first line of the file (when format is csv).
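For example, a nested spec that writes semicolon-delimited CSV files without a header row (the delimiter and path are chosen purely for illustration):

  spec:
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "csv"
    format_spec:
      delimiter: ";"
      skip_header: true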

