BigQuery Plugin documentation

Official

BigQuery

The BigQuery plugin syncs data from any CloudQuery source plugin(s) to a BigQuery database running on Google Cloud Platform

Make sure to sign up and run cloudquery login before running your first sync.

Publisher

cloudquery

Repositorygithub.com

Latest version

v3.5.5

Type

Destination

Platforms

Date Published

April 30, 2024

Price

Free

Documentation Changelog

Overview Types

Overview

BigQuery Destination Plugin

The BigQuery plugin syncs data from any CloudQuery source plugin(s) to a BigQuery database running on Google Cloud Platform.

The plugin currently only supports a streaming mode through the legacy streaming API. This is suitable for small- to medium-sized datasets, and will stream the results directly to the BigQuery database. A batch mode of operation is being developed to support larger datasets, but this is not currently supported.

Streaming is not available for the Google Cloud free tier.

Before you begin

Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
Create a BigQuery dataset that will contain the tables synced by CloudQuery. CloudQuery will automatically create the tables as part of a migration run on the first sync.
Ensure that you have write access to the dataset. See Required Permissions for details.

Example config

kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v3.5.5"
  write_mode: "append"
  spec:
    project_id: ${PROJECT_ID}
    dataset_id: ${DATASET_ID}
    # Optional parameters
    # dataset_location: ""
    # time_partitioning: none # options: "none", "hour", "day"
    # service_account_key_json: ""
    # endpoint: ""
    # batch_size: 10000
    # batch_size_bytes: 5242880 # 5 MiB
    # batch_timeout: 10s

This example above expects the following environment variables to be set:

PROJECT_ID - The Google Cloud Project ID
DATASET_ID - The Google Cloud BigQuery Dataset ID

The BigQuery destination utilizes batching, and supports batch_size and batch_size_bytes.

Note that the BigQuery plugin only supports the append write mode.

Authentication

The BigQuery plugin authenticates using your Application Default Credentials. Available options are all the same options described here in detail:

Local Environment:

gcloud auth application-default login (recommended when running locally)

Google Cloud cloud-based development environment:

When you run on Cloud Shell or Cloud Code credentials are already available.

Google Cloud containerized environment:

When running on GKE use workload identity.

Google Cloud services that support attaching a service account:

Services such as Compute Engine, App Engine and functions supporting attaching a user-managed service account which will CloudQuery will be able to utilize.

On-premises or another cloud provider

The suggested way is to use Workload identity federation
If not available you can always use service account keys and export the location of the key via GOOGLE_APPLICATION_CREDENTIALS. (Not recommended as long-lived keys are a security risk)

BigQuery Spec

This is the top-level spec used by the BigQuery destination plugin.

project_id (string) (required)
The id of the project where the destination BigQuery database resides.

dataset_id (string) (required)
The name of the BigQuery dataset within the project, e.g. my_dataset. This dataset needs to be created before running a sync or migration.

dataset_location (string) (optional)
The data location of the BigQuery dataset. If set, will be used as the default location for job operations. Pro-tip: this can solve "dataset not found" issues for newly created datasets.

time_partitioning (string) (options: none, hour, day) (default: none)
The time partitioning to use when creating tables. The partition time column used will always be _cq_sync_time so that all rows for a sync run will be partitioned on the hour/day the sync started.
service_account_key_json (string) (optional) (default: empty).
GCP service account key content. This allows for using different service accounts for the GCP source and BigQuery destination. If using service account keys, it is best to use environment or file variable substitution.
endpoint (string) (optional)
The BigQuery API endpoint to use. This is useful for testing against a local emulator.
batch_size (integer) (optional) (default: 10000)
Number of records to write before starting a new object.
batch_size_bytes (integer) (optional) (default: 5242880 (5 MiB))
Number of bytes (as Arrow buffer size) to write before starting a new object.
batch_timeout (duration) (optional) (default: 10s (10 seconds))
Maximum interval between batch writes.

Underlying library

We use the official cloud.google.com/go/bigquery package for database connection.

Types

BigQuery Types

The BigQuery destination (v3.0.0 and later) supports most Apache Arrow types. The following table shows the supported types and how they are mapped to BigQuery data types.

Arrow Column Type	Supported?	BigQuery Type
Binary	✅ Yes	`BYTES`
Boolean	✅ Yes	`BOOL`
Date32	✅ Yes	`DATE`
Date64	✅ Yes	`DATE`
Decimal	✅ Yes	`BIGNUMERIC`
Dense Union	❌ No
Dictionary	❌ No
Duration[ms]	✅ Yes	`INT64`
Duration[ns]	✅ Yes	`INT64`
Duration[s]	✅ Yes	`INT64`
Duration[us]	✅ Yes	`INT64`
Fixed Size List	❌ No
Float16	❌ No
Float32	✅ Yes	`FLOAT64`
Float64	✅ Yes	`FLOAT64`
Inet	✅ Yes	`STRING`
Int8	✅ Yes	`INT64`
Int16	✅ Yes	`INT64`
Int32	✅ Yes	`INT64`
Int64	✅ Yes	`INT64`
Interval[DayTime]	✅ Yes	`RECORD<days: INT64, milliseconds: INT64>`
Interval[MonthDayNano]	✅ Yes	`RECORD<months: INT64, days: int64, nanoseconds: int64>`
Interval[Month]	✅ Yes	`RECORD<months: INT64>`
JSON	✅ Yes	`JSON`
Large Binary	✅ Yes	`BYTES`
Large List	❌ No
Large String	✅ Yes	`STRING`
List	✅ Yes	(Repeated column) †
MAC	✅ Yes	`STRING`
Map	❌ No
String	✅ Yes	`STRING`
Struct	✅ Yes	`RECORD`
Timestamp[ms]	✅ Yes	`TIMESTAMP`
Timestamp[ns]	✅ Yes	`RECORD<timestamp: TIMESTAMP, nanoseconds: INT64>`
Timestamp[s]	✅ Yes	`TIMESTAMP`
Timestamp[us]	✅ Yes	`TIMESTAMP`
UUID	✅ Yes	`STRING`
Uint8	✅ Yes	`INT64`
Uint16	✅ Yes	`INT64`
Uint32	✅ Yes	`INT64`
Uint64	✅ Yes	`NUMERIC`
Union	❌ No

Notes

† Repeated columns in BigQuery do not support null values. Right now, if an array contains null values, these null values will be dropped when writing to BigQuery. Also, because we use REPEATED columns to represent lists, lists of lists are not supported right now.