Tutorials

Moving Data from PostgreSQL to MotherDuck

Q: Do I Need to Change My PostgreSQL Configuration for CDC?

Yes. CDC requires `wal_level = logical` on your PostgreSQL server. If you're running PostgreSQL locally with Docker, pass `-c "wal_level=logical"` when starting the container. On AWS RDS, set `rds.logical_replication = 1` in a custom parameter group. On GCP Cloud SQL, set `cloudsql.logical_decoding = on`.

Q: Does the DuckDB Integration Support Windows?

Not currently. The [DuckDB destination plugin](https://github.com/cloudquery/cloudquery/issues/8854) does not support Windows due to dynamic linking requirements in the DuckDB Go library. It works on macOS and Linux.

Q: What Write Modes Does the DuckDB Destination Support?

The DuckDB destination supports three write modes: `overwrite-delete-stale` (default, replaces existing rows and removes stale ones), `overwrite` (replaces existing rows without deleting stale ones), and `append` (adds new rows without modifying existing data).

Q: Can I Sync Multiple Tables at Once?

Yes. The `tables` field in the source configuration accepts a list. You can specify individual table names or use wildcards like `['public.*']` to sync all tables in a schema.

Herman Schaaf

•

6 min read

MotherDuck is a serverless analytics service powered by DuckDB. If you have data in a PostgreSQL database that you'd like to analyze using MotherDuck, this tutorial walks through how to set that up with CloudQuery. We cover two approaches:

Copying a PostgreSQL database into MotherDuck as a one-off batch operation, and
Continuously streaming PostgreSQL data to MotherDuck using PostgreSQL Change Data Capture (CDC).

TL;DR: CloudQuery syncs data from PostgreSQL to MotherDuck with two modes: a one-off batch sync for ad hoc analysis, or continuous CDC streaming that mirrors changes in real time. Both use a single YAML configuration file and the CloudQuery CLI.

Which option you choose depends on your use case. Either way, by the end of this tutorial you'll be querying your PostgreSQL data in MotherDuck.

Moving Data from PostgreSQL to MotherDuck YouTube Video

The video above was recorded in 2023. Some configuration options have changed since then (notably CDC setup). The written instructions below reflect the latest syntax and parameters.

How Do You Create a Test Table in PostgreSQL? (Optional) #

To demonstrate the steps in this tutorial, we'll create a customers table in PostgreSQL and insert 10,000 generated rows:

select
  md5(random()::text) as id,
  'person' || num || '@' ||
  (case (random() * 2)::integer
    when 0 then 'gmail'
    when 1 then 'hotmail'
    when 2 then 'yahoo'
  end) || '.com' as email,
  now() as created_at
into customers
from generate_series(1,10000) as num;
alter table customers add primary key (id);

And let's test it out:

select * from customers limit 3;

+----------------------------------+---------------------+------------------------------+
| id                               | email               | created_at                   |
|----------------------------------+---------------------+------------------------------|
| 216016a4c29d66dd007cd1c9fa9994b4 | [email protected] | 2023-06-05 14:04:38.15548+01 |
| de75a87f5f797248fd47ffcb15b7fff1 | [email protected]   | 2023-06-05 14:04:38.15548+01 |
| 279838446e9890b00e92221f25bb0c73 | [email protected]   | 2023-06-05 14:04:38.15548+01 |
+----------------------------------+---------------------+------------------------------+
SELECT 3

How Do You Install CloudQuery? #

CloudQuery is a cross-platform CLI that can run locally, on a virtual machine, or in a containerized environment. In this tutorial we'll run it locally to copy data from a PostgreSQL database to MotherDuck. On macOS, install it using Homebrew:

brew install cloudquery/tap/cloudquery

See the CloudQuery Quickstart guide for installation instructions on other platforms. The DuckDB integration (used in this tutorial) does not support Windows at this time.

Once installed, verify it from the command line:

$ cloudquery --version
cloudquery version VERSION_CLI

How Do You Configure the Sync? #

The CloudQuery CLI is configured using YAML files. Create a new file called postgres-to-motherduck.yml with the following content:

kind: source
spec:
  name: 'postgresql'
  path: 'cloudquery/postgresql'
  registry: 'cloudquery'
  version: 'v6.13.10'
  destinations: ['motherduck']
  tables: ['customers']
  spec:
    connection_string: 'postgresql://postgres:pass@localhost:5432/cloudquery?sslmode=disable'
---
kind: destination
spec:
  name: 'motherduck'
  version: 'v6.3.8'
  registry: 'cloudquery'
  path: 'cloudquery/duckdb'
  write_mode: 'overwrite-delete-stale'
  migrate_mode: 'safe'
  spec:
    connection_string: 'md:'

A typical CloudQuery configuration file has two parts: a source and a destination. The PostgreSQL database is the source here, configured to copy the customers table from a locally running instance. The destination is the DuckDB integration, and instead of writing to a local file we've pointed it at MotherDuck. You can also use a named database, like md:cloudquery.

In production, use environment variable expansion for connection strings and secrets:

spec:
  connection_string: '${PG_CONNECTION_STRING}'

How Do You Authenticate with MotherDuck? #

To authenticate with MotherDuck, export your service token as an environment variable. You can find the token in the MotherDuck app by clicking your organization name in the top left, then navigating to Settings and creating a new token. See the MotherDuck documentation for the latest instructions on token management.

Copy the token and export it:

export motherduck_token=<INSERT YOUR TOKEN HERE>

You can also add this to your .bashrc or .zshrc so it loads automatically. If you do, remember to run source ~/.bashrc or source ~/.zshrc before continuing.

You can also pass the token directly in the configuration file using environment variable substitution:

spec:
  connection_string: 'md:my_db?motherduck_token=${MOTHERDUCK_TOKEN}'

How Do You Run the Sync? #

The last step is to start the sync. CloudQuery will automatically create tables in MotherDuck and begin loading data from PostgreSQL.

Option 1: Run a One-Off Batch Sync #

In your terminal, run:

cloudquery sync postgres-to-motherduck.yml

This loads all rows from PostgreSQL into MotherDuck and exits once the data has been copied. You should see output similar to the following:

Video walkthrough of running a one-off CloudQuery sync from PostgreSQL to MotherDuck

Option 2: Run a Continuous Sync with CDC #

To continuously stream changes from PostgreSQL to MotherDuck, add the cdc_id option to the source configuration. This enables PostgreSQL logical replication, which requires wal_level = logical on your PostgreSQL server.

kind: source
spec:
  name: 'postgresql'
  path: 'cloudquery/postgresql'
  registry: 'cloudquery'
  version: 'v6.13.10'
  destinations: ['motherduck']
  tables: ['customers']
  spec:
    cdc_id: 'postgresql' # <- Unique identifier for this CDC source
    connection_string: 'postgresql://postgres:pass@localhost:5432/cloudquery?sslmode=disable'
---
kind: destination
spec:
  name: 'motherduck'
  path: 'cloudquery/duckdb'
  registry: 'cloudquery'
  version: 'v6.3.8'
  write_mode: 'overwrite-delete-stale'
  migrate_mode: 'safe'
  spec:
    connection_string: 'md:cloudquery'

The cdc_id value must be a unique string across all your CloudQuery sources. It identifies this particular CDC stream.

Now start the sync:

cloudquery sync postgres-to-motherduck-cdc.yml

This time the sync runs continuously, and any changes made to the PostgreSQL database are automatically replicated to MotherDuck.

Video walkthrough of setting up continuous sync with Change Data Capture from PostgreSQL to MotherDuck

Taking It to Production #

So far we've been running syncs from a local machine, which is great for testing but not a long-term solution. For production workloads, you'll want to schedule the sync and run it in a managed environment. The CloudQuery Deployment Documentation covers options including Kubernetes, ECS, Cloud Run, GitHub Actions, and more.

For teams that want managed scheduling, monitoring, and collaboration without maintaining their own infrastructure, CloudQuery Platform handles all of this out of the box.

Try CloudQuery Platform

See how CloudQuery makes it straightforward to sync PostgreSQL data to MotherDuck and other destinations. Check out the DuckDB destination docs or browse all available integrations.

Schedule a Demo

Frequently Asked Questions #

What Is the Difference Between a One-Off Sync and CDC? #

A one-off sync copies all data from PostgreSQL to MotherDuck and exits. CDC (Change Data Capture) uses PostgreSQL logical replication to continuously stream changes as they happen, keeping MotherDuck in near-real-time sync with your source database. Use one-off syncs for ad hoc analysis or backfills, and CDC for ongoing analytics on live data.

Do I Need to Change My PostgreSQL Configuration for CDC? #

Yes. CDC requires wal_level = logical on your PostgreSQL server. If you're running PostgreSQL locally with Docker, pass -c "wal_level=logical" when starting the container. On AWS RDS, set rds.logical_replication = 1 in a custom parameter group. On GCP Cloud SQL, set cloudsql.logical_decoding = on.

Does the DuckDB Integration Support Windows? #

Not currently. The DuckDB destination plugin does not support Windows due to dynamic linking requirements in the DuckDB Go library. It works on macOS and Linux.

What Write Modes Does the DuckDB Destination Support? #

The DuckDB destination supports three write modes: overwrite-delete-stale (default, replaces existing rows and removes stale ones), overwrite (replaces existing rows without deleting stale ones), and append (adds new rows without modifying existing data).

Can I Sync Multiple Tables at Once? #

Yes. The tables field in the source configuration accepts a list. You can specify individual table names or use wildcards like ['public.*'] to sync all tables in a schema.

How Do I Monitor a Running CDC Sync? #

The CloudQuery CLI outputs log messages as it processes changes. For production deployments, CloudQuery Platform provides a dashboard with sync status, error tracking, and alerting built in.

CloudOps