Advanced Topics
Performance Tuning

Performance Tuning

This page contains a number of tips and tricks for improving the performance of cloudquery sync for large cloud estates.

Use Wildcard Matching

Sometimes the easiest way to improve the performance of the sync command is to limit the number of tables that get synced. The tables and skip_tables source config options both support wildcard matching. This means that you can use * anywhere in a name to match multiple tables.

For example, when using the aws source plugin, it is possible to use a wildcard pattern to match all tables related to AWS EC2:

 - aws_ec2_*

This can also be combined with skip_tables. For example, let's say we want to include all EC2 tables, but not EBS-related ones:

- "aws_ec2_*"
- "aws_ec2_ebs_*"

The CloudQuery CLI will warn if a wildcard pattern does not match any known tables.

Improve Performance by Skipping Relations

Some tables require many API calls to sync. This is especially true of tables that depend on other tables, because often multiple API calls need to be made for every row in the parent table. This can lead to thousands of API calls, increasing the time it takes to sync. If you know that some child tables are not strictly necessary, you can skip them by default by setting the skip_dependent_tables source config option to true:

kind: source
  skip_dependent_tables: true

Setting skip_dependent_tables: true will cause tables that depend on other tables to not be synced unless they are selected by the tables setting.

Let's say we have three tables: A, B and C. A is the top-level table. B depends on it, and C depends on B:

↳ B
  ↳ C

We might want table A, but not need table B or C. We have two options:

  1. Set skip_dependent_tables: true

    tables: ["A"]
    skip_dependent_tables: true

    This skips B and C by not syncing dependent tables unless they are explicitly included.


  2. Use skip_tables:

    tables: ["A"]
    skip_tables: ["B"]

    Note how this skips both B and C by skipping B: C won't be included because it depends on B. You can also explicitly skip both B and C.

Tune Concurrency

The concurrency setting, available for all source plugins as part of the source spec, controls the approximate number of concurrent requests that will be made while performing a sync. Setting this to a low number will reduce the number of concurrent requests, reducing the memory used and making the sync less likely to hit rate limits. The trade-off is that syncs will take longer to complete.

Use a Different Scheduler

By default, CloudQuery syncs will fetch all tables in parallel, writing data to the destination(s) as they come in. However, the concurrency setting, mentioned above, places a limit on how many table-clients can be synced at a time. What "table-client" means depends on the source plugin and the table. In AWS, for example, a client is usually a combination of account and region. Get all the combinations of accounts and regions for all tables, and you have all the table-clients for a sync. For the GCP source plugin, clients generally map to projects.

The default CloudQuery scheduler, known as dfs, will sync up to concurrency / 100 table-clients at a time (we are ignoring child relations for the purposes of this discussion). Let's take an example GCP cloud estate with 5000 projects, syncing 100 tables. This makes for approximately 500,000 table-client pairs, and a concurrency of 10,000 will allow 100 table-client pairs to be synced at a time. The dfs scheduler will start with the first table and its first 100 projects, and then move on to finish all projects for that table before moving on to the next table. This means, in practice, only one table is really being synced at a time!

Usually this works out fine, as long as the cloud platform's rate limits are aligned with the clients. But if rate limits are applied per-table, rather than per-project, dfs can be suboptimal. A better strategy in this case would be to choose the first client for every table before moving on to the next client. This is what the round-robin scheduler does.

Only some plugins support this setting. The following example config enables round-robin scheduling for the GCP source plugin:

kind: source
  name: "gcp"
  path: "cloudquery/gcp"
  version: "v9.5.4"
  tables: ["gcp_storage_*", "gcp_compute_*"]
  destinations: ["postgresql"]
    scheduler: "round-robin"
    project_ids: ...

Finally, the shuffle strategy aims to provide a balance between dfs and round-robin by randomizing the order in which table-client pairs are chosen. The following example enables shuffle for the AWS plugin, which can help reduce the likelihood of hitting rate limits by randomly mixing the underlying services to which API calls that are made concurrently, rather than hitting a single API with all calls at once:

kind: source
  name: "aws"
  path: "cloudquery/aws"
  version: "v22.12.0"
  tables: ["aws_account_*", "aws_s3_*", "aws_ec2_*"]
  destinations: ["postgresql"]
    scheduler: "shuffle"
    # ...