Embracing Efficiency: Transitioning to a Single Deterministic Primary Key for Faster And Easier Updates and Fixes!
Changes to Primary Keys in the AWS Plugin
Ben Bernays • Feb 05, 2024
We have heard from many users that it can be difficult to keep up with the latest versions of Source Plugins. This is especially true for users who are building business-critical applications on top of CloudQuery data and cannot afford to drop tables to run migrations. Today we are happy to announce that, starting with version `v24.0.0` of the AWS Plugin, we are moving away from Compound Primary Keys to using the `_cq_id` field as the only Primary Key. Prior to this release, any change to a primary key necessitated a major version bump. Going forward, plugin developers who use this new capability will no longer need a schema change to alter the Primary Key, so they can release Primary Key fixes faster and with less user impact.
Primary Keys are an important capability for CloudQuery Tables: they let users be confident that data is not being duplicated, even as CloudQuery scales to handle hundreds of thousands of concurrent API calls. Unfortunately for plugin developers, API documentation is rarely explicit about which fields determine the uniqueness of a resource, so plugin developers are forced to make assumptions about the Primary Keys. Wrong assumptions can lead to duplicated data or, even worse, erroneously dropped rows. Because data integrity is one of the most critical aspects of any ETL solution, developers will prioritize releasing a major version to help users.
This functionality in the AWS plugin is based on a capability added to the open source Go Plugin SDK and is available to all developers writing plugins in Go. If you are writing plugins with one of the other SDKs (Python, TypeScript, or Java) and you are interested in this capability, please reach out to us so we can prioritize adding it to those SDKs.
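To make the idea of a deterministic key concrete, here is a minimal sketch in Go. It is not the SDK's actual algorithm: the `pkComponent` type, the `deterministicID` function, the hashing scheme, and the example column values are all assumptions made for illustration. The point it shows is that the destination only ever sees one key column, while the set of fields feeding the hash can change between plugin releases.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// pkComponent is a hypothetical name/value pair for one column that
// contributes to a row's identity.
type pkComponent struct {
	Name  string
	Value string
}

// deterministicID hashes the components in order and formats the first
// 16 bytes of the digest in an 8-4-4-4-12 hex layout.
// Illustrative scheme only; the real SDK may compute _cq_id differently.
func deterministicID(components []pkComponent) string {
	h := sha256.New()
	for _, c := range components {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(c.Name), c.Name, len(c.Value), c.Value)
	}
	sum := h.Sum(nil)
	return fmt.Sprintf("%x-%x-%x-%x-%x", sum[0:4], sum[4:6], sum[6:8], sum[8:10], sum[10:16])
}

func main() {
	// Hypothetical identity fields chosen by one plugin release.
	v1 := []pkComponent{
		{"account_id", "123456789012"},
		{"arn", "arn:aws:s3:::example-bucket"},
	}
	// A later release decides that region is also needed for uniqueness.
	v2 := append(append([]pkComponent{}, v1...), pkComponent{"region", "us-east-1"})

	// Same inputs always produce the same ID.
	fmt.Println(deterministicID(v1) == deterministicID(v1))
	// Different inputs produce a different ID, but the key column is unchanged.
	fmt.Println(deterministicID(v1) == deterministicID(v2))
}
```

In this sketch, correcting which fields define uniqueness changes the values stored in the key column, not the table's schema, which is why such a fix would not need a destination migration.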
- Easier Updates: With a single deterministic primary key that doesn't change from version to version, you can be confident that Source Plugin updates won't require a schema change on your end.
- CloudQuery Spec Configs: If you are using the AWS Plugin and have already set the `pk_mode: cq-id-only` option, you will see no change in behavior. In this case you can remove that option from your spec and the plugin will continue to work as expected.
- Adoption: If you are using the latest version (`v7.3.1`) of the Postgres Destination plugin, it will handle all of the migrations for you. If you are using an older version of the Postgres Destination plugin, or any other destination that supports write modes other than `append`, you will need to manually update your schema to remove the compound primary key and add the `_cq_id` field as the primary key. Alternatively, users can set `migrate_mode: forced` and CloudQuery will migrate the table by dropping the existing table and recreating it with the new schema.
- Performance: We expect that for most destinations that support Primary Keys this change will have a negligible or small positive impact on sync times. Depending on your queries, you might see an increase in latency if you were previously relying on the index behind the compound primary key. In these cases you can manually add indexes to your tables to improve performance.
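For the manual path described under Adoption and Performance, the sketch below generates the kind of PostgreSQL statements involved: drop the old compound key, make `_cq_id` the primary key, and re-create an index on the old key columns for existing queries. The `quoteIdent` and `migrationStatements` helpers, the constraint name, the index name, and the column list are all hypothetical. Look up the real constraint name and key columns in your own database, and try the statements on a copy first.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes a PostgreSQL identifier, escaping embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// migrationStatements returns DDL that swaps a compound primary key for
// _cq_id and keeps the old key columns indexed.
// constraint is the name of the table's current primary key constraint.
func migrationStatements(table, constraint string, oldKeyCols []string) []string {
	quoted := make([]string, len(oldKeyCols))
	for i, c := range oldKeyCols {
		quoted[i] = quoteIdent(c)
	}
	t := quoteIdent(table)
	return []string{
		// 1. Remove the existing compound primary key.
		fmt.Sprintf("ALTER TABLE %s DROP CONSTRAINT %s;", t, quoteIdent(constraint)),
		// 2. Make _cq_id the only primary key.
		fmt.Sprintf("ALTER TABLE %s ADD PRIMARY KEY (%s);", t, quoteIdent("_cq_id")),
		// 3. Keep lookups on the old key columns fast.
		fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s);",
			quoteIdent(table+"_old_pk_idx"), t, strings.Join(quoted, ", ")),
	}
}

func main() {
	// Table, constraint, and column names here are placeholders.
	stmts := migrationStatements("aws_s3_buckets", "aws_s3_buckets_cqpk",
		[]string{"account_id", "arn"})
	for _, s := range stmts {
		fmt.Println(s)
	}
}
```

Generating the statements rather than hard-coding them makes it easier to apply the same change across many tables, each with its own constraint name and old key columns.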
tl;dr: Primary Keys that are misconfigured can lead to duplicate rows, or worse, missing rows. Plugin developers therefore prioritize fixes to Primary Keys, but until now this could only be done via a breaking change. After this change, plugin developers can safely fix Primary Keys in a minor release.
If you have any questions or concerns, please reach out to us on our Discord.