Exploring API Data with DuckDB

Engineering

Exploring API Data with DuckDB

•

DuckDB has been making a splash in the data world, and for good reason. It's a fast, in-process SQL OLAP database management system that can be used to explore data in a variety of ways.

Looking at DuckDB's GitHub star history, it's clear that DuckDB has been gaining popularity rapidly:

I'm excited to share that CloudQuery has now added a new DuckDB destination plugin, so you can sync data from any CloudQuery source and explore it using DuckDB. But what is DuckDB? In this post I will introduce DuckDB, show how to sync data to a local DuckDB database, and then explore the data using the DuckDB CLI.

What is DuckDB? #

DuckDB describes itself as an in-process SQL OLAP database management system. That's a bit of a mouthful. DuckDB is perhaps most easily understood as being in the same category as SQLite, but focused on OLAP (online analytical processing) instead of OLTP (online transaction processing). As such, DuckDB is a columnar database, which means that it stores data in columns instead of rows. This makes it fast for aggregations and other analytical queries. Being embedded with a focus on OLAP helps it fill a niche that is not well served by other databases, as illustrated by the diagram below from the 2019 SIGMOD paper by Raasveldt & Mühleisen: DuckDB: an Embeddable Analytical Database:

The fact that it runs in-process means that it can be used in a wide variety of applications: from Jupyter notebooks to web browsers to command-line tools.

Exploring API Data with DuckDB #

By using the CloudQuery DuckDB destination plugin, we can explore data from a whole range of APIs from the comfort of a local DuckDB database. But since DuckDB is an "in-process" database, and CloudQuery usually writes to databases or data lakes, it might not be entirely clear what we mean by a CloudQuery destination plugin for DuckDB. Similar to SQLite, DuckDB databases can also be stored as files, so we can use the CloudQuery duckdb destination plugin to sync data to a local DuckDB database file. Let's try it out!

First, we need a source dataset to sync from. If you're familiar with CloudQuery, you probably already know about the extensive AWS, GCP and Azure plugins that can sync infrastructure data. DuckDB is a great fit for quick ad-hoc exploration of this data as well, but just for fun, today let's sync data from the Homebrew Analytics API and explore it a bit using DuckDB.

First we create a CloudQuery config file called homebrew-duckdb.yml, in which we'll put both the Homebrew source and the DuckDB destination configuration:

kind: source
spec:
  name: 'homebrew'
  path: 'cloudquery/homebrew'
  registry: 'cloudquery'
  version: 'v4.7.27'
  destinations: ['duckdb']
  tables:
    - homebrew_analytics_installs_30d
    - homebrew_analytics_installs_90d
    - homebrew_analytics_installs_365d
---
kind: destination
spec:
  name: 'duckdb'
  path: 'cloudquery/duckdb'
  registry: 'cloudquery'
  version: 'v6.1.2'
  spec:
    connection_string: './homebrew.duckdb'

Now we can run cloudquery sync to sync data from the Homebrew source to a local DuckDB database file (you will need to have CloudQuery installed for this step):

cloudquery sync homebrew-duckdb.yml

This should output something like the following:

cloudquery sync homebrew-duckdb.yml
Loading spec(s) from homebrew-duckdb.yml
Starting migration with 3 tables for: homebrew (v4.7.27) -> [duckdb (v6.1.2)]
Migration completed successfully.
Starting sync for: homebrew (v4.7.27) -> [duckdb (v6.1.2)]
Sync completed successfully. Resources: 30000, Errors: 0, Panics: 0, Time: 10s

Now we have a local DuckDB database file called homebrew.duckdb that contains the data from the Homebrew Analytics API. We can use the DuckDB CLI to explore the data. Check out the DuckDB Installation page for more information on how to install the CLI. Let's open the database:

duckdb homebrew.duckdb

This outputs the following and then waits for input:

v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
D

DuckDB is great for data exploration, so let's explore some data! Let's start by looking at the homebrew_analytics_installs_30d table:

SELECT
    number, formula, count, percent
FROM homebrew_analytics_installs_30d
ORDER BY number ASC
LIMIT 10;

Now we can see the top 10 most installed Homebrew packages (on MacOS) in the last 30 days:

┌────────┬─────────────────┬────────┬─────────┐
│ number │     formula     │ count  │ percent │
│ int64  │     varchar     │ int64  │  float  │
├────────┼─────────────────┼────────┼─────────┤
│      1 │ [email protected]     │ 313469 │    2.37 │
│      2 │ libnghttp2      │ 258853 │    1.96 │
│      3 │ zstd            │ 255983 │    1.94 │
│      4 │ [email protected]     │ 238906 │    1.81 │
│      5 │ ca-certificates │ 219814 │    1.67 │
│      6 │ freetype        │ 207035 │    1.57 │
│      7 │ xz              │ 204572 │    1.55 │
│      8 │ jpeg-turbo      │ 203761 │    1.54 │
│      9 │ libxcb          │ 202785 │    1.54 │
│     10 │ harfbuzz        │ 176521 │    1.34 │
├────────┴─────────────────┴────────┴─────────┤
│ 10 rows                           4 columns │
└─────────────────────────────────────────────┘

Great! This proves it works, but it's a bit boring, in my opinion. What if we wanted to see the fastest-growing Homebrew packages instead?

We can combine the homebrew_analytics_installs_30d and homebrew_analytics_installs_90d tables to find out which packages have gone up in popularity the most. We shouldn't use the count column for this, because Homebrew recently added and publicized a new way to control the analytics that gets sent. This means that the count column, which is an absolute count of installs, is being skewed by more users turning off analytics in the last month. Instead, we will use the number column, which is a relative ranking of popularity. If we do this, we should also make sure we only look at formulae that had a relatively large number of installs to begin with, otherwise a small gain in installs can lead to a large jump in ranking for long-tail formulae. Let's try it out:

SELECT
  a.formula,
  'https://formulae.brew.sh/formula/' || a.formula as url,
  a.number AS number_30d,
  b.number AS number_90d,
  (a.number - b.number) AS change_90d
FROM homebrew_analytics_installs_30d a
JOIN homebrew_analytics_installs_90d b ON a.formula = b.formula
WHERE
    -- only consider formulae that had at least 500 installs in the last 90 days
    b.count > 500
    -- only consider formulae that are not custom taps
    AND NOT contains(a.formula, '/')
    -- ignore specific version installs
    AND NOT contains(a.formula, '@')
ORDER BY change_90d ASC
LIMIT 15;

And the results are:

formula	url	number_30d	number_90d	change_90d
spandsp	https://formulae.brew.sh/formula/spandsp	1256	2347	-1091
aws-console	https://formulae.brew.sh/formula/aws-console	1222	2156	-934
megatools	https://formulae.brew.sh/formula/megatools	1376	2282	-906
librsync	https://formulae.brew.sh/formula/librsync	892	1797	-905
traefik	https://formulae.brew.sh/formula/traefik	1379	2284	-905
noti	https://formulae.brew.sh/formula/noti	1350	2197	-847
solr	https://formulae.brew.sh/formula/solr	1297	2070	-773
checkbashisms	https://formulae.brew.sh/formula/checkbashisms	1485	2233	-748
jd	https://formulae.brew.sh/formula/jd	1115	1792	-677
terraform-ls	https://formulae.brew.sh/formula/terraform-ls	1028	1695	-667
trash-cli	https://formulae.brew.sh/formula/trash-cli	1096	1756	-660
zola	https://formulae.brew.sh/formula/zola	1208	1866	-658
grails	https://formulae.brew.sh/formula/grails	1515	2167	-652
hubble	https://formulae.brew.sh/formula/hubble	1711	2335	-624
libxmp	https://formulae.brew.sh/formula/libxmp	550	1156	-606

To get the output in Markdown format (like I did for this post), you can run .mode markdown before running the query.

There are some interesting tools in that list! I'll be honest, knowing DuckDB's rise in popularity, I was expecting to see it in this list as well. Let's run a quick query to see how it did:

SELECT
  a.formula,
  'https://formulae.brew.sh/formula/' || a.formula as url,
  a.number AS number_30d,
  b.number AS number_90d,
  (a.number - b.number) AS change_90d
FROM homebrew_analytics_installs_30d a
JOIN homebrew_analytics_installs_90d b ON a.formula = b.formula
WHERE
    a.formula = 'duckdb';

formula	url	number_30d	number_90d	change_90d
duckdb	https://formulae.brew.sh/formula/duckdb	1709	1937	-228

Perhaps not in the top 15, but moving up 228 places in ranking is still impressive! It also backs up the meteoric rise in popularity that we observed on the GitHub Stars graph earlier.

If you'd like to explore the most up-to-date data yourself, you can follow the steps above and try out some of your own queries. You can also check out the DuckDB documentation for more information on how to use DuckDB. A word of caution on the Homebrew Analytics API dataset: I suspect that these Homebrew rankings can be easily gamed or accidentally skewed by tools being installed as part of automated processes (like CI). It will also be influenced by the target demographic's likelihood to have disabled analytics. That said, it's still an interesting way to discover up-and-coming new tools; just don't take the numbers here as the final word on the matter.

Conclusion #

In this post, I introduced DuckDB and the new CloudQuery DuckDB destination that allows you to load data from many APIs into a DuckDB-compatible format. I also showed you one specific use case: using DuckDB to query the Homebrew Analytics API and discover new tools that are gaining popularity. From here, we could take the analysis in a million different directions: perhaps by combining it with data from other APIs or datasets? Or graphing the popularity trends? Classifying Homebrew tools with ChatGPT or GPT4 and creating separate rankings by class? I will leave these as exercises for the reader.

Another interesting next step would be to allow CloudQuery to be run as an embedded process in the same way as DuckDB, so that both could be run from the same application, like Jupyter Notebooks. This would certainly make for a powerful combination, and it's something I'd like to explore more. Right now it would be a two-step process: sync to a local DuckDB file first, then load it from the notebook.

I'm looking forward to seeing how CloudQuery and DuckDB get used and what the community comes up with! If you have any questions or feedback, feel free to reach out to us on GitHub or our Community.

Ready to get started with CloudQuery? You can try out CloudQuery locally with our quick start guide or explore the CloudQuery Platform (currently in beta) for a more scalable solution.

Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.

CloudQuery

Exploring API Data with DuckDB

What is DuckDB? #

Exploring API Data with DuckDB #

Conclusion #

Related posts

How to Work with CloudQuery Syncs with Snowflake and Backstage

How to Work with CloudQuery Syncs with Snowflake and Backstage