Scale GitHub Issues semantic search with CloudQuery, BigQuery, and Vertex AI
Backlogs grow fast. Finding the themes behind requests, bugs, and "nice to haves" shouldn't require spelunking. In our previous post, we built semantic issue search with PostgreSQL + pgvector and OpenAI — great if you want to run it yourself and keep everything in your database. This article shows the fully managed path on Google Cloud: BigQuery stores and searches your embeddings, Vertex AI generates them and answers with Gemini. Same outcome; far less ops, elastic scale, and pay‑per‑query simplicity. If you prefer the self‑hosted route, check out the pgvector/OpenAI version here: Improve GitHub Issues search with CloudQuery, PgVector and OpenAI.
What we’ll build #
- Sync: CloudQuery pulls open issues from the `cloudquery/cloudquery` GitHub repository into BigQuery.
- Embed: As part of the sync, BigQuery calls Vertex AI through a remote model to create text embeddings and stores them, together with the text chunks, in a separate table.
- Ask: A small Python script takes a question, embeds it, searches the stored embeddings with `VECTOR_SEARCH` in BigQuery, and has Gemini write the answer based on the retrieved snippets.
Prerequisites #
- CloudQuery CLI: Install it from cloudquery.io. A `CLOUDQUERY_API_KEY` is required if you're above free-tier usage.
- GitHub Access Token: a PAT with repository read access for fetching issues.
- GCP project & dataset: We'll call the project `ai-playground`. A BigQuery dataset named `github_issues` also needs to be created.
- APIs enabled in the project:
  - BigQuery API
  - BigQuery Connection API
  - Vertex AI API
- IAM permissions:
  - For you (the authenticated user) within the project:
    - roles/bigquery.admin
    - roles/resourcemanager.projectIamAdmin
  - For the BigQuery connection's service account (we'll create this in Step 1): the Vertex AI User role
See also: BigQuery ML remote models (reference), remote model tutorial (guide), text embeddings overview (guide), and embeddings API reference (guide).
Step 1 — Create a BigQuery remote model to Vertex AI #
It’s a bit of setup, but once you get here the rest is trivial. We need a BigQuery dataset containing a remote model that points at Vertex AI’s embeddings endpoint.
- Open your GCP project
- Create a BigQuery dataset (or pick an existing one, e.g., `github_issues`)
- Fill out the dataset form (choose location and defaults)
- Enable the required APIs
- In BigQuery, click "+ Add data" to start creating a connection
- Add a Vertex AI connection (Business Applications → Vertex AI Models: BigQuery Federation)
- Grant the generated service account the Vertex AI User role in IAM
- Create the remote model in your dataset (name it `textembedding`)
Note that the project ID is not the project name (which is `ai-playground` in this case). GCP picked `triple-shift-469512-k6` as the ID for this project.
CREATE OR REPLACE MODEL `triple-shift-469512-k6.github_issues.textembedding`
REMOTE WITH CONNECTION `projects/triple-shift-469512-k6/locations/us/connections/bqml_connection`
OPTIONS (
ENDPOINT = 'gemini-embedding-001'
);
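To sanity-check the remote model before wiring up CloudQuery, you can call it straight from SQL. A minimal sketch; the sample text is arbitrary, and the `ml_generate_embedding_result` column name comes from BigQuery's `ML.GENERATE_EMBEDDING` output:

```sql
-- Quick smoke test: embed a single string through the remote model
SELECT ml_generate_embedding_result
FROM ML.GENERATE_EMBEDDING(
  MODEL `triple-shift-469512-k6.github_issues.textembedding`,
  (SELECT 'hello embeddings' AS content)
);
```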
If this succeeds, you're ready to generate embeddings directly from SQL. Note that the model name (i.e. `textembedding`) is a friendly name for the remote model in your BigQuery dataset, not the name of the model in Vertex AI.
Step 2 — CloudQuery config #
Create `github_to_bigquery.yaml` with the following content:
kind: source
spec:
  name: github
  path: cloudquery/github
  registry: cloudquery
  version: "v14.1.1"
  tables: ["github_issues"]
  destinations: ["bigquery"]
  spec:
    access_token: "${GITHUB_TOKEN}" # Personal Access Token, required if not using App Authentication.
    repos: ["cloudquery/cloudquery"]
    table_options:
      github_issues:
        state: "open" # one of "open", "all", "closed"; defaults to "all"
---
kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v4.3.9"
  write_mode: "append"
  spec:
    project_id: ai-playground
    dataset_id: github_issues
    text_embeddings:
      tables:
        - source_table_name: github_issues
          target_table_name: github_issues_embeddings
          embed_columns: ["title", "body"] # list of columns to embed
          text_splitter:
            recursive_text: # default text splitter
              chunk_size: 1000
              chunk_overlap: 500
      remote_model_name: "textembedding"
Notes:
- `GITHUB_TOKEN` and any other environment variables are expanded by CloudQuery when the sync runs.
- Your environment must be authenticated against the GCP project; instructions can be found in the BigQuery destination plugin docs.
Run the sync:
cloudquery sync github_to_bigquery.yaml
Example successful output:
$ cloudquery sync github_to_bigquery.yaml
Loading spec(s) from github_to_bigquery.yaml
Starting sync for: github (cloudquery/github@v14.1.1) -> [bigquery (cloudquery/bigquery@v4.3.9)]
Sync completed successfully. Resources: 157, Errors: 0, Warnings: 0, Time: 31s
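Once the sync completes, you can spot-check the embeddings table straight in BigQuery. A small sketch; the `chunk_id`, `chunk_text`, and `embedding` columns are the ones the Python script below relies on:

```sql
-- Peek at a few chunks and confirm embeddings were generated
SELECT chunk_id, chunk_text, ARRAY_LENGTH(embedding) AS embedding_dimensions
FROM `triple-shift-469512-k6.github_issues.github_issues_embeddings`
LIMIT 5;
```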
Step 3 — Ask questions (RAG) with a tiny Python script #
The script below embeds your query via `ML.GENERATE_EMBEDDING`, searches your embeddings table with `VECTOR_SEARCH` (cosine distance), and asks Gemini to answer using the retrieved chunks.
from google import genai
from google.cloud import bigquery
import sys
import os
from dotenv import load_dotenv

load_dotenv()

PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")
DATASET_ID = os.getenv("DATASET_ID")
TARGET_TABLE_ID = os.getenv("TARGET_TABLE_ID", "github_issues_embeddings")
REMOTE_MODEL_NAME = os.getenv("REMOTE_MODEL_NAME", "textembedding")
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
TOP_K = int(os.getenv("TOP_K", "5"))


def generate_query_embedding(query_text):
    """Generate embedding for the search query using the remote model."""
    embedding_query = f"""
    SELECT
      ml_generate_embedding_result AS embedding
    FROM
      ML.GENERATE_EMBEDDING(
        MODEL `{PROJECT_ID}.{DATASET_ID}.{REMOTE_MODEL_NAME}`,
        (
          SELECT '{query_text}' AS content, 'query' AS title
        )
      )
    """
    client = bigquery.Client()
    query_job = client.query(embedding_query)
    results = query_job.result()
    for row in results:
        return row.embedding
    raise ValueError("No embedding generated for query")


def search_documents(query_text, top_k=None):
    """Search for relevant documents using VECTOR_SEARCH with cosine distance."""
    if top_k is None:
        top_k = TOP_K
    client = bigquery.Client()
    query_embedding = generate_query_embedding(query_text)
    search_query = f"""
    SELECT
      base.chunk_id AS chunk_id,
      base.chunk_text AS chunk_text
    FROM VECTOR_SEARCH(
      TABLE `{PROJECT_ID}.{DATASET_ID}.{TARGET_TABLE_ID}`,
      'embedding',
      (
        SELECT ARRAY<FLOAT64>{query_embedding} AS embedding
      ),
      'embedding',
      top_k => {top_k},
      distance_type => 'COSINE'
    )
    """
    query_job = client.query(search_query)
    results = query_job.result()
    return [
        {
            'chunk_id': row.chunk_id,
            'chunk_text': row.chunk_text,
        }
        for row in results
    ]


def generate_response(query_text, search_results):
    """Generate a response using Gemini based on search results."""
    if not search_results:
        return "I couldn't find any relevant information to answer your question."
    # vertexai=True routes the google-genai client through Vertex AI in this project/location.
    client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
    context = "\n\n".join([
        f"(chunk {result['chunk_id']}): {result['chunk_text']}"
        for result in search_results
    ])
    prompt = f"""
    Based on the following context, please answer the user's question.
    If the context doesn't contain enough information to answer the question,
    please say so clearly.
    Context:
    {context}
    User Question: {query_text}
    Please provide a comprehensive answer based on the context provided.
    """
    response = client.models.generate_content(model=GEMINI_MODEL, contents=prompt)
    return response.text


def main():
    if len(sys.argv) < 2:
        print('Usage: python search.py "<question>"')
        sys.exit(1)
    query = sys.argv[1]
    print(generate_response(query, search_documents(query)))


if __name__ == "__main__":
    main()
Let's test it to see how our BigQuery plugin is doing!
$ python search.py "Summarise any feature requests for the BigQuery destination plugin"
Based on the provided context, here are the feature requests for the BigQuery destination plugin:
1. **Add overwrite or overwrite-delete-stale write modes:**
* **Problem:** Currently lacks support for incremental syncs in the BigQuery destination plugin.
* **Solution:** Introduce `write` or `overwrite-delete-stale` write modes to enable incremental data synchronization.
2. **Full CDC (Change Data Capture) support for PostgreSQL -> BigQuery:**
* **Problem:** Existing CDC support from PostgreSQL to BigQuery is partial, specifically missing handling for `DELETE` operations, `UPDATE` operations, and DML (Data Manipulation Language) for creating tables. There's also a need for benchmarking and stress testing to ensure BigQuery limits are not hit under high load.
* **Solution:** Enhance CDC to fully support `DELETE`, `UPDATE`, and DML operations, and ensure performance under high load, potentially by exploring solutions like Datastream or Debezium.
3. **Support time types in BigQuery destination:**
* **Problem:** Inserting Arrow `time64_ns` types into BigQuery results in an error on Linux.
* **Solution:** Enable proper support for time types, specifically `time64_ns`, to be inserted into BigQuery without errors.
Err... but at least we have text-embeddings support now! :|
Notes, gotchas, and tips #
- Project ID: The name that appears in the GCP console is NOT the project ID, but the project name.
- Remote model name: The remote model name (e.g. textembedding) is the friendly name of the remote model in BigQuery, not the name of the underlying model in Vertex AI. You create the remote model by specifying the connection to Vertex AI and giving it that name.
- Regions must match: Dataset, connection, remote model, and Vertex AI location should align.
- Cost and quotas: Embeddings and vector search consume resources. Monitor usage in GCP.
- Why BigQuery + Vertex AI? At scale, embedding and search directly in BigQuery lets you rely on Google’s infra for parallelism and resource management rather than operating your own vector database. In theory it should be cheaper, faster, and simpler to manage.
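At larger scale you can also create a vector index so that VECTOR_SEARCH switches from brute force to approximate nearest-neighbour search. A sketch, assuming the table and column names from the config above; the index name is arbitrary, and BigQuery only builds vector indexes once the table has enough rows:

```sql
-- Optional: index the embeddings column for approximate nearest-neighbour search
CREATE OR REPLACE VECTOR INDEX issues_embedding_idx
ON `triple-shift-469512-k6.github_issues.github_issues_embeddings`(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');
```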
Where to go next #
- GitHub source plugin docs: hub.cloudquery.io
- BigQuery destination plugin docs: hub.cloudquery.io
- Text embeddings: how-to, API reference
- Vector search in BigQuery: overview
With a single config and a small script, you get high-quality semantic search over GitHub issues—backed by BigQuery and Vertex AI.
No clusters to size, no indices to tune, and no vector database to run. You keep governance and cost controls in BigQuery, while Vertex AI handles embeddings and answers.
Start with issues today, then join them with product analytics, support tickets, or docs you already have in BigQuery to power calmer triage, faster prioritization, and sharper roadmaps. If your team runs on Google Cloud, this is the most frictionless way to add RAG to your workflow: sync with CloudQuery, search with BigQuery, and answer with Gemini.
Ready to turn your backlog into insight? Install the CLI, run the sync, and start asking smarter questions.