Scale GitHub Issues semantic search with CloudQuery, BigQuery, and Vertex AI
Backlogs grow fast. Finding the themes behind requests, bugs, and "nice to haves" shouldn't require spelunking. In our previous post, we built semantic issue search with PostgreSQL + pgvector and OpenAI — great if you want to run it yourself and keep everything in your database. This article shows the fully managed path on Google Cloud: BigQuery stores and searches your embeddings, Vertex AI generates them and answers with Gemini. Same outcome; far less ops, elastic scale, and pay‑per‑query simplicity. If you prefer the self‑hosted route, check out the pgvector/OpenAI version here: Improve GitHub Issues search with CloudQuery, PgVector and OpenAI.
What we’ll build #
- Sync: CloudQuery pulls open issues from the `cloudquery/cloudquery` GitHub repository into BigQuery.
- Embed: As part of the sync, BigQuery calls Vertex AI through a remote model to create text embeddings and stores them, together with the text chunks, in a separate table.
- Ask: A small Python script takes a question, embeds it, searches the stored embeddings with `VECTOR_SEARCH` in BigQuery, and has Gemini write the answer based on the retrieved snippets.
Prerequisites #
- CloudQuery CLI: Install it from cloudquery.io. A `CLOUDQUERY_API_KEY` is required if you're above free-tier usage.
- GitHub Access Token: a PAT with repository read access for fetching issues.
- GCP project & dataset: We'll call the project `ai-playground`. A BigQuery dataset named `github_issues` also needs to be created.
- APIs enabled in the project:
  - BigQuery API
  - BigQuery Connection API
  - Vertex AI API
- IAM permissions:
  - For you (the authenticated user) within the project:
    - roles/bigquery.admin
    - roles/resourcemanager.projectIamAdmin
  - For the BigQuery connection's service account (we'll create this in Step 1): the Vertex AI User role
See also: BigQuery ML remote models (reference), remote model tutorial (guide), text embeddings overview (guide), and embeddings API reference (guide).
Step 1 — Create a BigQuery remote model to Vertex AI #
It’s a bit of setup, but once you get here the rest is trivial. We need a BigQuery dataset containing a remote model that points at Vertex AI’s embeddings endpoint.
- Open your GCP project
- Create a BigQuery dataset (or pick an existing one, e.g., `github_issues`)
- Fill out the dataset form (choose location and defaults)
- Enable the required APIs
- In BigQuery, click "+ Add data" to start creating a connection
- Add a Vertex AI connection (Business Applications → Vertex AI Models: BigQuery Federation)
- Grant the generated service account the Vertex AI User role in IAM
- Create the remote model in your dataset (name it `textembedding`)
Note that the project ID is not the project name (which is `ai-playground` in this case). GCP picked `triple-shift-469512-k6` as the ID for this project.
CREATE OR REPLACE MODEL `triple-shift-469512-k6.github_issues.textembedding`
REMOTE WITH CONNECTION `projects/triple-shift-469512-k6/locations/us/connections/bqml_connection`
OPTIONS (
ENDPOINT = 'gemini-embedding-001'
);
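To sanity-check the remote model before wiring up CloudQuery, you can call it straight from SQL. A minimal sketch; the sample text is arbitrary, and the `ml_generate_embedding_result` column name comes from BigQuery's `ML.GENERATE_EMBEDDING` output:

```sql
-- Quick smoke test: embed a single string through the remote model
SELECT ml_generate_embedding_result
FROM ML.GENERATE_EMBEDDING(
  MODEL `triple-shift-469512-k6.github_issues.textembedding`,
  (SELECT 'hello embeddings' AS content)
);
```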
If this succeeds, you're ready to generate embeddings directly from SQL. Note that the model name (i.e. `textembedding`) is a friendly name for the remote model in your BigQuery dataset, not the name of the model in Vertex AI.
Step 2 — CloudQuery config #
Create `github_to_bigquery.yaml` with the following content:
kind: source
spec:
  name: github
  path: cloudquery/github
  registry: cloudquery
  version: "v14.1.1"
  tables: ["github_issues"]
  destinations: ["bigquery"]
  spec:
    access_token: "${GITHUB_TOKEN}" # Personal Access Token, required if not using App Authentication.
    repos: ["cloudquery/cloudquery"]
    table_options:
      github_issues:
        state: "open" # one of "open", "all", "closed"; defaults to "all"
---
kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v4.3.9"
  write_mode: "append"
  spec:
    project_id: ai-playground
    dataset_id: github_issues
    text_embeddings:
      tables:
        - source_table_name: github_issues
          target_table_name: github_issues_embeddings
          embed_columns: ["title", "body"] # list of columns to embed
          text_splitter:
            recursive_text: # default text splitter
              chunk_size: 1000
              chunk_overlap: 500
      remote_model_name: "textembedding"
Notes:
- `GITHUB_TOKEN` and any other environment variables are expanded by CloudQuery when the sync runs.
- Your environment must be authenticated against the GCP project; instructions can be found in the BigQuery destination plugin docs.
Run the sync:
cloudquery sync github_to_bigquery.yaml
Example successful output:
$ cloudquery sync github_to_bigquery.yaml
Loading spec(s) from github_to_bigquery.yaml
Starting sync for: github (cloudquery/github@v14.1.1) -> [bigquery (cloudquery/bigquery@v4.3.9)]
Sync completed successfully. Resources: 157, Errors: 0, Warnings: 0, Time: 31s
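Once the sync completes, you can spot-check the embeddings table straight in BigQuery. A small sketch; the `chunk_id`, `chunk_text`, and `embedding` columns are the ones the Python script below relies on:

```sql
-- Peek at a few chunks and confirm embeddings were generated
SELECT chunk_id, chunk_text, ARRAY_LENGTH(embedding) AS embedding_dimensions
FROM `triple-shift-469512-k6.github_issues.github_issues_embeddings`
LIMIT 5;
```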
Step 3 — Ask questions (RAG) with a tiny Python script #
The script below embeds your query via `ML.GENERATE_EMBEDDING`, searches your embeddings table with `VECTOR_SEARCH` (cosine distance), and asks Gemini to answer using the retrieved chunks.
from google import genai
from google.cloud import bigquery
import sys
import os
from dotenv import load_dotenv

load_dotenv()

PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")
DATASET_ID = os.getenv("DATASET_ID")
TARGET_TABLE_ID = os.getenv("TARGET_TABLE_ID", "github_issues_embeddings")
REMOTE_MODEL_NAME = os.getenv("REMOTE_MODEL_NAME", "textembedding")
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
TOP_K = int(os.getenv("TOP_K", "5"))


def generate_query_embedding(query_text):
    """Generate embedding for the search query using the remote model."""
    embedding_query = f"""
    SELECT
      ml_generate_embedding_result AS embedding
    FROM
      ML.GENERATE_EMBEDDING(
        MODEL `{PROJECT_ID}.{DATASET_ID}.{REMOTE_MODEL_NAME}`,
        (
          SELECT '{query_text}' AS content, 'query' AS title
        )
      )
    """
    client = bigquery.Client()
    query_job = client.query(embedding_query)
    results = query_job.result()
    for row in results:
        return row.embedding
    raise ValueError("No embedding generated for query")


def search_documents(query_text, top_k=None):
    """Search for relevant documents using VECTOR_SEARCH with cosine distance."""
    if top_k is None:
        top_k = TOP_K
    client = bigquery.Client()
    query_embedding = generate_query_embedding(query_text)
    search_query = f"""
    SELECT
      base.chunk_id AS chunk_id,
      base.chunk_text AS chunk_text
    FROM VECTOR_SEARCH(
      TABLE `{PROJECT_ID}.{DATASET_ID}.{TARGET_TABLE_ID}`,
      'embedding',
      (
        SELECT ARRAY<FLOAT64>{query_embedding} AS embedding
      ),
      'embedding',
      top_k => {top_k},
      distance_type => 'COSINE'
    )
    """
    query_job = client.query(search_query)
    results = query_job.result()
    return [
        {
            'chunk_id': row.chunk_id,
            'chunk_text': row.chunk_text,
        }
        for row in results
    ]


def generate_response(query_text, search_results):
    """Generate a response using Gemini based on search results."""
    if not search_results:
        return "I couldn't find any relevant information to answer your question."
    # vertexai=True routes the google-genai client through Vertex AI in this project/location.
    client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
    context = "\n\n".join([
        f"(chunk {result['chunk_id']}): {result['chunk_text']}"
        for result in search_results
    ])
    prompt = f"""
    Based on the following context, please answer the user's question.
    If the context doesn't contain enough information to answer the question,
    please say so clearly.
    Context:
    {context}
    User Question: {query_text}
    Please provide a comprehensive answer based on the context provided.
    """
    response = client.models.generate_content(model=GEMINI_MODEL, contents=prompt)
    return response.text


def main():
    if len(sys.argv) < 2:
        print('Usage: python search.py "<question>"')
        sys.exit(1)
    query = sys.argv[1]
    print(generate_response(query, search_documents(query)))


if __name__ == "__main__":
    main()
Let's test it to see how our BigQuery plugin is doing!
$ python search.py "Summarise any feature requests for the BigQuery destination plugin"
Based on the provided context, here are the feature requests for the BigQuery destination plugin:
1. **Add overwrite or overwrite-delete-stale write modes:**
* **Problem:** Currently lacks support for incremental syncs in the BigQuery destination plugin.
* **Solution:** Introduce `write` or `overwrite-delete-stale` write modes to enable incremental data synchronization.
2. **Full CDC (Change Data Capture) support for PostgreSQL -> BigQuery:**
* **Problem:** Existing CDC support from PostgreSQL to BigQuery is partial, specifically missing handling for `DELETE` operations, `UPDATE` operations, and DML (Data Manipulation Language) for creating tables. There's also a need for benchmarking and stress testing to ensure BigQuery limits are not hit under high load.
* **Solution:** Enhance CDC to fully support `DELETE`, `UPDATE`, and DML operations, and ensure performance under high load, potentially by exploring solutions like Datastream or Debezium.
3. **Support time types in BigQuery destination:**
* **Problem:** Inserting Arrow `time64_ns` types into BigQuery results in an error on Linux.
* **Solution:** Enable proper support for time types, specifically `time64_ns`, to be inserted into BigQuery without errors.
Err... but at least we have text-embeddings support now! :|
Notes, gotchas, and tips #
- Project ID: The name that appears in the GCP console is NOT the project ID, but the project name.
- Remote model name: The remote model name (e.g. textembedding) is the friendly name of the remote model in BigQuery, not the name of the underlying model in Vertex AI. You create the remote model by specifying the connection to Vertex AI and giving it that name.
- Regions must match: Dataset, connection, remote model, and Vertex AI location should align.
- Cost and quotas: Embeddings and vector search consume resources. Monitor usage in GCP.
- Why BigQuery + Vertex AI? At scale, embedding and search directly in BigQuery lets you rely on Google’s infra for parallelism and resource management rather than operating your own vector database. In theory it should be cheaper, faster, and simpler to manage.
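At larger scale you can also create a vector index so that VECTOR_SEARCH switches from brute force to approximate nearest-neighbour search. A sketch, assuming the table and column names from the config above; the index name is arbitrary, and BigQuery only builds vector indexes once the table has enough rows:

```sql
-- Optional: index the embeddings column for approximate nearest-neighbour search
CREATE OR REPLACE VECTOR INDEX issues_embedding_idx
ON `triple-shift-469512-k6.github_issues.github_issues_embeddings`(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');
```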
Where to go next #
- GitHub source plugin docs: hub.cloudquery.io
- BigQuery destination plugin docs: hub.cloudquery.io
- Text embeddings: how-to, API reference
- Vector search in BigQuery: overview
With a single config and a small script, you get high-quality semantic search over GitHub issues—backed by BigQuery and Vertex AI.
No clusters to size, no indices to tune, and no vector database to run. You keep governance and cost controls in BigQuery, while Vertex AI handles embeddings and answers.
Start with issues today, then join them with product analytics, support tickets, or docs you already have in BigQuery to power calmer triage, faster prioritization, and sharper roadmaps. If your team runs on Google Cloud, this is the most frictionless way to add RAG to your workflow: sync with CloudQuery, search with BigQuery, and answer with Gemini.
Ready to turn your backlog into insight? Install the CLI, run the sync, and start asking smarter questions.