Improve GitHub Issues search effortlessly with CloudQuery, pgvector, and OpenAI
Shipping fast means making sense of a lot of noise. If your team lives in GitHub Issues, you know how hard it is to spot the themes behind requests, bugs, and "would be nice" ideas. With CloudQuery, PostgreSQL + pgvector, and a tiny OpenAI-powered script, you can turn that backlog into a searchable, conversational knowledge base—so you get answers, not just results.
Here’s a lightweight setup you can run locally in minutes.
Why do this? #
Native issue search is great for simple keyword matches, but it struggles with synonyms, context, and multi-sentence questions. By storing text embeddings in PostgreSQL with pgvector, we can perform semantic search over GitHub issues and ask questions like “What are users asking for around Azure?” — and get meaningful, contextual answers.
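To make the contrast concrete, here's a sketch of both query styles against the tables we'll build below (`:query_embedding` is a placeholder for an embedded question vector, so this isn't runnable as-is):

```sql
-- Keyword search: matches only literal text
SELECT number, title
FROM github_issues
WHERE title ILIKE '%azure%' OR body ILIKE '%azure%';

-- Semantic search: ranks chunks by vector distance to the embedded question,
-- so related phrasing surfaces even when the exact keyword never appears
SELECT chunk
FROM github_issues_embeddings
ORDER BY embedding <=> :query_embedding  -- pgvector cosine distance
LIMIT 5;
```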
What we’ll build #
- Sync: CloudQuery pulls open issues from `cloudquery/cloudquery` into PostgreSQL.
- Embed: The PostgreSQL destination's pgvector config splits issue content into chunks and stores embeddings.
- Ask: A small Python script embeds your question, uses pgvector similarity to fetch the most relevant issue chunks, and calls OpenAI to answer concisely.
Prerequisites #
- CloudQuery CLI: Download CloudQuery and install.
- PostgreSQL with pgvector: PostgreSQL 14+ (15+ recommended) with the `pgvector` extension installed (CloudQuery automatically enables it for your database).
- Python 3.10+ with the `openai` and `psycopg2-binary` packages.
- OpenAI API key with access to `text-embedding-3-small` and a chat model (e.g., `gpt-4o-mini`).
- GitHub token with repository read access (a Personal Access Token is sufficient).
Export the environment variables (the YAML below will expand them automatically):
```bash
export GITHUB_TOKEN="ghp_..."   # your GitHub PAT
export OPENAI_API_KEY="sk-..."  # your OpenAI API key
export POSTGRES_CONNECTION_STRING="host=localhost dbname=github user=postgres password=..."
# optional overrides
export EMBEDDING_MODEL="text-embedding-3-small"
export COMPLETION_MODEL="gpt-4o-mini"
```
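Before syncing, you can optionally confirm the database side is ready (assuming `psql` is installed and the connection string above is valid):

```sql
-- Run inside: psql "$POSTGRES_CONNECTION_STRING"

-- Is pgvector available to this server?
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'vector';

-- Enable it if needed (CloudQuery will also do this for you)
CREATE EXTENSION IF NOT EXISTS vector;
```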
CloudQuery config #
Create `github_to_postgresql.yaml` with the following content:

```yaml
kind: source
spec:
  name: github
  path: cloudquery/github
  registry: cloudquery
  version: "v14.1.0"
  tables: ["github_issues"]
  destinations: ["postgresql"]
  spec:
    access_token: "${GITHUB_TOKEN}" # Personal Access Token, required if not using App Authentication.
    repos: ["cloudquery/cloudquery"]
    table_options:
      github_issues:
        state: "open" # one of "open", "closed", or "all"; defaults to "all"
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.9.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: ${POSTGRES_CONNECTION_STRING}
    pgvector_config:
      tables:
        - source_table_name: github_issues
          target_table_name: github_issues_embeddings
          embed_columns: ["title", "body"] # list of columns to embed
          # metadata_columns: [] # list of columns to copy from the source table as metadata
          text_splitter:
            recursive_text: # default text splitter
              chunk_size: 1000
              chunk_overlap: 500
      openai_embedding:
        api_key: ${OPENAI_API_KEY}
        model_name: text-embedding-3-small # the model to use for embedding
        dimensions: 1536 # the number of dimensions the embedding model generates
```
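The `chunk_size` / `chunk_overlap` pair controls how issue text is split before embedding: each chunk holds up to 1,000 characters, and consecutive chunks share 500, so a sentence that straddles a boundary still appears whole in at least one chunk. Here's a simplified sketch of the idea (a plain sliding window; CloudQuery's actual recursive splitter also prefers to break on paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 500):
    """Split text into overlapping fixed-size windows."""
    step = chunk_size - chunk_overlap  # each chunk starts 500 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

body = "x" * 2500  # stand-in for a 2,500-character issue body
print([len(c) for c in chunk_text(body)])  # [1000, 1000, 1000, 1000]
```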
Run the sync:
```bash
cloudquery sync github_to_postgresql.yaml
```
CloudQuery will create and populate `github_issues`, and will also generate chunks and embeddings into `github_issues_embeddings` using pgvector.

Quick sanity checks:
```sql
-- Latest issues
SELECT number, title, state, updated_at
FROM github_issues
ORDER BY updated_at DESC
LIMIT 5;

-- Embeddings: confirm chunking and vector dimensions
SELECT LEFT(chunk, 120) AS chunk_preview,
       vector_dims(embedding) AS dims
FROM github_issues_embeddings
LIMIT 3;
```
Ask questions with a tiny Python script #
Set up a virtual environment (pyenv or venv both work). Here’s a minimal venv flow:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install openai psycopg2-binary
```
Create `main.py`:

```python
import os
import sys

import psycopg2
from openai import OpenAI

# ----------------------------
# CONFIG
# ----------------------------
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "sk-proj-.....")
POSTGRES_CONNECTION_STRING = os.getenv(
    "POSTGRES_CONNECTION_STRING",
    "host=localhost dbname=github user=postgres password=...",
)
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
COMPLETION_MODEL = os.getenv("COMPLETION_MODEL", "gpt-4o-mini")
CONTEXT_CHUNK_COUNT = 5
# ----------------------------

client = OpenAI(api_key=OPENAI_API_KEY)


def get_query_embedding(query: str):
    resp = client.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    )
    return resp.data[0].embedding


def get_most_similar_chunks(query_embedding, limit=5):
    """Searches the pgvector table for the most similar chunks."""
    conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
    cur = conn.cursor()
    # <=> is pgvector's cosine distance operator
    # (it matches the vector_cosine_ops index suggested below)
    sql = """
        SELECT chunk
        FROM github_issues_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Convert Python list to pgvector literal string: "[v1,v2,...]"
    embedding_literal = "[" + ",".join(str(v) for v in query_embedding) + "]"
    cur.execute(sql, (embedding_literal, limit))
    results = [r[0] for r in cur.fetchall()]
    cur.close()
    conn.close()
    return results


def answer_question(query):
    # Step 1: Embed the query
    query_emb = get_query_embedding(query)
    # Step 2: Retrieve top-k similar chunks
    chunks = get_most_similar_chunks(query_emb, limit=CONTEXT_CHUNK_COUNT)
    context = "\n\n".join(chunks)
    # Step 3: Ask the chat model with the retrieved context
    prompt = f"""
You are an assistant. Use the following content to answer the question.

Content:
{context}

Question:
{query}

Answer concisely:
"""
    resp = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


# ----------------------------
# USAGE
# ----------------------------
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <question>")
        print("Example: python main.py 'What are people requesting for Azure support?'")
        sys.exit(1)
    question = sys.argv[1]
    answer = answer_question(question)
    print(answer)
```
Run it:
```bash
./.venv/bin/python main.py "What are people requesting for Azure support?"
```
This was the result the OpenAI model gave us for our public CloudQuery repository's GitHub issues:
People are requesting support for the following Azure features:
1. Programmatic management of Azure Service Quotas through Azure Quota Rest APIs.
2. Update of CQ's policy support to align with the latest CIS Microsoft Azure Foundations Benchmark v1.5.0.
3. Addition of Defender endpoints data to the Azure source plugin for detecting security vulnerabilities in machines.
4. Correct documentation for the `cloud_name` options in CloudQuery configuration, specifically for AzurePublic, AzureGovernment, and AzureChina.
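A natural extension is to have answers cite their source issues. The config above left `metadata_columns` commented out; if you set it to, say, `["number", "title"]`, those source columns should be copied onto each chunk row (my reading of the config comment, worth verifying against the plugin docs), and retrieval can return them for attribution. A hypothetical variant of `get_most_similar_chunks`:

```python
def get_chunks_with_sources(query_embedding, limit=5):
    """Hypothetical variant: assumes `number` and `title` were copied into
    github_issues_embeddings via metadata_columns in the YAML above."""
    conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
    cur = conn.cursor()
    sql = """
        SELECT chunk, number, title
        FROM github_issues_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    embedding_literal = "[" + ",".join(str(v) for v in query_embedding) + "]"
    cur.execute(sql, (embedding_literal, limit))
    rows = cur.fetchall()
    cur.close()
    conn.close()
    # Label each chunk so the model can cite issue numbers in its answer
    return [f"Issue #{number} ({title}):\n{chunk}" for chunk, number, title in rows]
```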
Notes, gotchas, and tips #
- Enable pgvector: If you see `type "vector" does not exist`, run `CREATE EXTENSION vector;` in your database.
- Model dimensions must match: `text-embedding-3-small` produces 1536 dimensions; keep the YAML `dimensions` in sync with the model you use.
- Cost awareness: Embeddings incur token costs. Limit synced tables/columns or adjust chunk sizes if needed.
- Freshness: Re-run `cloudquery sync` to refresh issues and regenerate chunks when content changes.
- Indexes: For larger datasets, consider adding an index to speed up retrieval (tuning notes follow the snippet):
```sql
CREATE INDEX IF NOT EXISTS idx_github_issues_embeddings_embedding
ON github_issues_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
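`ivfflat` is an approximate index: it clusters vectors into `lists` partitions and scans only a few of them at query time, so recall depends on how many partitions get probed. Two follow-ups worth knowing about (both are standard pgvector features; check the pgvector docs for your installed version). Note that `vector_cosine_ops` pairs with the `<=>` operator used in the script, which is what lets the planner use this index:

```sql
-- Probe more partitions per query to trade latency for recall (default: 1)
SET ivfflat.probes = 10;

-- pgvector 0.5.0+ also supports HNSW, which needs no training step and
-- usually offers a better speed/recall trade-off at this scale
CREATE INDEX IF NOT EXISTS idx_github_issues_embeddings_hnsw
ON github_issues_embeddings
USING hnsw (embedding vector_cosine_ops);
```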
Where to go next #
- GitHub source plugin docs: hub.cloudquery.io
- PostgreSQL destination plugin docs: hub.cloudquery.io
- Download the CloudQuery CLI: cloudquery.io
- Our public GitHub repository: github.com/cloudquery/cloudquery
With a single config and a small script, you get high-quality semantic search over GitHub issues—right inside your own PostgreSQL instance.
CloudQuery makes your GitHub data portable and intelligent: one config, a fast sync, and embeddings managed for you. Whether you’re sorting through issues, prioritizing roadmap items, or answering customer questions, semantic search over your own issues turns noise into signal.
Ready to level up your issues workflow? Install the CLI, run the sync, and start asking questions.