Improve GitHub Issues search effortlessly with CloudQuery, pgvector, and OpenAI
Shipping fast means making sense of a lot of noise. If your team lives in GitHub Issues, you know how hard it is to spot the themes behind requests, bugs, and "would be nice" ideas. With CloudQuery, PostgreSQL + pgvector, and a tiny OpenAI-powered script, you can turn that backlog into a searchable, conversational knowledge base—so you get answers, not just results.
Here’s a lightweight setup you can run locally in minutes.
Why do this? #
Native issue search is great for simple keyword matches, but it struggles with synonyms, context, and multi-sentence questions. By storing text embeddings in PostgreSQL with pgvector, we can perform semantic search over GitHub issues and ask questions like “What are users asking for around Azure?” — and get meaningful, contextual answers.
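To make the contrast concrete, here's a sketch of both query styles against the tables we'll build below (`:query_embedding` is a placeholder for an embedded question vector, so this isn't runnable as-is):

```sql
-- Keyword search: matches only literal text
SELECT number, title
FROM github_issues
WHERE title ILIKE '%azure%' OR body ILIKE '%azure%';

-- Semantic search: ranks chunks by vector distance to the embedded question,
-- so related phrasing surfaces even when the exact keyword never appears
SELECT chunk
FROM github_issues_embeddings
ORDER BY embedding <=> :query_embedding  -- pgvector cosine distance
LIMIT 5;
```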
What we’ll build #
- Sync: CloudQuery pulls open issues from `cloudquery/cloudquery` into PostgreSQL.
- Embed: The PostgreSQL destination's pgvector config splits issue content into chunks and stores embeddings.
- Ask: A small Python script embeds your question, uses pgvector similarity to fetch the most relevant issue chunks, and calls OpenAI to answer concisely.
Prerequisites #
- CloudQuery CLI: Download CloudQuery and install.
- PostgreSQL with pgvector: PostgreSQL 14+ (15+ recommended) with the `pgvector` extension installed (CloudQuery automatically enables it for your database).
- Python 3.10+ with the `openai` and `psycopg2-binary` packages.
- OpenAI API key with access to `text-embedding-3-small` and a chat model (e.g., `gpt-4o-mini`).
- GitHub token with repository read access (a Personal Access Token is sufficient).
Export the environment variables (the YAML below will expand them automatically):
```bash
export GITHUB_TOKEN="ghp_..."   # your GitHub PAT
export OPENAI_API_KEY="sk-..."  # your OpenAI API key
export POSTGRES_CONNECTION_STRING="host=localhost dbname=github user=postgres password=..."
# optional overrides
export EMBEDDING_MODEL="text-embedding-3-small"
export COMPLETION_MODEL="gpt-4o-mini"
```
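Before syncing, you can optionally confirm the database side is ready (assuming `psql` is installed and the connection string above is valid):

```sql
-- Run inside: psql "$POSTGRES_CONNECTION_STRING"

-- Is pgvector available to this server?
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'vector';

-- Enable it if needed (CloudQuery will also do this for you)
CREATE EXTENSION IF NOT EXISTS vector;
```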
CloudQuery config #
Create `github_to_postgresql.yaml` with the following content:

```yaml
kind: source
spec:
  name: github
  path: cloudquery/github
  registry: cloudquery
  version: "v14.1.0"
  tables: ["github_issues"]
  destinations: ["postgresql"]
  spec:
    access_token: "${GITHUB_TOKEN}" # Personal Access Token, required if not using App Authentication.
    repos: ["cloudquery/cloudquery"]
    table_options:
      github_issues:
        state: "open" # one of "open", "closed", or "all"; defaults to "all"
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.9.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: ${POSTGRES_CONNECTION_STRING}
    pgvector_config:
      tables:
        - source_table_name: github_issues
          target_table_name: github_issues_embeddings
          embed_columns: ["title", "body"] # list of columns to embed
          # metadata_columns: [] # list of columns to copy from the source table as metadata
          text_splitter:
            recursive_text: # default text splitter
              chunk_size: 1000
              chunk_overlap: 500
      openai_embedding:
        api_key: ${OPENAI_API_KEY}
        model_name: text-embedding-3-small # the model to use for embedding
        dimensions: 1536 # the number of dimensions the embedding model generates
```
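The `chunk_size` / `chunk_overlap` pair controls how issue text is split before embedding: each chunk holds up to 1,000 characters, and consecutive chunks share 500, so a sentence that straddles a boundary still appears whole in at least one chunk. Here's a simplified sketch of the idea (a plain sliding window; CloudQuery's actual recursive splitter also prefers to break on paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 500):
    """Split text into overlapping fixed-size windows."""
    step = chunk_size - chunk_overlap  # each chunk starts 500 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

body = "x" * 2500  # stand-in for a 2,500-character issue body
print([len(c) for c in chunk_text(body)])  # [1000, 1000, 1000, 1000]
```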
Run the sync:
```bash
cloudquery sync github_to_postgresql.yaml
```
CloudQuery will create and populate `github_issues`, and will also generate chunks and embeddings into `github_issues_embeddings` using pgvector.

Quick sanity checks:
```sql
-- Latest issues
SELECT number, title, state, updated_at
FROM github_issues
ORDER BY updated_at DESC
LIMIT 5;

-- Embeddings: confirm chunking and vector dimensions
SELECT LEFT(chunk, 120) AS chunk_preview,
       vector_dims(embedding) AS dims
FROM github_issues_embeddings
LIMIT 3;
```
Ask questions with a tiny Python script #
Set up a virtual environment (pyenv or venv both work). Here’s a minimal venv flow:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install openai psycopg2-binary
```
Create `main.py`:

```python
import os
import sys

import psycopg2
from openai import OpenAI

# ----------------------------
# CONFIG
# ----------------------------
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "sk-proj-.....")
POSTGRES_CONNECTION_STRING = os.getenv(
    "POSTGRES_CONNECTION_STRING",
    "host=localhost dbname=github user=postgres password=...",
)
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
COMPLETION_MODEL = os.getenv("COMPLETION_MODEL", "gpt-4o-mini")
CONTEXT_CHUNK_COUNT = 5
# ----------------------------

client = OpenAI(api_key=OPENAI_API_KEY)


def get_query_embedding(query: str):
    resp = client.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    )
    return resp.data[0].embedding


def get_most_similar_chunks(query_embedding, limit=5):
    """Searches the pgvector table for the most similar chunks."""
    conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
    cur = conn.cursor()
    # <=> is pgvector's cosine distance operator
    # (it matches the vector_cosine_ops index suggested below)
    sql = """
        SELECT chunk
        FROM github_issues_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Convert Python list to pgvector literal string: "[v1,v2,...]"
    embedding_literal = "[" + ",".join(str(v) for v in query_embedding) + "]"
    cur.execute(sql, (embedding_literal, limit))
    results = [r[0] for r in cur.fetchall()]
    cur.close()
    conn.close()
    return results


def answer_question(query):
    # Step 1: Embed the query
    query_emb = get_query_embedding(query)
    # Step 2: Retrieve top-k similar chunks
    chunks = get_most_similar_chunks(query_emb, limit=CONTEXT_CHUNK_COUNT)
    context = "\n\n".join(chunks)
    # Step 3: Ask the chat model with the retrieved context
    prompt = f"""
You are an assistant. Use the following content to answer the question.

Content:
{context}

Question:
{query}

Answer concisely:
"""
    resp = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


# ----------------------------
# USAGE
# ----------------------------
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <question>")
        print("Example: python main.py 'What are people requesting for Azure support?'")
        sys.exit(1)
    question = sys.argv[1]
    answer = answer_question(question)
    print(answer)
```
Run it:
```bash
./.venv/bin/python main.py "What are people requesting for Azure support?"
```
This was the result the OpenAI model gave us for our public CloudQuery repository's GitHub issues:
People are requesting support for the following Azure features:
1. Programmatic management of Azure Service Quotas through Azure Quota Rest APIs.
2. Update of CQ's policy support to align with the latest CIS Microsoft Azure Foundations Benchmark v1.5.0.
3. Addition of Defender endpoints data to the Azure source plugin for detecting security vulnerabilities in machines.
4. Correct documentation for the `cloud_name` options in CloudQuery configuration, specifically for AzurePublic, AzureGovernment, and AzureChina.
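A natural extension is to have answers cite their source issues. The config above left `metadata_columns` commented out; if you set it to, say, `["number", "title"]`, those source columns should be copied onto each chunk row (my reading of the config comment, worth verifying against the plugin docs), and retrieval can return them for attribution. A hypothetical variant of `get_most_similar_chunks`:

```python
def get_chunks_with_sources(query_embedding, limit=5):
    """Hypothetical variant: assumes `number` and `title` were copied into
    github_issues_embeddings via metadata_columns in the YAML above."""
    conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
    cur = conn.cursor()
    sql = """
        SELECT chunk, number, title
        FROM github_issues_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    embedding_literal = "[" + ",".join(str(v) for v in query_embedding) + "]"
    cur.execute(sql, (embedding_literal, limit))
    rows = cur.fetchall()
    cur.close()
    conn.close()
    # Label each chunk so the model can cite issue numbers in its answer
    return [f"Issue #{number} ({title}):\n{chunk}" for chunk, number, title in rows]
```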
Notes, gotchas, and tips #
- Enable pgvector: If you see `type "vector" does not exist`, run `CREATE EXTENSION vector;` in your database.
- Model dimensions must match: `text-embedding-3-small` produces 1536 dimensions; keep the YAML `dimensions` in sync with the model you use.
- Cost awareness: Embeddings incur token costs. Limit synced tables/columns or adjust chunk sizes if needed.
- Freshness: Re-run `cloudquery sync` to refresh issues and regenerate chunks when content changes.
- Indexes: For larger datasets, consider adding an index to speed up retrieval (tuning notes follow the snippet):
```sql
CREATE INDEX IF NOT EXISTS idx_github_issues_embeddings_embedding
ON github_issues_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
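`ivfflat` is an approximate index: it clusters vectors into `lists` partitions and scans only a few of them at query time, so recall depends on how many partitions get probed. Two follow-ups worth knowing about (both are standard pgvector features; check the pgvector docs for your installed version). Note that `vector_cosine_ops` pairs with the `<=>` operator used in the script, which is what lets the planner use this index:

```sql
-- Probe more partitions per query to trade latency for recall (default: 1)
SET ivfflat.probes = 10;

-- pgvector 0.5.0+ also supports HNSW, which needs no training step and
-- usually offers a better speed/recall trade-off at this scale
CREATE INDEX IF NOT EXISTS idx_github_issues_embeddings_hnsw
ON github_issues_embeddings
USING hnsw (embedding vector_cosine_ops);
```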
Where to go next #
- GitHub source plugin docs: hub.cloudquery.io
- PostgreSQL destination plugin docs: hub.cloudquery.io
- Download the CloudQuery CLI: cloudquery.io
- Our public GitHub repository: github.com/cloudquery/cloudquery
With a single config and a small script, you get high-quality semantic search over GitHub issues—right inside your own PostgreSQL instance.
CloudQuery makes your GitHub data portable and intelligent: one config, a fast sync, and embeddings managed for you. Whether you’re sorting through issues, prioritizing roadmap items, or answering customer questions, semantic search over your own issues turns noise into signal.
Ready to level up your issues workflow? Install the CLI, run the sync, and start asking questions.