

Improve GitHub Issues search effortlessly with CloudQuery, pgvector, and OpenAI

Shipping fast means making sense of a lot of noise. If your team lives in GitHub Issues, you know how hard it is to spot the themes behind requests, bugs, and "would be nice" ideas. With CloudQuery, PostgreSQL + pgvector, and a tiny OpenAI-powered script, you can turn that backlog into a searchable, conversational knowledge base—so you get answers, not just results.
Here’s a lightweight setup you can run locally in minutes.

Why do this? #

Native issue search is great for simple keyword matches, but it struggles with synonyms, context, and multi-sentence questions. By storing text embeddings in PostgreSQL with pgvector, we can perform semantic search over GitHub issues and ask questions like “What are users asking for around Azure?” — and get meaningful, contextual answers.
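To make the difference concrete, here is a sketch against the tables we'll create below (`:query_embedding` is a placeholder for the question's embedding vector, computed by the script later in this post):

-- Keyword search: literal matches only; synonyms and paraphrases are missed
SELECT number, title
FROM github_issues
WHERE title ILIKE '%azure%';

-- Semantic search: rank stored chunks by cosine distance to the question
SELECT chunk
FROM github_issues_embeddings
ORDER BY embedding <=> :query_embedding
LIMIT 5;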

What we’ll build #

  • Sync: CloudQuery pulls open issues from cloudquery/cloudquery into PostgreSQL.
  • Embed: The PostgreSQL destination’s pgvector config splits issue content into chunks and stores embeddings.
  • Ask: A small Python script embeds your question, uses pgvector similarity to fetch the most relevant issue chunks, and calls OpenAI to answer concisely.

Prerequisites #

  • PostgreSQL with pgvector: PostgreSQL 14+ (15+ recommended) with the pgvector extension installed (CloudQuery automatically enables it for your database).
  • Python 3.10+ with openai and psycopg2-binary packages.
  • OpenAI API key with access to text-embedding-3-small and a chat model (e.g., gpt-4o-mini).
  • GitHub token with repository read access (a Personal Access Token is sufficient).
Export the environment variables (the YAML below will expand them automatically):
export GITHUB_TOKEN="ghp_..."            # your GitHub PAT
export OPENAI_API_KEY="sk-..."           # your OpenAI API key
export POSTGRES_CONNECTION_STRING="host=localhost dbname=github user=postgres password=..."
# optional overrides
export EMBEDDING_MODEL="text-embedding-3-small"
export COMPLETION_MODEL="gpt-4o-mini"
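If you don't already have a pgvector-enabled PostgreSQL running, one quick option (assuming you have Docker) is the official pgvector image; you can then confirm the extension is available:

# hypothetical local instance using the pgvector Docker image
docker run -d --name pgvector-demo \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=github \
  -p 5432:5432 pgvector/pgvector:pg16

# confirm the extension is available to the server
psql "host=localhost dbname=github user=postgres password=postgres" \
  -c "SELECT name, default_version FROM pg_available_extensions WHERE name = 'vector';"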

CloudQuery config #

Create github_to_postgresql.yaml with the following content:
kind: source
spec:
  name: github
  path: cloudquery/github
  registry: cloudquery
  version: "v14.1.0"
  tables: ["github_issues"]
  destinations: ["postgresql"]
  spec:
    access_token: "${GITHUB_TOKEN}" # Personal Access Token, required if not using App Authentication.
    repos: ["cloudquery/cloudquery"]
    table_options:
      github_issues:
        state: "open" # one of "open", "closed", or "all"; defaults to "all"

---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.9.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: ${POSTGRES_CONNECTION_STRING}
    pgvector_config:
      tables: 
        - source_table_name: github_issues
          target_table_name: github_issues_embeddings
          embed_columns: ["title", "body"] # list of columns to embed
        #   metadata_columns: [] # list of columns to copy from the source table as metadata
      text_splitter:
        recursive_text: # default text splitter
          chunk_size: 1000
          chunk_overlap: 500
      openai_embedding:
        api_key: ${OPENAI_API_KEY}
        model_name: text-embedding-3-small # the model to use for embedding
        dimensions: 1536 # the number of dimensions the embedding model is generating
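For intuition, chunk_size and chunk_overlap define overlapping windows over each issue's text, so context that straddles a boundary still lands in at least one chunk. CloudQuery's recursive splitter is smarter than this (it prefers natural boundaries like paragraphs), but a naive Python sketch of the overlap idea looks like:

# naive illustration only: the real recursive splitter prefers paragraph and
# sentence boundaries rather than fixed character offsets
def naive_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 500):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = naive_chunks("long issue body ... " * 200)
print(len(chunks), len(chunks[0]))  # 1000-char chunks, each sharing 500 chars with its neighbor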
Run the sync:
cloudquery sync github_to_postgresql.yaml
CloudQuery will create and populate github_issues and will also generate chunks and embeddings into github_issues_embeddings using pgvector.
Quick sanity checks:
-- Latest issues
SELECT number, title, state, updated_at
FROM github_issues
ORDER BY updated_at DESC
LIMIT 5;

-- Embeddings: confirm chunking and vector dimensions
SELECT LEFT(chunk, 120) AS chunk_preview,
       vector_dims(embedding) AS dims
FROM github_issues_embeddings
LIMIT 3;

Ask questions with a tiny Python script #

Set up a virtual environment (pyenv or venv both work). Here’s a minimal venv flow:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install openai psycopg2-binary
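Before wiring everything together, you can optionally sanity-check your key and the embedding model with a few lines (assumes OPENAI_API_KEY is exported):

# quick check: embed a test string and confirm the dimension count
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
resp = client.embeddings.create(input="hello pgvector", model="text-embedding-3-small")
print(len(resp.data[0].embedding))  # expect 1536 for text-embedding-3-small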
Create main.py:
import psycopg2
import sys
from openai import OpenAI
import os

# ----------------------------
# CONFIG
# ----------------------------
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "sk-proj-.....")
POSTGRES_CONNECTION_STRING = os.getenv("POSTGRES_CONNECTION_STRING", "host=localhost dbname=github user=postgres password=...")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
COMPLETION_MODEL = os.getenv("COMPLETION_MODEL", "gpt-4o-mini")
CONTEXT_CHUNK_COUNT = 5
# ----------------------------

client = OpenAI(api_key=OPENAI_API_KEY)

def get_query_embedding(query: str):
    resp = client.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    )
    return resp.data[0].embedding

def get_most_similar_chunks(query_embedding, limit=5):
    """
    Searches pgvector table for the most similar chunks.
    """
    conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
    cur = conn.cursor()

    # <=> is pgvector's cosine-distance operator (<-> is Euclidean/L2 distance);
    # cosine distance matches the vector_cosine_ops index suggested below
    sql = """
    SELECT chunk
    FROM github_issues_embeddings
    ORDER BY embedding <=> %s::vector
    LIMIT %s
    """
    # Convert Python list to pgvector literal string: "[v1,v2,...]"
    embedding_literal = "[" + ",".join(str(v) for v in query_embedding) + "]"
    cur.execute(sql, (embedding_literal, limit))
    results = [r[0] for r in cur.fetchall()]
    cur.close()
    conn.close()
    return results

def answer_question(query):
    # Step 1: Embed the query
    query_emb = get_query_embedding(query)

    # Step 2: Retrieve top-k similar chunks
    chunks = get_most_similar_chunks(query_emb, limit=CONTEXT_CHUNK_COUNT)
    context = "\n\n".join(chunks)

    # Step 3: Ask GPT with context
    prompt = f"""
You are an assistant. Use the following content to answer the question.
Content:
{context}

Question:
{query}

Answer concisely:
"""
    resp = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# ----------------------------
# USAGE
# ----------------------------
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <question>")
        print("Example: python main.py 'What are people requesting for Azure support?'")
        sys.exit(1)
    
    question = sys.argv[1]
    answer = answer_question(question)
    print(answer)
Run it:
./.venv/bin/python main.py "What are people requesting for Azure support?"
This was the result the OpenAI model gave us for our public CloudQuery repository's GitHub issues:
People are requesting support for the following Azure features:

1. Programmatic management of Azure Service Quotas through Azure Quota Rest APIs.
2. Update of CQ's policy support to align with the latest CIS Microsoft Azure Foundations Benchmark v1.5.0.
3. Addition of Defender endpoints data to the Azure source plugin for detecting security vulnerabilities in machines.
4. Correct documentation for the `cloud_name` options in CloudQuery configuration, specifically for AzurePublic, AzureGovernment, and AzureChina.
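As an aside: if you'd rather not build the "[v1,v2,...]" literal by hand in get_most_similar_chunks, the optional pgvector Python package can register an adapter with psycopg2 so you can pass a NumPy array directly. A sketch, assuming pip install pgvector numpy and the names from the script above:

# alternative to the manual vector literal: register the pgvector adapter
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(POSTGRES_CONNECTION_STRING)
register_vector(conn)  # registers the vector type adapter on this connection
cur = conn.cursor()
cur.execute(
    "SELECT chunk FROM github_issues_embeddings ORDER BY embedding <=> %s LIMIT %s",
    (np.array(query_embedding), 5),
)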

Notes, gotchas, and tips #

  • Enable pgvector: If you see type "vector" does not exist, run CREATE EXTENSION vector; in your database.
  • Model dimensions must match: text-embedding-3-small produces 1536 dimensions; keep the YAML dimensions in sync with the model you use.
  • Cost awareness: Embeddings incur token costs. Limit synced tables/columns or adjust chunk sizes if needed.
  • Freshness: Re-run cloudquery sync to refresh issues and regenerate new chunks when content changes.
  • Indexes: For larger datasets, consider adding an index to speed up retrieval:
    CREATE INDEX IF NOT EXISTS idx_github_issues_embeddings_embedding
    ON github_issues_embeddings USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
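  • Recall tuning: an IVFFlat index scans ivfflat.probes lists per query (default 1). Raising it trades a little speed for better recall, e.g. per session:
    SET ivfflat.probes = 10;
    Note that the vector_cosine_ops index above accelerates the <=> (cosine distance) operator used in the script.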
    

Where to go next #

With a single config and a small script, you get high-quality semantic search over GitHub issues—right inside your own PostgreSQL instance.
CloudQuery makes your GitHub data portable and intelligent: one config, a fast sync, and embeddings managed for you. Whether you’re sorting through issues, prioritizing roadmap items, or answering customer questions, semantic search over your own issues turns noise into signal.
Ready to level up your issues workflow? Install the CLI, run the sync, and start asking questions.
 
