How We Built a Search Language for the Cloud
It started with a simple idea. We wanted users of CloudQuery to type natural, flexible queries into a search bar and instantly filter resources across AWS, Azure, and GCP. Something like:
(type:ec2 AND region:us-west-2) OR "production database"
Turning that vision into reality meant building more than a search bar. It meant inventing a mini language, one that was intuitive for users to write, easy to validate, and flexible enough to support full-text search (FTS) and logical expressions. We thought we could hack it together with some regular expressions.
We were wrong.
Users see a search bar. Underneath, it's a small programming language.
We wanted FTS + structured queries for our asset inventory, but what looked simple quickly became a deep parsing challenge.
Regex #
Like many engineers before us, we first reached for the most accessible tool: regular expressions. We started writing a homegrown parser using regex to tokenize and validate input.
At first, it worked fine for simple filters like:
region:us-east-1
But the complexity quickly exploded:
- How do we validate nested parentheses?
- What happens when a user forgets a closing )?
- How do we differentiate "quoted strings" from field pairs?
- What about AND, OR, and NOT precedence?
- Nested logic: (type:ec2 AND region:us-west-2)?
Regex devolved into chaos. One edit broke five other things. Every bug fix introduced new edge cases. And trying to validate entire nested expressions? Borderline impossible.
Peggy #
That's when we discovered Peggy, a parser generator that lets you define a grammar using Parsing Expression Grammars (PEGs) and compiles it into a working parser.
Peggy clicked immediately.
Instead of juggling brittle regex patterns, we wrote formal grammar rules like:
Expression = head:Term tail:(_ ("AND" / "OR") _ Term)* { ... }
This wasn't just easier to write, it was easier to read. Peggy gave us:
- A structured grammar we could reason about
- Syntax error feedback for invalid expressions
- An Abstract Syntax Tree (AST) to programmatically analyze and transform user input
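The `{ ... }` action in the Expression rule above is elided, so what follows is only a guess at the general shape such an action can take. In Peggy, a parenthesized sequence repeated with `*` reaches the action as an array of arrays, one inner array per repetition, so `tail` would be a list of `[whitespace, operator, whitespace, term]` entries. The sketch below folds those captures into a left-associative tree in plain JavaScript. The `buildExpression` name and the `logical-expression` node kind are hypothetical, not taken from our grammar.

```javascript
// Hypothetical helper: fold Peggy-style head/tail captures into a
// left-associative tree. Node shapes are illustrative assumptions.
function buildExpression(head, tail) {
  return tail.reduce(
    // Each tail entry is [whitespace, operator, whitespace, term];
    // the whitespace slots are skipped.
    (left, [, operator, , right]) => ({
      kind: "logical-expression",
      operator,
      left,
      right,
    }),
    head
  );
}

// Hand-built captures for the input: a AND b OR c
const head = { kind: "term", text: "a" };
const tail = [
  [" ", "AND", " ", { kind: "term", text: "b" }],
  [" ", "OR", " ", { kind: "term", text: "c" }],
];

const tree = buildExpression(head, tail);
console.log(JSON.stringify(tree, null, 2));
```

Note that a single rule shaped like this treats AND and OR as equal in precedence and groups them left to right. Giving AND tighter binding than OR is usually done by splitting the rule in two, one per precedence level.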
Let's look at a fundamental part of our grammar: how we handle comparison expressions like region="us-west-2" or instance_count>5:
ComparisonExpression = identifier:Identifier _ op:ComparisonOperator _ value:Value? finalSpace:_ {
  // `value` is null when the optional Value did not match (e.g. "region=").
  const isListValue = value && (value.kind === 'list-expression');
  const isQuotedValue = value && (value.kind === 'quoted-string');
  const hasTrailingSpace = finalSpace && finalSpace.length > 0;

  let isValueComplete = false;
  if (isListValue) {
    // Lists report their own completeness.
    isValueComplete = value.complete;
  } else if (isQuotedValue) {
    // So do quoted strings.
    isValueComplete = value.complete;
  } else {
    // Any other value counts as complete only once a trailing space follows it.
    isValueComplete = (!!value && hasTrailingSpace);
  }

  return {
    kind: 'comparison-expression',
    identifier,
    identifierLocation: identifier.location,
    op: op.value,
    opLocation: op.location,
    value: value === null ? undefined : value,
    valueLocation: value ? value.location : undefined,
    position: location(), // source range of the whole match
    text: text(),         // raw matched text
    complete: isValueComplete,
    context: 'value'
  };
}
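The completeness check inside this action can be tried on its own. The sketch below restates it in plain JavaScript, outside the grammar. The hand-built value objects stand in for whatever the Value rule returns, and the `word` kind is a made-up placeholder for any value that is neither a list nor a quoted string.

```javascript
// Plain-JavaScript restatement of the completeness check from the rule.
// `value` mimics a Value result (hypothetical shapes); `finalSpace` mimics
// the whitespace captured after the value.
function isValueComplete(value, finalSpace) {
  if (!value) return false; // operator typed, but no value yet
  if (value.kind === "list-expression" || value.kind === "quoted-string") {
    return value.complete; // these kinds carry their own flag
  }
  // Anything else counts as finished only once a trailing space is typed.
  return finalSpace.length > 0;
}

const bareWord = { kind: "word", text: "us-west-2" };
const openQuote = { kind: "quoted-string", text: '"us-we', complete: false };
const closedQuote = { kind: "quoted-string", text: '"us-west-2"', complete: true };

console.log(isValueComplete(null, ""));        // false: region=
console.log(isValueComplete(bareWord, ""));    // false: still typing the word
console.log(isValueComplete(bareWord, " "));   // true: trailing space ends it
console.log(isValueComplete(openQuote, ""));   // false: quote not closed
console.log(isValueComplete(closedQuote, "")); // true: closing quote ends it
```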
This rule matches expressions like region="us-west-2" and builds a structured object with the field, operator, and value. It also tracks position information for syntax highlighting and validation, as well as completeness for conversion to a chip in the filter bar.
Chomsky and Compilers #
By this point, it was clear: we weren't just building a search bar. We were designing a domain-specific language (DSL) for querying cloud infrastructure.
What we were experiencing was a journey that countless language designers had walked before us. All of them had confronted the same fundamental challenge: how do you build a language that humans can write and machines can understand?
It turns out our journey from regex to formal grammar has deep theoretical roots. In the 1950s, linguist Noam Chomsky developed a hierarchy of formal grammars that still shapes how we think about parsing and language processing today.
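Chomsky’s framework divides all possible grammars into four nested “power levels,” a bit like progressively smarter pattern-matchers:
- Type 3, regular grammars: the languages regular expressions can describe, recognized by finite automata
- Type 2, context-free grammars: add arbitrary nesting such as balanced parentheses, recognized by pushdown automata
- Type 1, context-sensitive grammars: recognized by linear bounded automata
- Type 0, unrestricted grammars: anything a Turing machine can recognize
Think of it as concentric circles: each outer ring can do everything the inner ring can, plus more. We jumped from Type 3 (regex) to roughly Type 2 with Peggy so we could support nested parentheses, operator precedence, and clean error handling without hand-written hacks. (Strictly speaking, PEGs are a separate formalism rather than a tier of the hierarchy, but for our purposes they handle the same kind of nested structure that context-free grammars do.)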
What we discovered maps directly to this framework. Our first approach with regex operated at the Type 3 (regular grammar) level, which worked fine for simple field:value matching and basic text search. But as soon as our users needed nested parentheses and logical operators with precedence rules, we hit a wall.
The hierarchy explains exactly why our regex approach failed. Regular expressions are a notation for regular languages, the languages generated by regular grammars (Type 3 in Chomsky's hierarchy). But our search language needed nested expressions and balanced parentheses, features that require at least a Type 2 grammar.
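The practical difference fits in a few lines. Checking balanced parentheses means remembering how deep the nesting currently is, and that depth has no upper bound. A finite automaton has a fixed number of states, so it can't keep that count; a stack can. Here is a minimal sketch. The function name is illustrative, and it deliberately ignores parentheses inside quoted strings.

```javascript
// Returns the index of the first parenthesis problem, or -1 if balanced.
function findParenError(query) {
  const openPositions = []; // stack of indexes of "(" still waiting for ")"
  for (let i = 0; i < query.length; i++) {
    if (query[i] === "(") {
      openPositions.push(i);
    } else if (query[i] === ")") {
      if (openPositions.length === 0) return i; // ")" with nothing to close
      openPositions.pop();
    }
  }
  // Anything left on the stack was never closed; report the innermost one.
  return openPositions.length > 0
    ? openPositions[openPositions.length - 1]
    : -1;
}

console.log(findParenError("(type:ec2 AND region:us-west-2)")); // -1
console.log(findParenError("(type:ec2 AND region:us-west-2"));  // 0
console.log(findParenError("type:ec2)"));                       // 8
```

The stack is the extra memory that separates a pushdown automaton from a finite one, which is why this check is easy here and painful in a single regular expression.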
Okay, but what does this actually mean? #
Our text parser is essentially implementing what computer scientists call a context-free grammar, the same class of grammar that powers programming languages. This theoretical foundation gives us (and our users) practical benefits:
- Consistent error messages: when you make a syntax mistake, we know exactly what went wrong
- Predictable behavior: the parser follows formal rules, not regex hacks
- Extensibility: we can add new features without breaking existing queries
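To make "formal rules" concrete, here is a hand-written recursive-descent sketch for a tiny subset of the language: bare terms, NOT, AND, OR, and parentheses. It is not our Peggy-generated parser, and the rule names in the comment are assumptions chosen for illustration. It also splits on whitespace, so quoted strings containing spaces are out of scope. What it does show is that one function per grammar rule yields operator precedence and positioned error messages almost for free.

```javascript
// Illustrative grammar (assumed, not the production one):
//   Or   <- And ("OR" And)*
//   And  <- Not ("AND" Not)*
//   Not  <- "NOT" Not / Atom
//   Atom <- "(" Or ")" / term

function tokenize(input) {
  // Parentheses are their own tokens; everything else splits on whitespace.
  return [...input.matchAll(/[()]|[^\s()]+/g)].map((m) => ({
    text: m[0],
    offset: m.index,
  }));
}

function parseQuery(input) {
  const tokens = tokenize(input);
  let pos = 0;

  const peek = () => (pos < tokens.length ? tokens[pos].text : null);
  const fail = (expected) => {
    const token = tokens[pos];
    const offset = token ? token.offset : input.length;
    const found = token ? `"${token.text}"` : "end of input";
    throw new SyntaxError(
      `Expected ${expected} but found ${found} at offset ${offset}`
    );
  };

  function parseOr() {
    let left = parseAnd();
    while (peek() === "OR") {
      pos++;
      left = { op: "OR", left, right: parseAnd() };
    }
    return left;
  }

  function parseAnd() {
    let left = parseNot();
    while (peek() === "AND") {
      pos++;
      left = { op: "AND", left, right: parseNot() };
    }
    return left;
  }

  function parseNot() {
    if (peek() === "NOT") {
      pos++;
      return { op: "NOT", operand: parseNot() };
    }
    return parseAtom();
  }

  function parseAtom() {
    const text = peek();
    if (text === "(") {
      pos++;
      const inner = parseOr();
      if (peek() !== ")") fail('")"');
      pos++;
      return inner;
    }
    if (text === null || text === ")" || text === "AND" || text === "OR") {
      fail("a term");
    }
    pos++;
    return { term: text };
  }

  const ast = parseOr();
  if (pos < tokens.length) fail("AND, OR, or end of input");
  return ast;
}

// AND binds tighter than OR because parseOr delegates to parseAnd.
console.log(JSON.stringify(parseQuery("a OR b AND c")));

try {
  parseQuery("(type:ec2 AND region:us-west-2");
} catch (error) {
  console.log(error.message);
}
```

Adding a new operator or precedence level means adding one more function in the chain, which is the hand-written equivalent of adding one rule to a grammar.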
The power of this approach becomes obvious when you look at complex search expressions that our users can now write:
resource_type = "aws_ec2_instances" AND region = "us-west-2" AND instance.state = "running....
Lessons Learned #
Here's what surprised us the most: a problem that started as "let's build a better search bar" led us straight into the intersection of computer science, linguistics, and cognitive science.
This isn't a coincidence. The fundamental challenge of cloud infrastructure, managing complexity through abstraction, mirrors the challenges of human language itself. Both require systems that can express complex ideas through the composition of simpler elements.
By implementing a formal grammar for cloud resource queries, we're not just making it easier to find EC2 instances. We're creating a structured way for humans to communicate their intent to machines, bridging the gap between natural language flexibility and computational precision.
As engineers, we often focus on optimizing algorithms and data structures. But sometimes the most powerful tool is reconceptualizing the problem itself, in this case, recognizing that search is fundamentally a language problem.
Next Steps #
We're just getting started with our search language. Future improvements include:
- Extend grammar: add aliases, macros, saved filters
- Build UI helpers: autocomplete, linting, syntax hints
- Leverage this across CloudQuery: alerts, dashboards, rule engines
Final Thoughts #
What started as a UI improvement turned into a system design problem. Along the way, we rediscovered the hidden power of language theory.
- Regex is not a parser. Don't try to write your own unless you're doing something trivially simple.
- Use a parser generator. Tools like Peggy make it easier to define a grammar and maintain it.
- Design like a language. If your UI input is complex, it's probably a DSL—treat it as such.
- Humanities are not optional. Understanding syntax, ambiguity, and user language is key to building better tools.
Our implementation in the CloudQuery frontend repository shows how the grammar handles various expression types:
- Simple field queries: region=us-west-2
- Text searches: "production database"
- Logical operators: ec2 AND production
- Parenthesized groups: (type:rds OR type:aurora) AND environment:prod
- List operations: region IN (us-west-1, us-west-2, us-east-1)
- Negation: NOT type:s3
The next time you're tempted to reach for regex to parse complex input, remember: you might actually be building a language. Give parser generators like Peggy a look. Your future self will thank you.
About CloudQuery #
CloudQuery is a developer-first cloud governance platform designed to provide security, compliance, and FinOps teams complete visibility into their cloud assets. By leveraging SQL-driven flexibility, CloudQuery enables you to easily query, automate, and optimize your cloud infrastructure's security posture, compliance requirements, and operational costs at scale.
The full-text search implementation we've described in this post is a perfect example of how we've engineered our platform to make cloud infrastructure more accessible and manageable. With it, users can instantly find any resource across their cloud inventory during critical situations like incident response, troubleshooting, or security investigations.