How We Accidentally Invented a Cloud Search Language
All we wanted was a box where you could type something like:
(type:ec2 AND region:us-west-2) OR "production database"
…and instantly see matching assets across AWS, Azure, and GCP. Easy, right?
Turning that vision into reality meant building more than a search bar. It meant inventing a mini language, one that was intuitive for users to write, easy to validate, and flexible enough to support full-text search (FTS) and logical expressions. We thought we could hack it together with some regular expressions.
We were wrong.
Users see a search bar. Underneath, it's a small programming language.
We wanted FTS + structured queries for our asset inventory, but what looked simple quickly became a deep parsing challenge.
Regex experiment — unreadable and fragile #
At first, it worked fine for simple filters like:
region:us-east-1
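A first pass looked something like this (a reconstruction for illustration, not our exact production pattern):

// Naive first pass: match a single field:value pair.
// Reconstruction for illustration; not our exact production regex.
const FIELD_PAIR = /^(\w+):("[^"]*"|\S+)$/;

const m = 'region:us-east-1'.match(FIELD_PAIR);
// m[1] === 'region', m[2] === 'us-east-1'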
However, the complexity quickly exploded:
- How do we validate nested parentheses?
- What happens when a user forgets a closing )?
- How do we differentiate "quoted strings" from field pairs?
- What about AND, OR, and NOT precedence?
- How do we handle nested logic like (type:ec2 AND region:us-west-2)?
Plus, it was nearly impossible for human beings to read. Every tweak to support a new requirement (nested groups, quoted strings, AND/OR precedence) felt like defusing a bomb. One change would fix one edge case and break five others.
Regex excels at straight-line patterns. It shines when you want to match an email or a simple field pair. But once you ask it to balance tokens or obey operator precedence, things crumble. We ran into three core headaches:
- Nesting: Regex can’t reliably match balanced parentheses without insane hacks. Once you try, maintenance becomes a nightmare.
- Logical operators: Handling AND, OR, NOT in the right order forced us into nested lookarounds that barely worked.
- Opaque errors: A malformed input like (type:ec2 AND would trigger a generic “no match” response, with zero guidance on what to fix.
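To make that last point concrete, here's roughly what the failure mode looks like (the pattern below is a simplified stand-in for our real attempts):

// Simplified stand-in for our grouping attempts; the real patterns were worse.
const GROUPED = /^\(?(\w+):(\S+)(?: (AND|OR) (\w+):(\S+))*\)?$/;

GROUPED.test('(type:ec2 AND region:us-west-2)'); // true
GROUPED.test('(type:ec2 AND');                   // false -- but why?
GROUPED.test('(type:ec2');                       // true -- unbalanced, silently accepted
// A failed match is just false/null: no offset, no "expected X" hint.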
Out of pure curiosity, we timed how a single regex handles deeper nesting and quoted content on Linux 4.4 / Python 3.11.8 (32 cores), at three nesting depths:
- Depth 1: ()
- Depth 5: ((((()))))
- Depth 10: (((((((((())))))))))
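Our harness was Python, but the idea fits in a few lines; here's a rough Node re-creation (illustrative only, and note the pattern itself caps out at shallow nesting):

// Rough re-creation of the timing idea (our original harness ran on
// Python 3.11.8; treat this as illustrative, not our measurements).
const NESTED = /^\((?:[^()]|\([^()]*\))*\)$/; // only copes with ~2 levels

for (const depth of [1, 5, 10]) {
  const input = '('.repeat(depth) + ')'.repeat(depth);
  const start = process.hrtime.bigint();
  NESTED.test(input);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`depth ${depth}: ${ms.toFixed(3)} ms`);
}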
What we saw: even at depth 10, the regex stayed comfortably in sub-millisecond territory. But raw speed was never the blocker; our ability to understand and evolve the pattern was. So we went looking for a more human-friendly way to handle complex queries.
Peggy #
That's when we discovered Peggy, a parser generator that lets you define a grammar using Parsing Expression Grammars (PEGs) and compiles it into a working parser.
Peggy clicked immediately.
Instead of juggling brittle regex patterns, we wrote formal grammar rules like:
Expression = head:Term tail:(_ ("AND" / "OR") _ Term)* { ... }
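Here's a condensed sketch of what that rule looks like filled in (simplified from our production grammar; names and details abridged):

Expression
  = head:Term tail:(_ ("AND" / "OR") _ Term)* {
      // Fold the flat (operator, term) pairs into a left-associative tree.
      return tail.reduce(
        (left, [, op, , right]) => ({ kind: op.toLowerCase(), left, right }),
        head
      );
    }

Term
  = "(" _ expr:Expression _ ")" { return expr; }
  / Comparison

Comparison
  = field:Identifier ":" value:Identifier {
      return { kind: "comparison", field, value };
    }

Identifier
  = chars:[A-Za-z0-9_-]+ { return chars.join(""); }

_ "whitespace"
  = [ \t]*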
Want proof? This is the AST for (type:ec2 OR type:rds) AND region=us-west-2:
{
  "kind": "and",
  "left": { "kind": "or", "...": "..." },
  "right": { "kind": "comparison", "field": "region", "value": "us-west-2" }
}
This wasn't just easier to write; it was easier to read. Peggy gave us:
- A structured grammar we could reason about
- Syntax error feedback for invalid expressions
- An Abstract Syntax Tree (AST) to programmatically analyze and transform user input
Let's look at a fundamental part of our grammar: how we handle comparison expressions like region="us-west-2" or instance_count>5:
ComparisonExpression = identifier:Identifier _ op:ComparisonOperator _ value:Value? finalSpace:_ {
  const isListValue = value && (value.kind === 'list-expression');
  const isQuotedValue = value && (value.kind === 'quoted-string');
  const hasTrailingSpace = finalSpace && finalSpace.length > 0;

  // Lists and quoted strings track their own completeness (a closing
  // bracket or quote); bare values are complete once a space follows.
  let isValueComplete = false;
  if (isListValue) {
    isValueComplete = value.complete;
  } else if (isQuotedValue) {
    isValueComplete = value.complete;
  } else {
    isValueComplete = (!!value && hasTrailingSpace);
  }

  return {
    kind: 'comparison-expression',
    identifier,
    identifierLocation: identifier.location,
    op: op.value,
    opLocation: op.location,
    value: value === null ? undefined : value,
    valueLocation: value ? value.location : undefined,
    position: location(), // Peggy built-in: span of this match
    text: text(),         // Peggy built-in: raw matched text
    complete: isValueComplete,
    context: 'value'
  }
}
This rule matches expressions like region="us-west-2" and builds a structured object with the field, operator, and value. It also tracks position information for syntax highlighting and validation, plus a completeness flag that tells the filter bar when the expression can be converted into a chip. That ComparisonExpression rule is just one cog in the machine. From there, every query follows the same clear path:
- Input String: You type a raw query, such as (type:ec2 AND region:us-west-2) OR "production database".
- Lexer: The lexer breaks the string into a stream of tokens: identifiers, operators, parentheses, quoted strings. Think of it as chopping your sentence into words and punctuation marks.
- PEG Engine: Peggy applies our grammar rules to those tokens. It recognizes structures (expressions, comparisons, lists) and enforces operator precedence. This step either succeeds (everything matches) or it pinpoints syntax errors.
- AST (Abstract Syntax Tree): The grammar produces a tree representation of your query. Each node is a chunk of meaning: "AND" nodes combine sub-expressions, comparison nodes hold field/operator/value, and so on. This tree is easy for code to inspect and transform.
- Planner: Our planner walks the AST and figures out how to execute it efficiently. It decides on index scans, filter pushdowns, full-text search segments, and how to combine results.
- Query Plan: Finally, you end up with a concrete plan, often a data structure or SQL statement, that the engine will run. This is the stage where performance optimizations (memoization hints, rule ordering) pay off.
By laying out each stage—lexer, PEG engine, AST, planner, query plan—you see why a formal grammar pays dividends. Instead of tangled regex backtracking, you get clear, maintainable steps that map exactly to how we process your input.
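If you want to see the front half of those stages end to end, this is the general shape (a minimal sketch using Peggy's standard API; grammarSource is a placeholder for rules like the sketch above, and our planner stage is elided):

import * as peggy from "peggy";

// grammarSource is assumed to hold the grammar rules.
const parser = peggy.generate(grammarSource);

try {
  // Tokenizing and PEG matching happen inside parse(); out comes the AST.
  const ast = parser.parse('(type:ec2 AND region:us-west-2)');
  // ...hand `ast` to the planner to build the query plan.
} catch (err) {
  // Peggy syntax errors carry a message and an exact source location.
  console.error(err.message, err.location?.start);
}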
Chomsky and Compilers #
By this point, it was clear: we weren't just building a search bar. We were designing a domain-specific language (DSL) for querying cloud infrastructure.
What we were experiencing was a journey that countless language designers had walked before us. All of them had confronted the same fundamental challenge: how do you build a language that humans can write and machines can understand?
It turns out our journey from regex to formal grammar has deep theoretical roots. In the 1950s, linguist Noam Chomsky developed a hierarchy of formal grammars that still shapes how we think about parsing and language processing today.
Chomsky's framework divides all possible grammars into four nested "power levels", a bit like progressively smarter pattern-matchers:
- Type 3 (Regular): flat token patterns, the territory of classic regex
- Type 2 (Context-Free): recursive, nested structure, what programming languages need
- Type 1 (Context-Sensitive): rules that can depend on surrounding context
- Type 0 (Unrestricted): anything a Turing machine can recognize
Think of it as concentric circles: each outer ring can do everything the inner ring can, plus more. We jumped from Type 3 (regex) to Type 2 (Peggy) so we could support nested parentheses, operator precedence, and clean error handling without hand-written hacks.
What we discovered maps directly to this framework. Our first approach with regex operated at the Type 3 (regular grammar) level, which worked fine for simple field:value matching and basic text search. But as soon as our users needed nested parentheses and logical operators with precedence rules, we hit a wall. The hierarchy explains exactly why: regular expressions are symbolic notations for the regular languages generated by Type 3 grammars, while our search language needed nested expressions and balanced parentheses, features that require at least a Type 2 (context-free) grammar.
Okay, but what does this actually mean? #
Our text parser is essentially implementing what computer scientists call a context-free grammar, the same class of grammar that powers programming languages. This theoretical foundation gives us (and our users) practical benefits:
- Consistent error messages: when you make a syntax mistake, we know exactly what went wrong
- Predictable behavior: the parser follows formal rules, not regex hacks
- Extensibility: we can add new features without breaking existing queries
The power of this approach becomes obvious when you look at complex search expressions that our users can now write:
resource_type = "aws_ec2_instances" AND region = "us-west-2" AND instance.state = "running"
Lessons Learned #
Here's what surprised us the most: a problem that started as "let's build a better search bar" led us straight into the intersection of computer science, linguistics, and cognitive science.
This isn't a coincidence. The fundamental challenge of cloud infrastructure, managing complexity through abstraction, mirrors the challenges of human language itself. Both require systems that can express complex ideas through the composition of simpler elements.
By implementing a formal grammar for cloud resource queries, we're not just making it easier to find EC2 instances. We're creating a structured way for humans to communicate their intent to machines, bridging the gap between natural language flexibility and computational precision.
As engineers, we often focus on optimizing algorithms and data structures. But sometimes the most powerful tool is reconceptualizing the problem itself, in this case, recognizing that search is fundamentally a language problem.
Next Steps #
We're just getting started with our search language. Future improvements include:
- Extend grammar: add aliases, macros, saved filters
- Build UI helpers: autocomplete, linting, syntax hints
- Leverage this across CloudQuery: alerts, dashboards, rule engines
Final Thoughts #
What started as a UI improvement turned into a system design problem. Along the way, we rediscovered the hidden power of language theory.
- Regex is not a parser. Don't try to write your own unless you're doing something trivially simple.
- Use a parser generator. Tools like Peggy make it easier to define a grammar and maintain it.
- Design like a language. If your UI input is complex, it's probably a DSL—treat it as such.
- Humanities are not optional. Understanding syntax, ambiguity, and user language is key to building better tools.
Our implementation in the CloudQuery frontend repository shows how the grammar handles various expression types:
- Simple field queries: region=us-west-2
- Text searches: "production database"
- Logical operators: ec2 AND production
- Parenthesized groups: (type:rds OR type:aurora) AND environment:prod
- List operations: region IN (us-west-1, us-west-2, us-east-1)
- Negation: NOT type:s3
The next time you're tempted to reach for regex to parse complex input, remember: you might actually be building a language. Give parser generators like Peggy a look; your future self will thank you.
About CloudQuery #
CloudQuery is a developer-first cloud governance platform designed to provide security, compliance, and FinOps teams complete visibility into their cloud assets. By leveraging SQL-driven flexibility, CloudQuery enables you to easily query, automate, and optimize your cloud infrastructure's security posture, compliance requirements, and operational costs at scale.
For more information on how CloudQuery can help with your specific use case, check out our documentation or join our community. You can also follow us on Twitter and LinkedIn for the latest updates.