How to Build a Custom Query Language on Apache Spark with Lark

You don't need to build a custom query engine to give analysts a domain-specific language. Nchammas's EHQL (Entity History Query Language) compiles to Apache Spark jobs from a Lark grammar of about 50 lines of EBNF, with no storage format or execution engine to design.

Grammar as a Compilation Target

EHQL targets vehicle maintenance analysts who work with Parquet data in a central catalog. Instead of building a custom storage layer (a massive undertaking), Nchammas defines a grammar in Lark and transforms the parse tree into Spark queries. The key insight: your grammar is just a compilation frontend for an existing execution engine.

Lark grammars are written in an EBNF dialect with some Python-specific conveniences. The minimal EHQL grammar covers queries of the form history contains: "oil change" "transmission fluid change". It imports Lark's ESCAPED_STRING terminal under the name QUOTED_STRING, and two %ignore directives skip inline whitespace and SQL-style comments, so a query parses identically whether or not it carries extra spaces between tokens or a trailing -- comment.

Why Lark Beats ANTLR for Python

Nchammas compared Lark with ANTLR, the widely used parser generator from the JVM world. By his survey, Lark is faster, more popular in the Python ecosystem, and more feature-rich. ANTLR can also generate Python parsers, but he reports no friction with Lark's design: the grammar file is self-contained, uses an EBNF-style notation, and produces a parse tree you can walk with a transformer or a visitor.

Significant Indentation in 3 Rules

EHQL uses Python-style indentation for grouping. Lark supports this with %declare _INDENT _DEDENT and a custom _NEWLINE terminal that matches the line break together with any trailing comment and the next line's leading whitespace. The _history_body rule requires a newline after each pattern. The _INDENT and _DEDENT terminals have no pattern of their own; they are injected by a post-lexing step (typically a subclass of Lark's Indenter class) whenever the indentation recorded in a _NEWLINE token changes. Keeping the stages separate (lexing raw characters into tokens, then parsing tokens into a tree) makes indentation handling clean and testable.

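The same technique supports inline comments like -- this is a comment because SQL_COMMENT is part of the _NEWLINE definition. The comment is consumed as part of the _NEWLINE token, so it never appears in the parse tree, while the line and column numbers Lark records on the remaining tokens stay accurate for error reporting.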

What This Unlocks

Nchammas's approach applies to any domain where the data already sits in Parquet in a central catalog and analysts need a constrained query language. By restricting the grammar to a few patterns, you avoid the complexity of SQL while retaining the power of Spark's distributed execution. The next step is adding aggregations and temporal filters, both achieved by extending the grammar and writing transformation functions. No custom engine required.

That is the whole point: compile your domain language down to Spark, not reinvent Spark itself.

Source: Implementing a Custom Query Language with Python and Apache Spark
Domain: nchammas.com