For ATL model transformations, Pass@1 shows zero improvement across every prompting strategy and every LLM tested. That single result captures the gap between syntactic polish and semantic understanding in LLM-generated code.
Few-Shot Prompting: Syntax Up, Semantics Stuck
The paper's LLM4MTLs workflow systematically builds prompts that combine few-shot examples, grammar prompting, and helper methods. Across four model transformation languages (ATL, ETL, QVTo, and the Reactions language) and three LLMs, one result is consistent: few-shot prompting lifts syntactic quality for every language. The generated code is more likely to parse and conform to the target metamodel.
But semantic correctness tells a different story. Gains are uneven and heavily language-dependent. For ATL, no strategy budges Pass@1 at all. The authors note that few-shot prompting improves surface-level syntax more readily than deep transformation semantics. You get structurally valid code that still does the wrong thing.
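For readers who have not met the metric: Pass@1 is the chance that a single generated sample passes every test for a task. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a task and c is the number that pass. Whether the paper computes Pass@1 with exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a prompting strategy only moves Pass@1 if it turns failing samples into passing ones.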
Grammar Prompting Needs a Partner
Grammar prompting stabilizes code generation when combined with few-shot examples, keeping the LLM within the syntactic guardrails. In isolation, it can be ineffective or even counterproductive for certain model-language combinations. The authors found that including helper methods can amplify the effect, but only in specific contexts.
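To make the combinations concrete, here is a minimal sketch of how a prompt could be assembled from the three ingredients. Everything in it is hypothetical: `PromptParts`, `build_prompt`, the section labels, and the section order are illustrative assumptions, not the paper's actual prompt templates. The ATL rule is a toy example and the grammar and helper strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PromptParts:
    """Hypothetical container for the optional prompt ingredients."""
    task: str
    few_shot: List[Tuple[str, str]] = field(default_factory=list)  # (task, solution)
    grammar: Optional[str] = None
    helpers: List[str] = field(default_factory=list)


def build_prompt(parts: PromptParts) -> str:
    """Join whichever ingredients are present into one prompt string."""
    sections: List[str] = []
    if parts.grammar:
        sections.append("Grammar of the target language:\n" + parts.grammar)
    if parts.helpers:
        sections.append("Available helper methods:\n" + "\n".join(parts.helpers))
    for i, (description, solution) in enumerate(parts.few_shot, start=1):
        sections.append(f"Example {i}\nTask: {description}\nSolution:\n{solution}")
    # The actual task always comes last so the model completes the solution.
    sections.append(f"Task: {parts.task}\nSolution:")
    return "\n\n".join(sections)


# Toy few-shot example; the rule is illustrative, not taken from the paper.
toy_example = (
    "Copy each A to a B, keeping the name.",
    "rule A2B {\n  from a : In!A\n  to b : Out!B ( name <- a.name )\n}",
)

grammar_plus_few_shot = build_prompt(
    PromptParts(
        task="Copy each Person to a Contact, keeping the name.",
        few_shot=[toy_example],
        grammar="<grammar excerpt goes here>",  # placeholder, not a real grammar
    )
)
```

Leaving `few_shot` empty while setting `grammar` gives the grammar-only configuration that the results flag as unreliable on its own.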
LLM choice itself influences syntactic correctness and similarity for ETL and QVTo, but its influence on semantic correctness remains limited across the board. No model magically understands transformation logic better.
What This Means for Domain-Specific Code Gen
The paper delivers a controlled empirical result that matches what anyone who has tried to get an LLM to write complex transformations has felt: you can fix the shape of the output, but not the reasoning. The evaluation suite, which covers four model transformation languages with executable reference scripts and manually written test suites, is a solid contribution: it lets future work measure against a real baseline.
For now, the takeaway is straightforward: if you need a syntactically valid first draft of a model transformation, few-shot prompting is your friend. If you need the transformation to actually be correct, you still need to check every line.
Source: LLM4MTLs: Automated Generation and Empirical Evaluation of Model Transformation Languages
Domain: arxiv.org