
Summarize Millions of Logs in One SQL Line: BigQuery AI.AGG()

A single SQL function now synthesizes unstructured text and images across millions of rows, using multi-level LLM aggregation to surface trends without manual review.

Tags: Google Cloud, BigQuery, AI.AGG, SQL, large language models, data analysis

Google Cloud's BigQuery just shipped AI.AGG(), a single SQL function that can summarize millions of rows of log messages, product descriptions, or images in one query, with no external pipeline and no Python glue code.

How AI.AGG() Handles Millions of Rows Without Busting Context Windows

LLMs choke on massive inputs. AI.AGG() solves that by automatically dividing your rows into batches, aggregating each batch, then aggregating those intermediate results into a final answer. You never manually manage context windows. Each row must fit within the model's context window (otherwise it's skipped), but smaller rows give the batching algorithm more flexibility.
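
The batching idea can be sketched in a few lines of Python. This is a conceptual illustration only, not BigQuery's implementation: the function names, the character-based budget, and the caller-supplied `summarize` callback (a stand-in for a model call) are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def pack_batches(texts: List[str], budget: int) -> List[List[str]]:
    """Greedily group texts so each batch stays within `budget` characters.

    Texts longer than the budget are skipped, mirroring the rule that a row
    which does not fit in the context window cannot be processed.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        if len(text) > budget:
            continue  # oversized input: skipped
        if current and used + len(text) > budget:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += len(text)
    if current:
        batches.append(current)
    return batches


def multi_level_aggregate(
    rows: List[Optional[str]],
    summarize: Callable[[List[str]], str],  # hypothetical stand-in for an LLM call
    budget: int,
) -> Optional[str]:
    """Aggregate batches, then aggregate the intermediate results, until one remains.

    Assumes `summarize` returns text shorter than the budget; otherwise the
    intermediate results cannot be packed together on the next level.
    """
    level = [r for r in rows if r is not None]  # NULL rows are ignored
    while True:
        batches = pack_batches(level, budget)
        if not batches:
            return None  # nothing usable to aggregate
        results = [summarize(batch) for batch in batches]
        if len(results) == 1:
            return results[0]
        if len(results) >= len(level):
            raise ValueError("intermediate results are not shrinking; raise the budget")
        level = results
```

Shorter rows let `pack_batches` fit more inputs into each call, which is the flexibility the paragraph above refers to.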

Token usage can spike because of the multi-level structure: the model reads every input row once, then reads the intermediate results again at each higher level. Google's own advice: always reduce input tokens with LIMIT or pre-filtering before calling AI.AGG(). You can pin a specific model endpoint, either by short name (for example, gemini-2.5-flash) or by a fully qualified URL that points to a global endpoint if your region doesn't support the desired model.
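
A rough back-of-the-envelope calculation shows why pre-filtering pays off. The sketch below assumes a fixed, hypothetical number of inputs per model call; the source does not state the real batch sizes, so the figures are illustrative only.

```python
def count_model_calls(n_rows: int, batch_size: int) -> int:
    """Count model calls when each call folds up to `batch_size` inputs into one output.

    `batch_size` is a hypothetical parameter for this estimate, not a
    documented BigQuery setting.
    """
    if n_rows < 1 or batch_size < 2:
        raise ValueError("need at least one row and a batch size of 2 or more")
    calls = 0
    remaining = n_rows
    while True:
        # Ceiling division: the number of batches on this level.
        remaining = (remaining + batch_size - 1) // batch_size
        calls += remaining
        if remaining == 1:
            return calls


full_table = count_model_calls(1_000_000, 100)  # 10,000 + 100 + 1 calls
filtered = count_model_calls(10_000, 100)       # 100 + 1 calls
```

Under these assumptions, trimming the input from a million rows to ten thousand cuts the number of calls by roughly a factor of 100, on top of the tokens saved by not reading the discarded rows at all.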

From Logs to Categories: Two Concrete Examples

The BigQuery engineering team used AI.AGG() on Apache Spark INFO logs (public Loghub dataset) to surface hidden inefficiencies like memory thrashing or clock drift, even though no FATAL error appeared. The prompt explicitly tells the model it can say "everything is fine," which discourages hallucinated errors while the model hunts for real anomalies.

For structured categorization, AI.AGG() can be prompted to return plain text or JSON. A cymbal_pets (fictional pet supply) dataset demonstrates the full pipeline: first, AI.AGG() identifies product categories from names and descriptions as a JSON array. That array feeds into AI.CLASSIFY() to label every product. The combination runs in one SQL script that covers both category discovery and labeling, with no separate ML step. Multimodal support means you can pass image URIs from a Cloud Storage external object table and get categories from the visual content alone.

Production Gotchas: Token Budgets, NULL Rows, and Endpoint Selection

AI.AGG() automatically skips NULL input rows, but watch out for STRUCT fields: the function concatenates struct fields like CONCAT(), so a single NULL field in a struct makes the entire row NULL and drops it silently. Use IFNULL() to provide fallback strings. Google's example shows exactly that pattern for products missing descriptions.
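
The NULL-propagation trap is easy to reproduce outside the database. The Python sketch below imitates the semantics of SQL CONCAT() and IFNULL(), with `None` standing in for NULL; the helper names and the two sample products are made up for illustration and are not taken from the cymbal_pets dataset.

```python
from typing import Optional


def concat_fields(*fields: Optional[str]) -> Optional[str]:
    """Join fields the way SQL CONCAT() does: any NULL (None) input makes the result NULL."""
    if any(field is None for field in fields):
        return None
    return "".join(fields)


def ifnull(value: Optional[str], fallback: str) -> str:
    """Mimic SQL IFNULL(): use the fallback only when the value is NULL."""
    return fallback if value is None else value


# Hypothetical sample rows: (name, description).
products = [
    ("Squeaky Bone", "Rubber chew toy for small dogs."),
    ("Cat Tower", None),  # missing description
]

unguarded = [concat_fields(name, ": ", desc) for name, desc in products]
guarded = [
    concat_fields(name, ": ", ifnull(desc, "No description")) for name, desc in products
]

# Rows that collapse to NULL would be skipped by the aggregation.
kept_unguarded = [row for row in unguarded if row is not None]
kept_guarded = [row for row in guarded if row is not None]
```

Without the fallback, "Cat Tower" disappears from the input; with it, both products survive.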

Error handling yields partial results. Failed rows are excluded; you check job statistics to see how many rows were rejected, just like with AI.IF() and AI.SCORE(). The output is always a string, even if you prompt for JSON or Markdown. The database engine doesn't enforce formatting, so build schema validation downstream.
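
Downstream validation can be as small as one function. The sketch below assumes the prompt asked for a JSON array of category names; the function name and the specific checks are illustrative choices, not part of any BigQuery API.

```python
import json
from typing import List


def parse_category_list(raw: str) -> List[str]:
    """Validate that a model's string output is a JSON array of non-empty strings."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(value, list):
        raise ValueError("expected a JSON array")
    cleaned: List[str] = []
    for item in value:
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"expected non-empty strings, got {item!r}")
        cleaned.append(item.strip())
    return cleaned
```

Failing loudly here keeps a malformed response from flowing into a later step, such as a classification pass that expects a clean list of categories.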

With AI.AGG() in preview, every BigQuery user can now build SQL-driven LLM pipelines that scale. Expect to see this pattern embedded into data warehouse workflows within the next quarter.


Source: Synthesize the big picture and analyze trends with BigQuery's AI.AGG function
Domain: cloud.google.com
