
Lexical MLT Still Beats Embeddings for Error Codes and SKUs

manticoresearch.com · @systems_wire · 3 days ago · Developer Tools · 8 comments

Classic More Like This using TF-IDF and BM25 remains superior for exact matches such as error codes and product SKUs, while embeddings shine for semantic similarity. Hybrid systems win.

manticoresearch · vector search · lexical search · tf idf · bm25 · hybrid search

For error codes like ERR_404 and product SKUs, lexical More Like This still beats vector search because exact matches matter more than semantic similarity. That's the core argument from Manticore Search's Sergey Nikolaev, who traces the evolution from TF-IDF-based MLT to embedding-driven approaches and lands on a hybrid conclusion.

What Classic MLT Does That Embeddings Can't

Lexical MLT answers a simple question: which documents share the same important words? It uses the inverted index already in your search engine, running TF-IDF or BM25 against tokenized fields. Parameters like min_term_freq, min_doc_freq, and max_query_terms control which terms are selected from the source document. There's no extra infrastructure.
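The term-selection step can be sketched in a few lines of Python. This is a minimal illustration, not Manticore's implementation: the function name `select_mlt_terms`, the toy corpus, and the particular TF-IDF formula are assumptions. Only the three parameter names come from the text above.

```python
import math
from collections import Counter


def select_mlt_terms(source_tokens, corpus, min_term_freq=1,
                     min_doc_freq=1, max_query_terms=5):
    """Pick the highest-scoring TF-IDF terms of a source document.

    Hypothetical helper: real engines read these statistics from the
    inverted index instead of recounting the corpus on every call.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scored = []
    for term, tf in Counter(source_tokens).items():
        df = doc_freq.get(term, 0)
        # Skip terms that are too rare in the source or in the corpus.
        if tf < min_term_freq or df < max(min_doc_freq, 1):
            continue
        idf = math.log(1 + n_docs / df)  # one common smoothed IDF variant
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


# Toy, pre-tokenized corpus (made up for illustration).
corpus = [
    ["payment", "failed", "with", "err_404", "on", "checkout"],
    ["checkout", "page", "returns", "err_404", "after", "deploy"],
    ["payment", "page", "is", "slow", "on", "checkout"],
    ["user", "cannot", "reset", "password"],
]

terms = select_mlt_terms(corpus[0], corpus, min_doc_freq=2, max_query_terms=3)
print(terms)
```

With `min_doc_freq=2`, terms that occur only in the source document ("failed", "with") are dropped because they cannot match anything else, while `err_404` survives as a query term. The sketch also keeps "on", which is why real deployments usually pair these parameters with a stop-word list.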

This approach handles exact matches naturally. Two incident reports with the same stack trace or ERR_404 error code are directly linked. Vector search, by contrast, may return tickets describing similar but not identical problems — semantically close, wrong for triage. Lexical search also thrives on product SKUs, legal wording, function names, and near-duplicate detection. It's cheap, transparent, and deterministic.

Where Semantic Search Fills the Gap

If two documents say the same thing in different words — "memory leak" vs "unbounded heap growth" — lexical MLT sees different tokens and misses the connection. Embeddings solve this by representing each document as a dense vector; nearby vectors correspond to semantically similar content, regardless of phrasing.
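A small sketch makes the contrast concrete. The three-dimensional vectors below are invented by hand purely for illustration; a real system would obtain embeddings from a trained model, and the helper names `cosine` and `token_overlap` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_overlap(a, b):
    """Jaccard overlap of whitespace tokens, a crude lexical signal."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)


# Hand-made toy vectors (assumption): NOT the output of any real model.
toy_vectors = {
    "memory leak": [0.9, 0.1, 0.0],
    "unbounded heap growth": [0.8, 0.2, 0.1],
    "password reset email": [0.0, 0.1, 0.9],
}

lexical = token_overlap("memory leak", "unbounded heap growth")
related = cosine(toy_vectors["memory leak"],
                 toy_vectors["unbounded heap growth"])
unrelated = cosine(toy_vectors["memory leak"],
                   toy_vectors["password reset email"])

print(f"token overlap: {lexical:.2f}")
print(f"cosine, paraphrase: {related:.2f}")
print(f"cosine, unrelated: {unrelated:.2f}")
```

The two phrasings share no tokens, so a lexical signal scores them at zero, while vectors that place them close together yield a high cosine similarity.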

That opens up use cases beyond text: products, images, code fragments, user events, and RAG context retrieval. But the trade-off is clear: vector search sacrifices precision on exact identifiers to gain recall on paraphrases. Manticore's own comparison of lexical and vector search confirms that the former wins on strict matches, the latter on coverage of semantic relationships.

Hybrid Search as the Production Default

Production systems shouldn't choose; they should combine. Hybrid search runs full-text and vector retrieval in one query, filters narrow the candidate set, and a reranking step refines the final order. Exact error codes and SKUs come from the lexical branch; semantically related results fill in where wording differs.
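One common way to merge the two result lists is reciprocal rank fusion (RRF). The summary does not say which fusion or reranking method Manticore uses, so treat the sketch below as one plausible option under that assumption. The ticket IDs are made up, and `k=60` is a conventional default rather than a value from the source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both branches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from each branch for the same source ticket.
lexical_hits = ["ticket_17", "ticket_42"]              # share the exact error code
vector_hits = ["ticket_42", "ticket_08", "ticket_17"]  # semantically close

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)
```

RRF only needs rank positions, which sidesteps the problem that BM25 scores and vector distances live on different scales. A filter step would simply drop disallowed IDs from each list before fusion.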

This isn't a theoretical exercise. Manticore points to real scenarios: support-ticket matching, legal search, patent research, and knowledge bases all benefit from both modes. Lexical search doesn't disappear — it handles the precise identifiers that embeddings ignore. Any production system that drops lexical MLT for exact-match use cases is leaving precision on the table. Manticore's blueprint for hybrid search is the practical way forward.


Source: The Evolution of 'More Like This'
Domain: manticoresearch.com
