
Slopo Uses Embedding Models to Hunt Non-Exact Code Duplicates Across Modules

Slopo's embedding-based CLI catches similar-but-not-identical code scattered across a codebase, ranking clusters by similarity and distance for AI-assisted refactoring.

rafal qa, slopo, embedding models, code duplication, developer tools, cli

Slopo's embedding-based approach catches non-exact duplicates scattered across modules, the kind that slip past grep and frustrate human review. It targets the most harmful copy-paste: snippets written similarly but sitting far apart in the codebase, often in different directories or separated by hundreds of lines. Exact duplicates are easy to find; non-exact ones go unnoticed and rot your architecture.

How Slopo Turns Embeddings Into Duplicate Clusters

Slopo works in three stages. First, slopo index parses your source files (Python, TypeScript, JavaScript, Java, Kotlin, C#, Go, Rust) into code units: function bodies, class definitions, blocks. Then slopo embed sends each unit to a LiteLLM-compatible embedding model provider such as Voyage AI. You set the embedding dimensions (Voyage works fine at 512d) and a similarity threshold. Finally, slopo analyze finds pairs with close embeddings, ranks them by cosine similarity, and boosts the rank of pairs that sit farther apart in the codebase. The output is a cluster report (index.md with cluster details per file) ranked by duplicate likelihood.
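The analyze stage can be pictured with a minimal Python sketch. This is not Slopo's implementation: the CodeUnit type, the distance_boost heuristic, the boost_weight parameter, and the default threshold are all assumptions made for illustration. Only the general idea (cosine similarity first, then a boost for pairs that sit far apart) comes from the description above.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class CodeUnit:
    """Hypothetical stand-in for one indexed code unit."""
    path: str               # file the unit came from
    line: int               # starting line of the unit
    embedding: list[float]  # vector returned by the embedding model


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def distance_boost(u, v):
    """Assumed heuristic: units in different files count as maximally far
    apart; within one file the boost grows with the line gap, capped at 1."""
    if u.path != v.path:
        return 1.0
    return min(abs(u.line - v.line) / 1000.0, 1.0)


def rank_pairs(units, threshold=0.85, boost_weight=0.1):
    """Keep pairs at or above the similarity threshold, then sort by
    similarity plus a weighted distance boost, highest score first."""
    ranked = []
    for u, v in combinations(units, 2):
        sim = cosine_similarity(u.embedding, v.embedding)
        if sim < threshold:
            continue
        ranked.append((sim + boost_weight * distance_boost(u, v), u, v))
    ranked.sort(key=lambda entry: entry[0], reverse=True)
    return ranked
```

With this scoring, two equally similar pairs are ordered so that the pair spanning different files comes before the pair sitting a few lines apart in one file.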

What makes this different from jscpd or pmd-cpd? Those tools look for exact or near-exact token matches within a few lines. Slopo's embedding distance can surface two functions that share the same logic but use different variable names, different whitespace, or even different language syntax, as long as the semantic intent is close enough. The trade-off is explicit: code that looks similar but does completely different things won't match, and not every close pair is a true duplicate. That's why the report is meant to be reviewed by an AI agent, not merged blindly.
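To make "non-exact duplicate" concrete, here is a small hypothetical pair in Python. The two snippets and the exact_shared_lines helper are invented for illustration; the helper is a deliberately naive exact-line comparison, not the algorithm used by jscpd or pmd-cpd. The point is only that the functions compute the same thing while sharing no identical line.

```python
SNIPPET_A = '''
def total_price(items):
    total = 0
    for item in items:
        total += item["price"] * item["qty"]
    return total
'''

SNIPPET_B = '''
def order_sum(order_lines):
    result = 0
    for line in order_lines:
        result += line["price"] * line["qty"]
    return result
'''


def exact_shared_lines(a, b):
    """Naive baseline: non-empty, whitespace-stripped lines present in both."""
    lines_a = {ln.strip() for ln in a.splitlines() if ln.strip()}
    lines_b = {ln.strip() for ln in b.splitlines() if ln.strip()}
    return lines_a & lines_b


def load(snippet, name):
    """Execute a snippet and return the function it defines."""
    namespace = {}
    exec(snippet, namespace)
    return namespace[name]


cart = [{"price": 2, "qty": 3}, {"price": 5, "qty": 1}]
same_result = load(SNIPPET_A, "total_price")(cart) == load(SNIPPET_B, "order_sum")(cart)
no_shared_lines = exact_shared_lines(SNIPPET_A, SNIPPET_B) == set()
```

Both flags end up true: the logic is identical, yet a line-for-line comparison has nothing to report. Pairs like this are the ones an embedding comparison is meant to surface.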

Real Workflow: Incremental Indexing and Ignoring False Positives

Slopo fits into iterative development. A full slopo index run is needed only once; subsequent runs re-index incrementally, updating only the files that changed. After analysis, ask your AI coding agent to filter out clusters that aren't real duplicates. The agent writes the hashes of discarded clusters into slopo.ignore.txt. Re-run analysis and the ignored clusters vanish. Commit the ignore file and the config (without the API key) to share the baseline across the team. New or modified duplicates reappear automatically.
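The ignore mechanism can be sketched in a few lines of Python. Everything here is an assumption for illustration: the article does not say how Slopo computes cluster hashes or how slopo.ignore.txt is laid out, so this sketch hashes the source text of a cluster's members and assumes one hash per line. Hashing content is what makes a modified duplicate reappear, because its hash no longer matches the ignored one.

```python
import hashlib


def cluster_hash(member_sources):
    """Assumed scheme: a stable hash over the sorted source texts of the
    cluster's members, so member order does not matter."""
    joined = "\n---\n".join(sorted(member_sources))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def parse_ignore(text):
    """Assumed format: one cluster hash per line, blank lines skipped."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_clusters(clusters, ignored):
    """Drop every cluster whose hash is listed in the ignore set."""
    return [c for c in clusters if cluster_hash(c) not in ignored]
```

A cluster that was ignored stays hidden only while its members are unchanged; edit one member and the cluster gets a new hash and shows up in the next report.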

Configuration is a single YAML file created by slopo init. Critical parameters (source_dir, embedding_model, embedding_dimensions, body_node_count_threshold) are locked after first indexing, so choose wisely. Everything else, including the similarity threshold and rerank threshold, is adjustable and defaults to sensible values. The API key goes into the SLOPO_EMBEDDING_API_KEY environment variable or a .env file, never into the config.
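A rough Python sketch of what "locked after first indexing" implies in practice. The four locked key names and the SLOPO_EMBEDDING_API_KEY variable come from the text above; the helper functions, and the idea of comparing the current config against the one recorded at index time, are assumptions about how such a check could work, not a description of Slopo's code.

```python
import os

# Key names taken from the article; they cannot change after the first index.
LOCKED_KEYS = (
    "source_dir",
    "embedding_model",
    "embedding_dimensions",
    "body_node_count_threshold",
)


def changed_locked_keys(indexed_config, current_config):
    """Return the locked keys whose values differ between the config used
    for the first index and the config on disk now."""
    return [
        key for key in LOCKED_KEYS
        if indexed_config.get(key) != current_config.get(key)
    ]


def read_api_key(environ=os.environ):
    """Read the API key from the environment instead of the config file."""
    key = environ.get("SLOPO_EMBEDDING_API_KEY")
    if not key:
        raise RuntimeError("SLOPO_EMBEDDING_API_KEY is not set")
    return key
```

Changing an adjustable setting leaves the returned list empty, while changing the embedding dimensions would be reported, since vectors of a different size could not be compared with the ones already stored.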

What This Enables for Codebase Hygiene

Embedding-based duplication detection is not new in research, but Slopo packages it as a pragmatic CLI that generates actionable output for existing AI coding agents. The key insight is that you don't need a dedicated duplicate-detection model; a general code embedding model like Voyage Code works, and the two-stage filtering (cosine similarity plus distance boost) keeps recall high without drowning you in false positives. The example report from Slopo's own source tree showed that its language parsers suffer from messy duplication, a dogfooding result that earns trust. With Slopo, you can commit an ignore list and refactor incrementally, turning a once-manual review process into a repeatable CI step.


Source: Show HN: CLI tool for detecting non-exact code duplication with embedding models
Domain: github.com