Securely indexing large codebases
Article Summary
Indexing large codebases efficiently and securely is a major challenge. Cursor addresses this with a multi-layered approach: Merkle trees track granular file changes, simhashes find a teammate's matching index to reuse, and cryptographic content proofs ensure that data is never leaked to a client that doesn't already possess the underlying files. This combination reduces indexing time from hours to seconds for new users while maintaining strict security boundaries.
Visual Story
Cover
By securely reusing a teammate's existing index, Cursor cuts time-to-first-query from hours to seconds on the largest repos.
Page 1: The Mountain of Code
When a codebase is new to Cursor, the indexing pipeline normally requires uploading every file. Computing embeddings for every chunk is the expensive step, and for the largest repositories it can take hours.
Page 2: The Merkle Tree
Cursor builds its first view of a codebase using a Merkle tree. This allows it to detect exactly which files and directories have changed by comparing cryptographic hashes, syncing only the specific branches where hashes differ.
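The comparison described above can be sketched in a few lines. This is a minimal illustration, not Cursor's implementation: the function names `build_tree` and `changed_paths`, the nested-dict layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def _h(data: bytes) -> str:
    # SHA-256 is an assumption; the source only says "cryptographic hashes".
    return hashlib.sha256(data).hexdigest()


def _seal(node: dict) -> str:
    # A directory's hash covers the names and hashes of its children,
    # so any change below it changes the hash of every ancestor.
    if "children" in node:
        joined = "".join(
            f"{name}:{_seal(child)};"
            for name, child in sorted(node["children"].items())
        )
        node["hash"] = _h(joined.encode())
    return node["hash"]


def build_tree(files: dict[str, bytes]) -> dict:
    """Build a Merkle tree from a mapping like {"src/a.py": b"..."}."""
    root = {"children": {}}
    for path, content in files.items():
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            node = node["children"].setdefault(part, {"children": {}})
        node["children"][parts[-1]] = {"hash": _h(content)}
    _seal(root)
    return root


def changed_paths(old: dict, new: dict, prefix: str = "") -> list[str]:
    """Return paths that differ, descending only where hashes differ."""
    if old["hash"] == new["hash"]:
        return []  # identical subtree: nothing below needs syncing
    if "children" not in old or "children" not in new:
        return [prefix]  # a file changed (or a file/directory swap)
    out = []
    for name in sorted(set(old["children"]) | set(new["children"])):
        path = f"{prefix}/{name}" if prefix else name
        o, n = old["children"].get(name), new["children"].get(name)
        if o is None or n is None:
            out.append(path)  # added or removed file or directory
        else:
            out.extend(changed_paths(o, n, path))
    return out
```

Calling `changed_paths` on two trees that differ in a single file visits only the directories on the path to that file; every sibling subtree is skipped after one hash comparison.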
Page 3: The Shortcut (Simhash)
When a new user joins, the client computes a similarity hash (simhash) of their codebase. The server uses this to find an existing index from a teammate that matches, allowing the new user to reuse that index instead of building one from scratch.
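To make the matching step concrete, here is a hedged sketch of the classic simhash construction plus a nearest-signature lookup. Several details are assumptions rather than facts from the article: the features are taken to be per-file content hashes, the signature is 64 bits wide, and `best_match` with its `max_distance` threshold is a hypothetical helper standing in for whatever lookup the server actually performs.

```python
import hashlib
from typing import Optional

BITS = 64  # signature width; an assumption for this sketch


def simhash(features: list[str]) -> int:
    """Classic simhash: each feature votes +1 or -1 on every bit position."""
    counts = [0] * BITS
    for feat in features:
        h = int.from_bytes(hashlib.sha256(feat.encode()).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when the majority of features voted for it.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def best_match(
    query: int, candidates: dict[str, int], max_distance: int = 8
) -> Optional[str]:
    """Return the id of the closest stored signature, or None if none is close.

    `candidates` maps a hypothetical index id to its stored signature.
    """
    best_id, best_dist = None, max_distance + 1
    for index_id, signature in candidates.items():
        dist = hamming(query, signature)
        if dist < best_dist:
            best_id, best_dist = index_id, dist
    return best_id
```

The useful property is that similar inputs produce signatures that differ in few bits, so a new user whose checkout differs from a teammate's by a handful of files should still land close to that teammate's signature, while an unrelated repository should differ in roughly half the bits.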
Page 4: Content Proofs
To preserve access control, the server requires content proofs. It checks the client's Merkle tree hashes against the hashes stored with the index. If the client cannot prove it holds a file's content by presenting the matching hash, the server drops that result, so a reused index never leaks code the client does not already have.
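The filtering rule can be illustrated with a small server-side sketch. Everything named here is hypothetical: `IndexedChunk`, `content_hash`, and `filter_results` are illustrative names, SHA-256 is an assumed hash, and a production system would need more than a plain set lookup to count as a cryptographic proof.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    path: str
    file_hash: str  # hash of the file this chunk was embedded from
    score: float    # similarity score from the search


def content_hash(content: bytes) -> str:
    # SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(content).hexdigest()


def filter_results(
    results: list[IndexedChunk], client_hashes: set[str]
) -> list[IndexedChunk]:
    """Keep only results whose source file hash the client also presented."""
    return [r for r in results if r.file_hash in client_hashes]


# A teammate's index contains one shared file and one the client lacks.
index_hits = [
    IndexedChunk("a.py", content_hash(b"shared code"), 0.9),
    IndexedChunk("secret.py", content_hash(b"teammate only"), 0.8),
]
# Hashes taken from the leaves of the client's own Merkle tree.
client_hashes = {content_hash(b"shared code"), content_hash(b"local edit")}

visible = filter_results(index_hits, client_hashes)
```

Here `visible` holds only the `a.py` hit. The client's locally edited file simply has no counterpart in the copied index until it is indexed, and the teammate-only file is never returned.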
Page 5: Fast & Secure
This approach allows the client to query immediately and see results only for code it shares with the copied index. It dramatically reduces time-to-first-query from hours to seconds while maintaining strict access control.
Generated by Michi Manga