I made an app to fuzzy-deduplicate my Google Sheets and CRM records

- No manual configuration required
- Works out of the box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
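
For anyone who wants the shape of that pipeline without reading the repo, here is a rough, dependency-free sketch. It is not the code from the repo: character-trigram vectors stand in for the E5 embeddings, a brute-force pairwise loop stands in for the DuckDB vector search, and a string-ratio check (`placeholder_confirm`) stands in for the Claude comparison. All function names and thresholds here are made up for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def embed(text):
    """Stand-in for an E5 embedding: a sparse bag of character trigrams."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(records, threshold=0.6):
    """Brute-force stand-in for the vector similarity search (quadratic, demo only)."""
    vectors = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def placeholder_confirm(a, b):
    """Placeholder for the LLM last-mile check; a real version would call a model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.75


def dedupe(records, threshold=0.6, confirm=placeholder_confirm):
    """Group record indices whose candidate pairs pass the confirmation step."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in candidate_pairs(records, threshold):
        if confirm(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [
    "Acme Corporation, San Francisco",
    "ACME Corp., San Francisco",
    "Globex Inc, Springfield",
    "Initech LLC, Austin",
]
groups = dedupe(records)
print(groups)
```

The two-stage split is the point of the sketch: a cheap similarity pass narrows the field to a few candidate pairs, so the expensive comparison only runs on those. The pairwise loop here is quadratic and only suitable for a toy example; an indexed vector search would replace it in practice.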

Demo video: https://youtu.be/7mZ0kdwXBwM

GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Lmk any feedback on how to make this better!

  • OliverGilan 18 hours ago
    Curious how this scales. I just tried it with the test dataset, and it was probably the slickest deduplication experience I’ve had.
    • remolacha 18 hours ago
      Appreciate the kind words! Speed and cost both scale linearly with dataset size. We haven't yet optimized the prompts or the choice of model to minimize token usage, so I'd recommend emailing us for advice if you want to run this on a large dataset.