Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG


bhavnicksm

I built Chonkie because I was tired of rewriting chunking code for RAG applications. Existing libraries were either too bloated (80MB+) or too basic, with no middle ground.
Core features:
  • 21MB default install vs 80-171MB alternatives
  • 33x faster token chunking than popular alternatives
  • Supports multiple chunking strategies: token, word, sentence, and semantic (usage sketch after this list)
  • Works with all major tokenizers (transformers, tokenizers, tiktoken)
  • Zero external dependencies for basic functionality
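To make the install-and-chunk flow concrete, here is a minimal usage sketch. The class and parameter names (TokenChunker, chunk_size, chunk_overlap, .chunk) are assumptions based on the description above, not a confirmed API; the repo's README is the source of truth.

```python
# Hypothetical usage sketch -- the names below are assumptions, not confirmed API.
from chonkie import TokenChunker  # token strategy; word, sentence, and semantic also exist

# Fixed-size token windows with a small overlap between neighbouring chunks.
chunker = TokenChunker(chunk_size=512, chunk_overlap=64)

chunks = chunker.chunk("A long document headed for a RAG pipeline ...")

for chunk in chunks:
    # Each chunk is assumed to carry its text plus token-count metadata.
    print(chunk.token_count, chunk.text[:60])
```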
Technical optimizations:
  • Uses tiktoken with multi-threading for faster tokenization (batch example after this list)
  • Implements aggressive caching and precomputation
  • Running mean pooling for efficient semantic chunking (sketched after this list)
  • Modular dependency system (install only what you need)
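The multi-threaded tokenization point maps onto tiktoken's batch API. A minimal sketch of that idea, illustrative only rather than Chonkie's internal code; the texts are placeholders:

```python
# Batch tokenization across worker threads with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
texts = ["first document ...", "second document ...", "third document ..."]

# Encodes the whole batch in parallel worker threads instead of a Python loop.
token_lists = enc.encode_ordinary_batch(texts, num_threads=8)
print([len(tokens) for tokens in token_lists])
```

Running mean pooling is what keeps the semantic strategy cheap: rather than re-pooling every embedding in a growing chunk, the chunk's mean vector is updated incrementally as sentences are appended. The sketch below illustrates that idea and is not Chonkie's actual implementation; embed_fn stands in for whatever sentence encoder you plug in, and the 0.7 threshold is an arbitrary example value.

```python
# Illustrative semantic chunking via a running (incremental) mean embedding.
from typing import Callable
import numpy as np

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def semantic_chunks(
    sentences: list[str],
    embed_fn: Callable[[str], np.ndarray],  # any sentence encoder you supply
    threshold: float = 0.7,                 # example similarity cut-off
) -> list[list[str]]:
    """Group consecutive sentences while they stay close to the chunk's running mean embedding."""
    if not sentences:
        return []
    chunks: list[list[str]] = []
    current = [sentences[0]]
    mean = embed_fn(sentences[0]).astype(np.float64)  # running mean of the current chunk
    n = 1
    for sent in sentences[1:]:
        vec = embed_fn(sent).astype(np.float64)
        if _cosine(vec, mean) >= threshold:
            # Incremental update, O(d) per sentence -- no re-pooling of the
            # whole chunk: mean_new = mean + (vec - mean) / (n + 1)
            current.append(sent)
            n += 1
            mean = mean + (vec - mean) / n
        else:
            # Similarity dropped below the threshold: close this chunk, start a new one.
            chunks.append(current)
            current, mean, n = [sent], vec, 1
    chunks.append(current)
    return chunks
```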
Benchmarks and code: https://github.com/bhavnicksm/chonkie (🦛 CHONK your texts with Chonkie ✨ - the no-nonsense RAG chunking library)
Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?



Comments: Hacker News discussion thread

Points: 60

# Comments: 19
