Document Chunking for RAG systems and, by extension, Generative AI applications

  • Writer: Oktay Sahinoglu

The maximum sequence length of Language Models, while not a hard-coded limitation, is a parameter that significantly impacts their performance. Consequently, document chunking has become one of the critical operations directly affecting the performance of RAG systems and, by extension, Generative AI applications.


The chunking approaches we commonly see in practical applications—those that simply split at maximum sequence length—often produce suboptimal results. Why? Because the token where we reach maximum length might land in the middle of a sentence or even a word. This breaks semantic coherence and inevitably degrades output quality. Additionally, with large datasets, this process can be time-consuming. Therefore, achieving both fast and effective chunking requires attention to important nuances.


Let’s look at a well-designed implementation example below, and then walk through its main points together.



Key takeaways

Split content into sentences: To preserve semantic coherence, always split your content into sentences as the first step. Since this example is prepared for Turkish content, I used Zemberek as the language tool for sentence extraction. You can use the appropriate tools for whatever language you're working with.
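
As a rough sketch, the sentence-extraction step could look like the snippet below. It assumes the zemberek-python port of Zemberek (the original Zemberek is a Java library); for other languages, an NLTK or spaCy sentence splitter fills the same role.

```python
# Minimal sketch of the sentence-splitting step, assuming the zemberek-python
# port of Zemberek. For non-Turkish content, swap in NLTK, spaCy, etc.
from zemberek import TurkishSentenceExtractor

extractor = TurkishSentenceExtractor()

def split_into_sentences(text: str) -> list[str]:
    # Splitting on sentence boundaries keeps chunk edges semantically clean.
    return extractor.from_paragraph(text)
```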


Set a maximum sentence length (Optional): Content may contain unnecessarily long sentences that even language models would struggle to understand. If you want to filter these out, you can specify a maximum length for sentences (max_sentence_tokens).
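
A hedged sketch of that filter, assuming a Hugging Face tokenizer; the checkpoint and the max_sentence_tokens threshold below are illustrative assumptions, not values from the original implementation.

```python
# Optional filter for unusually long sentences. The tokenizer checkpoint and
# the max_sentence_tokens threshold are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
max_sentence_tokens = 256  # tune for your model and data

def drop_overlong_sentences(sentences: list[str]) -> list[str]:
    return [
        s for s in sentences
        if len(tokenizer(s, add_special_tokens=False)["input_ids"]) <= max_sentence_tokens
    ]
```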


Choose your splitting strategy: When splitting content, you need to decide whether to fill each chunk up to the model's maximum length and accept whatever remains at the end, or to aim for chunks of similar lengths. I used the split_with_max parameter to control this choice; both options are illustrated in the packing sketch after the next point.


Build chunks sentence by sentence: Construct your chunks by adding sentences until you reach your target length (max_token_size or num_tokens_per_part). When adding a sentence would exceed the limit, leave that sentence for the next chunk.
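
A possible reconstruction of this packing loop is sketched below. The parameter names (max_token_size, num_tokens_per_part, split_with_max) mirror those mentioned in the post, but the code itself is an illustration rather than the original implementation; it also covers the strategy choice from the previous point.

```python
import math

# Illustrative packing loop: parameter names follow the post, the logic is a
# reconstruction. token_counts holds the tokenizer's token count per sentence.
def build_chunks(sentences: list[str], token_counts: list[int],
                 max_token_size: int = 512, split_with_max: bool = True) -> list[str]:
    total_tokens = sum(token_counts)
    if split_with_max:
        # Fill each chunk up to the model's maximum; the last chunk gets the rest.
        target = max_token_size
    else:
        # Aim for similar-sized chunks: spread the tokens evenly over the
        # minimum number of parts that still fit under max_token_size.
        num_parts = max(1, math.ceil(total_tokens / max_token_size))
        target = math.ceil(total_tokens / num_parts)  # num_tokens_per_part

    chunks, current, current_len = [], [], 0
    for sentence, n_tokens in zip(sentences, token_counts):
        if current and current_len + n_tokens > target:
            # This sentence would exceed the target, so close the current
            # chunk and start the next one with this sentence.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```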


Always use the dataset.map method: If you want to perform chunking at high speed by running your tokenizer efficiently with multi-processing, be sure to use dataset.map. To do this, convert your chunking logic into a function that can be passed to the map method.
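
Here is a sketch of how the pieces above might be wrapped for dataset.map. The "text" column name, the data file, and the num_proc value are assumptions, and chunk_batch reuses the helper functions from the earlier snippets.

```python
from datasets import load_dataset

# Batched mapping lets one input document expand into several chunk rows,
# as long as the original columns are removed.
def chunk_batch(batch):
    all_chunks = []
    for text in batch["text"]:  # "text" column name is an assumption
        sentences = drop_overlong_sentences(split_into_sentences(text))
        counts = [len(tokenizer(s, add_special_tokens=False)["input_ids"])
                  for s in sentences]
        all_chunks.extend(build_chunks(sentences, counts, max_token_size=512))
    return {"chunk": all_chunks}

dataset = load_dataset("csv", data_files="docs.csv", split="train")
chunked = dataset.map(chunk_batch, batched=True, num_proc=4,
                      remove_columns=dataset.column_names)
```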


Use the streaming feature for large datasets: If you're working with large datasets, read your data using the load_dataset method from the datasets library with streaming set to True. This will significantly optimize your memory usage.
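
The streaming variant could look roughly like this, assuming the same "text" column and chunk_batch function as above; rows are fetched lazily, so memory usage stays flat regardless of corpus size. Note that multiprocessing via num_proc is not available on a streamed dataset.

```python
from datasets import load_dataset

# Streaming returns an IterableDataset: nothing is loaded until you iterate.
streamed = load_dataset("csv", data_files="docs.csv",
                        split="train", streaming=True)
chunked_stream = streamed.map(chunk_batch, batched=True,
                              remove_columns=["text"])

for example in chunked_stream:
    # Chunks arrive one at a time; index them, write them out, etc.
    print(example["chunk"])
    break
```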


In this implementation, since I wanted to create chunks of similar length using the split_with_max parameter, I didn't use overlap. However, if you prefer, you can configure it so that a certain number of sentences at the end of one chunk are repeated as the initial sentences of the next chunk. Note that when using overlap, split_with_max can't be set to False: since we don't know in advance which parts will overlap, we have to calculate against the maximum length rather than splitting into similar lengths.
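
One simple way to add such an overlap, sketched under the assumption that chunks are kept as lists of sentences rather than joined strings, is a post-processing step like this:

```python
# Hypothetical overlap step: prepend the last N sentences of the previous
# chunk to each chunk. Because the prefix adds tokens, the packing target
# must be based on the maximum length, as discussed above.
def add_overlap(chunk_sentences: list[list[str]], overlap_sentences: int = 1) -> list[str]:
    overlapped = []
    for i, sents in enumerate(chunk_sentences):
        prefix = chunk_sentences[i - 1][-overlap_sentences:] if i > 0 else []
        overlapped.append(" ".join(prefix + sents))
    return overlapped
```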


Hope you find it useful. See you in the next post.

 
 
 
