
What does langchain CharacterTextSplitter's chunk_size param even do?

My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

c_splitter.split_text(text)

prints: ['abcdefghijklmnopqrstuvwxyz'], i.e. one single chunk that is much larger than chunk_size=6.

So I understand that it didn't split the text into chunks because it never encountered the separator. But then the question is: what is chunk_size even doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter here but did not see an answer to this question. And I asked the "mendable" chat-with-langchain-docs search functionality, but got the answer "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text."...which is not true, as the code sample above shows.

Max Power asked Dec 02 '25 20:12



2 Answers

CharacterTextSplitter only splits on the separator (which is '\n\n' by default). chunk_size is the maximum chunk size, and it is enforced only where splitting is possible. If a string starts with n characters, then has a separator, then has m more characters before the next separator, the first chunk will have size n if chunk_size < n + m + len(separator).

Your example string has no matching separators so there's nothing to split on.

Basically, it attempts to make chunks that are <= chunk_size, but will still produce chunks > chunk_size if the minimum size chunks that can be created are > chunk_size.
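A minimal pure-Python sketch of that behaviour may help. It is a simplified model, not langchain's actual implementation: split_on_separator is a hypothetical helper that splits on the separator and greedily merges neighbouring pieces while the result still fits in chunk_size, and it ignores chunk_overlap entirely.

```python
def split_on_separator(text, separator="\n\n", chunk_size=6):
    """Simplified model of separator-based splitting (hypothetical helper).

    Cuts only at `separator`, then merges neighbouring pieces as long as
    the merged chunk stays within `chunk_size`. Ignores chunk_overlap.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Merging would exceed chunk_size, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# No separator anywhere: nothing to cut on, so one oversized chunk.
no_separator = split_on_separator("abcdefghijklmnopqrstuvwxyz")
print(no_separator)    # ['abcdefghijklmnopqrstuvwxyz']

# Separators present: 'defghijkl' is still longer than chunk_size,
# because there is no separator inside it.
with_separator = split_on_separator("abc\n\ndefghijkl\n\nmn", chunk_size=6)
print(with_separator)  # ['abc', 'defghijkl', 'mn']
```

In this model chunk_size only decides whether two neighbouring pieces get merged; it never cuts inside a piece, which is why a piece longer than chunk_size comes out whole.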

DMcC answered Dec 04 '25 10:12



CharacterTextSplitter behaves differently from what you expected.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=6,
)

It counts off the first 6 characters, but the next chunk then starts at the closest separator, not at the 7th character.

As stated in the docs, the default separator is "\n\n":

This is the simplest method. This splits based on characters (by default "\n\n") and measure chunk length by number of characters.

You can test the behaviour with some sample code. First create a test.txt file with this content:

1.Respect for Others: Treat others with kindness.
2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.
3.Fairness and Justice: Treat people equitably.
4.Respect for Property: Respect public and private property.
5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.

Then write this code:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# chunk_size=20 is the target size, but cuts still happen only at the closest separator
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=20,
    chunk_overlap=0
)

loader = TextLoader("test.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter
)

for doc in docs:
    print(doc.page_content)
    print("\n")

This is what the output looks like:

[Image: screenshot of the printed chunks]
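As a rough check that needs no langchain install, the sketch below measures each line of test.txt against chunk_size=20. It assumes the splitter merges neighbouring pieces only when the merged text fits within chunk_size; under that assumption, lines that are already longer than chunk_size cannot be merged with anything and should each come out as their own chunk. The variable names are illustrative only.

```python
# The five lines of test.txt from the example above.
lines = [
    "1.Respect for Others: Treat others with kindness.",
    "2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.",
    "3.Fairness and Justice: Treat people equitably.",
    "4.Respect for Property: Respect public and private property.",
    "5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.",
]
chunk_size = 20

# Length of each piece between "\n" separators.
lengths = [len(line) for line in lines]
every_line_too_long = all(n > chunk_size for n in lengths)

for line, n in zip(lines, lengths):
    print(n, line[:20] + "...")
print("every line longer than chunk_size:", every_line_too_long)
```

Since every line exceeds 20 characters, chunk_size never gets a chance to group lines together here; raising it well above the line lengths would be the way to see merging happen.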

Yilmaz answered Dec 04 '25 10:12



