What does langchain CharacterTextSplitter's chunk_size param even do?

Question

My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

c_splitter.split_text(text)

prints: ['abcdefghijklmnopqrstuvwxyz'], i.e. one single chunk that is much larger than chunk_size=6.

So I understand that it didn't split the text into chunks because it never encountered the separator. But then what is chunk_size even doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter here but did not see an answer to this question. And I asked the "mendable" chat-with-langchain-docs search functionality, but got the answer "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text."...which is not true, as the code sample above shows.

DMcC · Accepted Answer

CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is the maximum size of a chunk when splitting is possible. If a string starts with n characters, has a separator, and has m more characters before the next separator, then the first chunk size will be n if chunk_size < n + m + len(separator).

Your example string has no matching separators so there's nothing to split on.

Basically, it attempts to make chunks that are <= chunk_size, but will still produce chunks > chunk_size if the minimum size chunks that can be created are > chunk_size.
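To make that concrete, here is a minimal pure-Python sketch of the "split on the separator, then greedily re-join pieces up to chunk_size" idea. It is a simplification for illustration, not langchain's actual code: the function name split_on_separator is made up, and chunk_overlap is ignored.

```python
def split_on_separator(text, separator, chunk_size):
    """Illustrative only: split on `separator`, then greedily re-join
    neighbouring pieces while the joined length stays <= chunk_size.
    Overlap handling is left out to keep the sketch short."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        # Cost of appending this piece to the chunk being built.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit: close the current chunk first.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        # A piece is never cut, so it is kept even if it alone is too long.
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


# No separator in the text: one oversized chunk, as in the question.
print(split_on_separator("abcdefghijklmnopqrstuvwxyz", "\n\n", 6))

# n = 3, m = 4, len(separator) = 2: 6 < 3 + 4 + 2, so the first chunk is "abc".
print(split_on_separator("abc\n\ndefg", "\n\n", 6))

# "ab" and "cd" fit together in 6 characters; the last piece cannot be cut.
print(split_on_separator("ab\n\ncd\n\nefghijklmnop", "\n\n", 6))
```

In this sketch chunk_size only decides when neighbouring pieces stop being merged. It never cuts inside a piece, which is why a chunk can end up longer than chunk_size.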

Yilmaz · Answer

CharacterTextSplitter behaves differently from what you expected.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=6,
)

It first takes up to 6 characters and then starts the next chunk at the closest separator, not at the 7th character.

As stated in the docs, the default separator is "\n\n":

This is the simplest method. This splits based on characters (by default "\n\n") and measure chunk length by number of characters.

You can test the behaviour with some sample code. First create a test.txt file with this content:

1.Respect for Others: Treat others with kindness.
2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.
3.Fairness and Justice: Treat people equitably.
4.Respect for Property: Respect public and private property.
5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.

Then write this code:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# It aims for chunks of up to 20 characters, but only cuts at the closest separator
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=20,
    chunk_overlap=0
)

loader = TextLoader("test.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter
)

# Print each chunk followed by blank lines so the boundaries are visible
for doc in docs:
    print(doc.page_content)
    print("\n")

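With this input, each line of test.txt comes out as its own chunk, even though every line is longer than 20 characters, because the splitter only cuts at the separator.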
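As a quick check that needs no langchain install, the sketch below measures the sample lines directly. The variable names are illustrative, and the text is just the test.txt content from above inlined as a string.

```python
# The test.txt content from above, inlined so the snippet is self-contained.
sample = "\n".join([
    "1.Respect for Others: Treat others with kindness.",
    "2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.",
    "3.Fairness and Justice: Treat people equitably.",
    "4.Respect for Property: Respect public and private property.",
    "5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.",
])

chunk_size = 20
lines = sample.split("\n")  # the only places a "\n" splitter may cut

for line in lines:
    # Length is measured in characters, the same unit as chunk_size here.
    print(len(line), len(line) > chunk_size)
```

Every line is longer than 20 characters, so a splitter that cuts only at "\n" cannot produce a chunk within the limit for this file.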

Tags: python, text, machine-learning, nlp, langchain

Max Power
