Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unable to read text data file using TextLoader from langchain.document_loaders library because of encoding issue

My end goal is to read the contents of a file and create a vectorstore of my data which I can query later.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader


loader = TextLoader("elon_musk.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

It looks like there is some issue with my data file and because of this, it is not able to read the contents of my file. Is it possible to load my file in utf-8 format? My assumption is with utf-8 encoding I should not face this issue.

Following is the error I am getting in my code:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:41, in TextLoader.load(self)
     40     with open(self.file_path, encoding=self.encoding) as f:
---> 41         text = f.read()
     42 except UnicodeDecodeError as e:

File ~\anaconda3\envs\langchain-test\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 8
      4 from langchain.document_loaders import TextLoader
      7 loader = TextLoader("elon_musk.txt")
----> 8 documents = loader.load()
      9 text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
     10 docs = text_splitter.split_documents(documents)

File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:54, in TextLoader.load(self)
     52                 continue
     53     else:
---> 54         raise RuntimeError(f"Error loading {self.file_path}") from e
     55 except Exception as e:
     56     raise RuntimeError(f"Error loading {self.file_path}") from e

RuntimeError: Error loading elon_musk.txt

How to resolve this?

like image 833
Rito Avatar asked Oct 18 '25 23:10

Rito


1 Answers

I just had this same problem. Code worked fine in Colab (Unix), but not in VS code. Tried Marc's suggestions to no avail. Checked that VSCode preference was UTF-8 for encoding. Verified that the files were exactly the same on both machines. Even ensured they had the same python version!

Here is what worked for me. When using TextLoader, do it like this:

loader = TextLoader("elon_musk.txt", encoding = 'UTF-8')

When using DirectoryLoader, instead of this:

loader = DirectoryLoader("./new_articles/", glob="./*.txt",   loader_cls=TextLoader)

Do This:

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
like image 140
Sid Johnson Avatar answered Oct 21 '25 07:10

Sid Johnson



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!