My end goal is to read the contents of a file and create a vectorstore of my data which I can query later.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
loader = TextLoader("elon_musk.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
It looks like something in my data file prevents its contents from being read. Is it possible to load the file as UTF-8? My assumption is that with UTF-8 encoding I should not hit this issue.
Here is the error I am getting:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:41, in TextLoader.load(self)
40 with open(self.file_path, encoding=self.encoding) as f:
---> 41 text = f.read()
42 except UnicodeDecodeError as e:
File ~\anaconda3\envs\langchain-test\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[1], line 8
4 from langchain.document_loaders import TextLoader
7 loader = TextLoader("elon_musk.txt")
----> 8 documents = loader.load()
9 text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
10 docs = text_splitter.split_documents(documents)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:54, in TextLoader.load(self)
52 continue
53 else:
---> 54 raise RuntimeError(f"Error loading {self.file_path}") from e
55 except Exception as e:
56 raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading elon_musk.txt
How can I resolve this?
I just had this same problem. The code worked fine in Colab (Unix), but not in VS Code. I tried Marc's suggestions to no avail, checked that the VS Code encoding preference was UTF-8, and verified that the files were exactly the same on both machines. I even made sure both had the same Python version!
Here is what worked for me. When using TextLoader, do it like this:
loader = TextLoader("elon_musk.txt", encoding="utf-8")
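For anyone wondering why the explicit encoding matters: the traceback in the question shows the file being decoded with cp1252, which is the default on many Windows setups when no encoding is given. Below is a minimal standard-library sketch of that failure, assuming the file is really UTF-8 and contains a curly closing quote. The sample text is made up for illustration; it is not the contents of elon_musk.txt.

```python
# Hypothetical sample text. The closing curly quote (U+201D) is stored in
# UTF-8 as the bytes E2 80 9D, and byte 0x9d has no mapping in cp1252.
sample = "He called it \u201ca big step\u201d."
data = sample.encode("utf-8")

# Decoding the UTF-8 bytes as cp1252 reproduces the error from the question.
try:
    data.decode("cp1252")
    cp1252_failed = False
    bad_byte = None
except UnicodeDecodeError as exc:
    cp1252_failed = True
    bad_byte = exc.object[exc.start]  # the byte the codec could not map

# Decoding the same bytes as UTF-8 gives back the original text.
text = data.decode("utf-8")

print(cp1252_failed, hex(bad_byte) if bad_byte is not None else None)
```

If your own file fails even with UTF-8, it was probably saved in some other encoding, and this fix alone will not be enough.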
When using DirectoryLoader, instead of this:
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader)
do this:
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
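If you want to see which encoding a given file actually needs before handing it to a loader, here is a rough sketch of the "try several encodings" idea using only the standard library. This is not how LangChain implements autodetect_encoding (as far as I know it relies on a detection package to guess candidates); read_text_with_fallback and its candidate list are hypothetical names I made up for illustration.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption, so adjust it
    to the encodings you expect in your own data.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

Calling it as `text, used = read_text_with_fallback("elon_musk.txt")` tells you which encoding succeeded, and you can then pass that value to TextLoader. Note that latin-1 maps every byte to some character, so it never fails; keep it last, and treat a latin-1 result with suspicion since the text may be decoded incorrectly.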