I'm working with langchain and ChromaDb using Python.
Now, I know how to use document loaders. For instance, the below loads a bunch of documents into ChromaDb:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()
But what if I wanted to add a single document at a time? More specifically, I want to check if a document exists before I add it. This is so I don't keep adding duplicates.
If a document does not exist, only then do I want to get embeddings and add it.
How do I do this using langchain? I think I mostly understand langchain but have no idea how to do seemingly basic tasks like this.
I think there are better ways to do that but here's what I found after reading the library:
If you look at the Chroma.from_documents() method, you'll see it takes an ids param.
def from_documents(
    cls: Type[Chroma],
    documents: List[Document],
    embedding: Optional[Embeddings] = None,
    ids: Optional[List[str]] = None,  # <--------------- here
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    client: Optional[chromadb.Client] = None,
    **kwargs: Any,
) -> Chroma:
Using this param you can set your own predefined ids for your documents. If you don't pass any ids, the library generates a fresh UUID for each text, so the same document gets a different id every time it is stored. See the excerpt below from the langchain library:
# TODO: Handle the case where the user doesn't provide ids on the Collection
if ids is None:
    ids = [str(uuid.uuid1()) for _ in texts]
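To illustrate why the default ids cannot be used to detect duplicates, here is a small sketch comparing them with an id derived from something stable about the document. The helper name url_to_id and the choice of SHA-256 are my own assumptions, not part of langchain; any deterministic hash of a unique key should work the same way.

```python
import hashlib
import uuid


def url_to_id(url: str) -> str:
    """Hypothetical helper: derive a stable id from a document's unique URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Default behaviour: a new uuid1 per call, so the id says nothing about the document.
random_a = str(uuid.uuid1())
random_b = str(uuid.uuid1())

# Deterministic ids: the same URL always maps to the same id across runs.
stable_a = url_to_id("https://example.com/page-1")
stable_b = url_to_id("https://example.com/page-1")
other = url_to_id("https://example.com/page-2")

same_document_same_id = stable_a == stable_b          # True
different_document_different_id = stable_a != other   # True
```

With ids like these, "does this document already exist?" becomes "is this id already in the collection?".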
So the workaround is to assign a unique, reproducible id to each document when you store it. In my case, each document had a unique URL, so I converted the URL to a hash and passed the hashes in the ids param. When you store documents again, check the store for each id, remove the documents that already exist from docs (from your sample code), and then call Chroma.from_documents() with the remaining documents. See the sample below, based on your sample code.
# step 1: generate some unique ids for your docs
# step 2: check your Chroma DB and remove duplicates
# step 3: store the docs without duplicates
# assuming your docs ids are in the ids list and your docs are in the docs list
db = Chroma.from_documents(docs, embeddings, ids=ids, persist_directory='db')
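Steps 2 and 3 could be sketched roughly as follows. This is not the library's own dedup mechanism, just one way to do it. The function name add_missing_documents is hypothetical, and it assumes a langchain version where the Chroma wrapper exposes get(ids=...) returning a dict with an "ids" key, and add_documents(..., ids=...). On older versions you may need to query the underlying collection through db._collection.get(ids=...) instead.

```python
from typing import Any, List, Sequence


def add_missing_documents(store: Any, docs: Sequence[Any], ids: Sequence[str]) -> List[str]:
    """Add only the docs whose ids are not yet in the store; return the ids added.

    `store` is assumed to behave like langchain's Chroma wrapper:
    get(ids=[...]) -> {"ids": [...]} and add_documents(docs, ids=[...]).
    """
    if len(docs) != len(ids):
        raise ValueError("docs and ids must have the same length")
    if not ids:
        return []

    # Query with each id only once, in case the batch itself repeats an id.
    unique_ids = list(dict.fromkeys(ids))
    existing = set(store.get(ids=unique_ids)["ids"])

    new_docs, new_ids = [], []
    for doc, doc_id in zip(docs, ids):
        if doc_id in existing:
            continue  # already stored, or seen earlier in this batch
        existing.add(doc_id)
        new_docs.append(doc)
        new_ids.append(doc_id)

    if new_docs:
        # Only these documents are handed to the store, so only they get embedded.
        store.add_documents(new_docs, ids=new_ids)
    return new_ids
```

For a single document, you would open the existing store (for example db = Chroma(persist_directory='db', embedding_function=embeddings)) and call add_missing_documents(db, [doc], [doc_id]). If the id is already present, nothing is embedded or added.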