I have the code:
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = “vector_index”
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=atlas_collection,
    index_name=vector_search_index,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Every time, results is just an empty array, and I don't understand what I'm doing wrong. I followed the examples in the LangChain documentation. The documents are saved to the database correctly, but searching the collection returns nothing.
So I was able to get this to work in MongoDB with the following code:
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)
DB_NAME = "langchain_db"
COLLECTION_NAME = "atlas_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
At this point, I got the same empty results you did. Before it would work, I had to create the vector search index in Atlas and make sure its name matched what is specified in ATLAS_VECTOR_SEARCH_INDEX_NAME:
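If you'd rather create that index from code than through the Atlas UI, here is a rough sketch using pymongo (assuming a recent pymongo with create_search_index support, OpenAI's 1536-dimension embeddings, and the default "embedding" field that MongoDBAtlasVectorSearch writes to):

from pymongo.operations import SearchIndexModel

# Sketch: create the Atlas Vector Search index programmatically.
# Assumes 1536-dimension OpenAI embeddings stored in the default
# "embedding" field used by MongoDBAtlasVectorSearch.
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1536,
                "similarity": "cosine",
            }
        ]
    },
    name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    type="vectorSearch",
)
MONGODB_COLLECTION.create_search_index(model=search_index_model)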

FWIW - It was easier for me to do in Astra DB (I tried this first, because I am a DataStax employee):
# Same imports as the MongoDB example above, plus the Astra DB vector store:
from langchain_astradb import AstraDBVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)
atlas_collection = "atlas_collection"
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
vector_search = AstraDBVectorStore.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection_name=atlas_collection,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Worth noting, that Astra DB will create your vector index automatically based on the dimensions of the embedding model.
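As a rough illustration of where that dimension comes from, you can check the embedding model's output size directly (assuming OPENAI_API_KEY is set; the default OpenAI embedding model returns 1536-dimensional vectors):

from langchain_openai import OpenAIEmbeddings

# Astra DB sizes the auto-created vector index from the embedding dimension.
embedding = OpenAIEmbeddings(disallowed_special=())
print(len(embedding.embed_query("dimension check")))  # 1536 for the default model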