SQL: Most effective way to store every word in a document separately

Here's my situation (or see the TL;DR at the bottom): I'm trying to build a system that searches a set of documents for user-entered words and returns the documents that contain those words. The user(s) will be searching through thousands of documents, each of which is 10-100+ pages long and stored on a web server.

The solution I have right now is to store each unique word in a table with an ID (there are maybe only 120,000 relevant words in the English language), and then, in a separate table, store the word ID, the document it appears in, and the number of times it appears in that document.

E.g: Document foo's text is

abc abc def

and document bar's text is

abc def ghi

Documents table:

id | name
---+------
 1 | 'foo'
 2 | 'bar'

Words table:

id | word
---+------
 1 | 'abc'
 2 | 'def'
 3 | 'ghi'

Word Document table:

word id | doc id | occurrences
--------+--------+------------
      1 |      1 |           2
      1 |      2 |           1
      2 |      1 |           1
      2 |      2 |           1
      3 |      2 |           1
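In SQL terms, the schema is roughly this (exact table names, column names and types here are just for illustration), and a lookup would look something like:

    CREATE TABLE documents (
        id   INT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

    CREATE TABLE words (
        id   INT PRIMARY KEY,
        word VARCHAR(100) NOT NULL UNIQUE
    );

    CREATE TABLE word_documents (
        word_id     INT NOT NULL REFERENCES words(id),
        doc_id      INT NOT NULL REFERENCES documents(id),
        occurrences INT NOT NULL,
        PRIMARY KEY (word_id, doc_id)
    );

    -- find documents containing the word 'abc', most occurrences first
    SELECT d.name, wd.occurrences
    FROM words w
    JOIN word_documents wd ON wd.word_id = w.id
    JOIN documents d       ON d.id = wd.doc_id
    WHERE w.word = 'abc'
    ORDER BY wd.occurrences DESC;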

As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes far too long to search through.

TL;DR My question is this:

How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL Server has full-text search built in for .doc and PDF files) based on custom factors (such as occurrence counts, among others), without ending up with an outright massive table linking every word to each document and its properties in that document?

Sorry for the long read and thanks for any help!

asked Dec 05 '13 by Roman

2 Answers

Rather than building your own search engine on top of SQL Server, have you considered using Lucene.NET, a C# .NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net

answered Sep 28 '22 by Jamie Clayton


Good question. I would piggyback on the existing solution in SQL Server: full-text indexing. It has an integrated indexing engine that will probably optimise considerably better than your own code could (unless the developers at Microsoft were lazy, or only got paid a dime to build it :-)).

Please see the background documentation on SQL Server full-text indexing. You can query catalog views such as sys.fulltext_index_fragments or use the full-text stored procedures.
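For example, something along these lines sets up a full-text index and runs a ranked search. It is only a sketch: the catalog name, the Body column and the key index name PK_Documents are placeholders for whatever your own schema uses.

    -- assumes dbo.Documents(id INT PRIMARY KEY, name ..., Body NVARCHAR(MAX))
    -- and that PK_Documents is the unique index on the id column
    CREATE FULLTEXT CATALOG DocumentCatalog;

    CREATE FULLTEXT INDEX ON dbo.Documents (Body)
        KEY INDEX PK_Documents
        ON DocumentCatalog;

    -- ranked search: CONTAINSTABLE returns one row per matching document,
    -- with the document key in [KEY] and a relevance score in [RANK]
    SELECT d.id, d.name, ft.[RANK]
    FROM CONTAINSTABLE(dbo.Documents, Body, 'abc') AS ft
    JOIN dbo.Documents AS d ON d.id = ft.[KEY]
    ORDER BY ft.[RANK] DESC;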

Of course, piggybacking on an existing solution has some drawbacks:

  1. You need to have a license for the solution.
  2. When your needs can no longer be served, you will have to program it all yourself.

But if you let SQL Server do the indexing, you can build your own solution on top of it more easily and in less time.
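For instance, the DMV sys.dm_fts_index_keywords_by_document exposes per-document term frequencies taken from the full-text index, which is essentially the word/document/occurrences data you were going to build yourself; you could feed those counts into your own ranking formula. A sketch only, assuming the same dbo.Documents table as above:

    -- per-document occurrence counts straight from the full-text index
    SELECT d.id,
           d.name,
           kw.display_term,
           kw.occurrence_count
    FROM sys.dm_fts_index_keywords_by_document(
             DB_ID(), OBJECT_ID('dbo.Documents')) AS kw
    JOIN dbo.Documents AS d ON d.id = kw.document_id
    WHERE kw.display_term = 'abc'
    ORDER BY kw.occurrence_count DESC;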

answered Sep 28 '22 by Guido Leenders