
Fast text search in over 600,000 files

I have a PHP application on a Linux server. It has a folder called notes_docs which contains over 600,000 txt files. The folder structure of notes_docs is as follows:

 - notes_docs
   - files_txt
     - 20170831
       - 1_837837472_abc_file.txt
       - 1_579374743_abc2_file.txt
       - 1_291838733_uridjdh.txt
       - 1_482737439_a8weele.txt
       - 1_733839474_dejsde.txt
     - 20170830
     - 20170829

I have to provide a fast text search utility which can show results in the browser. So if my user searches for "new york", all the files which have "new york" in them should be returned in an array. If the user searches for "foo", all files with "foo" in them should be returned.

I already tried code using scandir and DirectoryIterator, which is too slow: it takes more than a minute to search, and even then the search was not complete. I also tried Ubuntu's find, which was again slow, taking over a minute, because there are too many folder iterations and notes_docs is currently over 20 GB.

Any solution I can use to make this faster is welcome. I can make design changes, or have my PHP code call out (e.g. via curl) to code written in another language. I can make infrastructure changes too in extreme cases (such as keeping something in memory).

I want to know how people in industry do this. Companies like Indeed and ZipRecruiter all provide file search.

Please note I have 2 GB - 4 GB of RAM, so keeping all the files loaded in RAM all the time is not acceptable.

EDIT: All the inputs below are great. For those who come later: we ended up using Lucene for indexing and text search. It performed really well.

asked Sep 07 '17 by zookastos

2 Answers

To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to do a search. Your "over a minute" benchmarks are probably from a single test user. If you plan to offer this search on a multi-user website, you can quickly forget about it, because your disk I/O will be off the charts and block your entire server.

So your only option is to index all files, just as every other fast search utility does. Whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own, the files will have to be indexed.

Considering the txt files are text versions of PDF files you receive, I'm betting the easiest solution is to write the text to a database instead of to files. It won't take up much more disk space anyway.

Then you can enable full-text search on your database (MySQL, MSSQL and others support it) and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for the other solutions.
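For a rough idea of what that could look like, here is a minimal sketch using MySQL's FULLTEXT index and PDO; the database, table and column names (notes_db, notes, filename, body) are hypothetical, not something from the question:

<?php
// Hypothetical schema (run once):
//   CREATE TABLE notes (
//       id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//       filename VARCHAR(255) NOT NULL,
//       body     MEDIUMTEXT   NOT NULL,
//       FULLTEXT KEY ft_body (body)
//   ) ENGINE=InnoDB;

$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'pass');

// MATCH ... AGAINST uses the FULLTEXT index instead of scanning 600k files on disk.
$stmt = $pdo->prepare(
    'SELECT filename
       FROM notes
      WHERE MATCH(body) AGAINST(:q IN NATURAL LANGUAGE MODE)'
);
$stmt->execute([':q' => 'new york']);

// Array of matching filenames, ready to return to the browser.
$files = $stmt->fetchAll(PDO::FETCH_COLUMN);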

Now if you really want to speed things up, you could try to parse the resumes at a more detailed level. Try to retrieve locations, education, spoken languages and other information you regularly search for, and put them in separate tables/columns (see the sketch after the list below). This is a very difficult task and almost a project on its own, but if you want valuable search results, this is the way to go, because searching text without context gives very different results. Just think of your example "new york":

  1. I live in New York
  2. I studied at New York University
  3. I love the song "new york" from Alicia Keys in a personal bio
  4. I worked for New York Pizza
  5. I was born in new yorkshire, UK
  6. I spent a summer breeding new yorkshire terriers.
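As a rough illustration of that structured approach, building on the hypothetical notes table sketched earlier (the extra columns and the extraction step are assumptions, not part of the question):

<?php
// Hypothetical structured columns, filled in while parsing each resume:
//   ALTER TABLE notes
//       ADD COLUMN location  VARCHAR(100),
//       ADD COLUMN education VARCHAR(255);

$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'pass');

// "Lives in New York" is now an exact match on a structured field instead of
// a substring hit anywhere in the raw text, optionally combined with full text.
$stmt = $pdo->prepare(
    'SELECT filename
       FROM notes
      WHERE location = :location
        AND MATCH(body) AGAINST(:q IN NATURAL LANGUAGE MODE)'
);
$stmt->execute([':location' => 'New York', ':q' => 'php developer']);
$files = $stmt->fetchAll(PDO::FETCH_COLUMN);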
answered Oct 25 '22 by Hugo Delsing


I won't go too deep, but I will attempt to provide guidelines for creating a proof-of-concept.

1

First, download and extract Elasticsearch from here: https://www.elastic.co/downloads/elasticsearch and then run it:

bin/elasticsearch

2

Download FSCrawler from https://github.com/dadoonet/fscrawler#download-fscrawler, extract it and run it:

bin/fscrawler myCustomJob

Then stop it (Ctrl-C) and edit the corresponding myCustomJob/_settings.json (it has been created automatically and its path was printed on the console).
You can edit the properties: "url" (the path to be scanned), "update_rate" (you can make it 1m), "includes" (e.g. ["*.pdf","*.doc","*.txt"]), "index_content" (make it false to index only the filenames, not the file contents).
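For orientation, a minimal sketch of what the relevant parts of myCustomJob/_settings.json could look like; the exact layout (in particular the "fs" and "elasticsearch" sections and their defaults) depends on the FSCrawler version, so check the generated file rather than copying this verbatim:

{
  "name": "myCustomJob",
  "fs": {
    "url": "/path/to/notes_docs/files_txt",
    "update_rate": "1m",
    "includes": ["*.pdf", "*.doc", "*.txt"],
    "index_content": false
  },
  "elasticsearch": {
    "nodes": [
      { "host": "127.0.0.1", "port": 9200, "scheme": "HTTP" }
    ]
  }
}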

Run again:

bin/fscrawler myCustomJob

Note: Indexing is something you might later want to perform using code, but for now it will be done automatically by fscrawler, which talks directly to Elasticsearch.

3

Now start adding files to the directory that you specified in the "url" property.

4

Download Advanced REST Client for Chrome and make the following POST request:

URL: http://localhost:9200/_search

Raw payload:

{
  "query": { "wildcard": {"file.filename":"aFileNameToSearchFor*"} }
}

You will get back the list of matched files. Note: fscrawler indexes the filenames under the key file.filename.

5

Now, instead of using Advanced REST Client, you can use PHP to perform this query, either by making a REST call to the URL above, or by using the php-client API: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_search_operations.html
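For example, a minimal sketch of the REST-call option using PHP's curl extension (same host, endpoint and payload as above; the response parsing assumes fscrawler's default mapping with the filename under file.filename):

<?php
// Same wildcard query as the Advanced REST Client example above.
$payload = json_encode([
    'query' => ['wildcard' => ['file.filename' => 'aFileNameToSearchFor*']],
]);

$ch = curl_init('http://localhost:9200/_search');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Matched documents are under hits.hits; collect the indexed filenames.
$files = [];
foreach ($response['hits']['hits'] as $hit) {
    $files[] = $hit['_source']['file']['filename'];
}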

The same goes for indexing: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_indexing_documents.html

answered Oct 25 '22 by Marinos An