I have a PHP application on a Linux server. It has a folder called notes_docs which contains over 600,000 txt files. The folder structure of notes_docs is as follows:
- notes_docs
  - files_txt
    - 20170831
      - 1_837837472_abc_file.txt
      - 1_579374743_abc2_file.txt
      - 1_291838733_uridjdh.txt
      - 1_482737439_a8weele.txt
      - 1_733839474_dejsde.txt
    - 20170830
    - 20170829
I have to provide a fast text search utility which can show results on browser. So if my user searches for "new york", all the files which have "new york" in them, should be returned in an array. If user searches for "foo", all files with "foo" in them should be returned.
I already tried the code using scandir and DirectoryIterator, which is too slow. It takes more than a minute to search, and even then the search is not complete. I also tried Ubuntu's find, which was again slow, taking over a minute to complete, because there are too many folder iterations and notes_docs is currently over 20 GB in size.
Any solution that makes this faster is welcome. I can make design changes, or have my PHP code call out (e.g. via curl) to code in another language. I can also make infrastructure changes in extreme cases (such as using something in-memory).
I want to know how people in the industry do this. Companies like Indeed and ZipRecruiter all provide file search.
Please note I have 2 GB - 4 GB of RAM, so keeping all the files loaded in RAM all the time is not acceptable.
EDIT - All the inputs below are great. For those who come later: we ended up using Lucene for indexing and text search. It performed really well.
To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to do a search. Your benchmarks of "over a minute" are probably with a single test account. If you plan to search these via a multi-user website, you can quickly forget about it, because your disk IO will be off the charts and will block your entire server.
So your only option is to index all the files, just as every other quick search utility does. Whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own, the files will be indexed.
Considering the txt files are text versions of the pdf files you receive, I'm betting the easiest solution is to write the text to a database instead of a file. It won't take up much more disk space anyway.
Then you can enable full text search on your database (MySQL, MSSQL and others support it) and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for the other solutions.
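As a rough illustration of that route, assuming a hypothetical notes table with the extracted text in a content column and a FULLTEXT index on it, a MySQL full-text search from PHP could look like this (a minimal sketch, not a drop-in solution):

```php
<?php
// Minimal sketch: MySQL full-text search over extracted note text.
// Assumes a hypothetical table `notes` (filename VARCHAR, content MEDIUMTEXT)
// with a FULLTEXT index created beforehand, e.g.:
//   ALTER TABLE notes ADD FULLTEXT INDEX ft_content (content);

$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'pass');

// Return the filenames of all notes matching the search phrase.
function searchNotes(PDO $pdo, string $phrase): array
{
    $stmt = $pdo->prepare(
        'SELECT filename
           FROM notes
          WHERE MATCH(content) AGAINST (:phrase IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute(['phrase' => $phrase]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$files = searchNotes($pdo, 'new york'); // e.g. ["1_837837472_abc_file.txt", ...]
```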
Now if you really want to speed things up, you could try to parse the resumes at a more detailed level. Try to retrieve locations, education, spoken languages and other information you regularly search for, and put them in separate tables/columns. This is a very difficult task and almost a project on its own, but if you want valuable search results, this is the way to go, because searching text without context gives very different results; just think of your example "new york".
I won't go too deep, but I will attempt to provide guidelines for creating a proof-of-concept.
First, download and extract Elasticsearch from here: https://www.elastic.co/downloads/elasticsearch and then run it:
bin/elasticsearch
Download FSCrawler from https://github.com/dadoonet/fscrawler#download-fscrawler, extract it and run it:
bin/fscrawler myCustomJob
Then stop it (Ctrl-C) and edit the corresponding myCustomJob/_settings.json (it was created automatically and its path was printed on the console).
You can edit the properties: "url" (the path to be scanned), "update_rate" (you can make it 1m), "includes" (e.g. ["*.pdf","*.doc","*.txt"]) and "index_content" (make it false, to stay on the filename only), as sketched below.
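For orientation, the edited settings might look roughly like this (the path and values are illustrative, and the exact structure depends on your FSCrawler version, so keep the keys the generated file already contains):

```json
{
  "name": "myCustomJob",
  "fs": {
    "url": "/path/to/notes_docs/files_txt",
    "update_rate": "1m",
    "includes": ["*.pdf", "*.doc", "*.txt"],
    "index_content": false
  }
}
```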
Run again:
bin/fscrawler myCustomJob
Note: Indexing is something you might later want to perform using code, but for now it will be done automatically by fscrawler, which talks directly to Elasticsearch.
Now start adding files to the directory that you specified in the "url" property.
Download Advanced REST Client for Chrome and make the following POST:
URL: http://localhost:9200/_search
Raw payload:
{
"query": { "wildcard": {"file.filename":"aFileNameToSearchFor*"} }
}
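If you prefer the command line, the same request can be issued with curl (assuming Elasticsearch is running on the default host and port):

```
curl -XPOST "http://localhost:9200/_search" -H "Content-Type: application/json" -d '
{
  "query": { "wildcard": {"file.filename":"aFileNameToSearchFor*"} }
}'
```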
You will get back the list of matched files. Note: fscrawler indexes the filenames under the key file.filename.
Now, instead of using Advanced REST Client, you can use PHP to perform this query, either by making a REST call to the URL above or by utilizing the php-client API: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_search_operations.html
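As a rough sketch of the plain REST-call route (using PHP's built-in curl functions; the payload is the same wildcard query as above, and the hit layout assumes fscrawler's file.filename field):

```php
<?php
// Minimal sketch: query Elasticsearch from PHP via its REST API.
$payload = json_encode([
    'query' => ['wildcard' => ['file.filename' => 'aFileNameToSearchFor*']],
]);

$ch = curl_init('http://localhost:9200/_search');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Collect the matched filenames from the hits.
$filenames = array_map(
    function ($hit) { return $hit['_source']['file']['filename']; },
    $response['hits']['hits'] ?? []
);
print_r($filenames);
```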
The same goes for indexing: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_indexing_documents.html
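And if you later hand-roll the indexing through the classic elasticsearch-php client, a call might look roughly like this (the index name and document layout are assumptions mirroring the FSCrawler job, and the exact parameters depend on your Elasticsearch/client version):

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->setHosts(['localhost:9200'])->build();

// Index a single note so it becomes searchable by filename.
// Older Elasticsearch versions also require a 'type' parameter here.
$client->index([
    'index' => 'mycustomjob',   // assumed index name, matching the FSCrawler job
    'id'    => '1_837837472_abc_file.txt',
    'body'  => [
        'file' => ['filename' => '1_837837472_abc_file.txt'],
    ],
]);
```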