 

How to speed up a search on a large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis, etc.). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1TB.

Periodically I will need to search these files for occurrences of a particular string (not a regex). What is the quickest way to search through this data?

I have tried using grep and recursively searching through the directory as follows:

# LC_ALL=C forces byte-wise matching (no locale overhead); -i makes the search case-insensitive
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files

The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.

Is there a quicker way to search through this data? At the moment I am open to different approaches such as databases, Elasticsearch, etc. If I do go down the database route, I will have approx. 1 billion records.

My only requirements are:

1) The search will happen on my local computer (dual-core CPU and 8 GB RAM)

2) I will be searching for strings (not regex).

3) I will need to see all occurrences of the search string and the file it was within.

asked May 29 '20 by Matt9Atkins




1 Answer

There are a lot of answers already; I just want to add my two cents:

  1. With this much data (1 TB) and only 8 GB of memory, no approach will perform well, whether you use Lucene, Elasticsearch (which uses Lucene internally), or some grep command. The reason is simple: all of these systems keep data in the fastest memory available in order to serve results quickly, and out of 8 GB (of which you should reserve roughly 25% for the OS and at least another 25-50% for other applications) you are left with only a few GB of RAM.
  2. Upgrading to an SSD and increasing the RAM in your system will help, but it is quite cumbersome, and if you hit performance issues again it will be difficult to keep scaling your system vertically.

Suggestions

  1. I know you already mentioned that you want to do this on your system, but as I said it won't give any real benefit and you might end up wasting a lot of time (on infrastructure and on code, given the many approaches mentioned in the various answers). I would therefore suggest the top-down approach described in another answer of mine for determining the right capacity. It will help you quickly identify the correct capacity for whatever approach you choose.
  2. Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it is very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier, and scale up quickly later. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are already familiar with ES. A minimal indexing and search sketch is shown after this list.

  3. To make the search faster, you can split each file into multiple fields (title, body, tags, author, etc.) and index only the important fields, which reduces the size of the inverted index. If you are only looking for exact string matches (no partial or full-text search), you can simply use the keyword field type, which is even faster to index and search (see the mapping in the sketch below).

  4. I could go on about why Elasticsearch is good and how to optimize it, but that is not the crux. The bottom line is that any search needs a significant amount of memory, CPU, and disk, and any one of these becoming a bottleneck will hamper your local search as well as your other applications. Hence I advise you to seriously consider doing this on an external system; Elasticsearch really stands out there, as it is meant for distributed systems and is the most popular open-source search system today.
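For illustration, here is a minimal sketch of what points 2 and 3 could look like against a local Elasticsearch instance. This assumes ES is running on the default localhost:9200, and the index name (medical_records) and field names (file, country, diagnosis) are hypothetical placeholders for whatever structure you extract from your files:

# Create an index whose fields are exact-match "keyword" fields (no full-text analysis)
curl -X PUT "localhost:9200/medical_records" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "file":      { "type": "keyword" },
      "country":   { "type": "keyword" },
      "diagnosis": { "type": "keyword" }
    }
  }
}'

# Index one record, keeping track of the file it came from
curl -X POST "localhost:9200/medical_records/_doc" -H 'Content-Type: application/json' -d'
{ "file": "records_1990_03.txt", "country": "UK", "diagnosis": "asthma" }'

# Exact-match search that returns every matching record along with its source file
curl -X GET "localhost:9200/medical_records/_search" -H 'Content-Type: application/json' -d'
{ "query": { "term": { "diagnosis": "asthma" } } }'

Because the fields are mapped as keyword, the term query does an exact lookup in the inverted index instead of scanning the raw text, which is what makes it so much faster than running grep over 1 TB of files.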
answered Sep 16 '22 by Amit