 

How to speed up a search on a large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis, etc.). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1TB.

Periodically I will need to search these files for occurrences of a particular string (not a regex). What is the quickest way to search through this data?

I have tried using grep and recursively searching through the directory as follows:

# LC_ALL=C forces byte-wise matching (no locale overhead); -i makes the search case-insensitive
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files

The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.

Is there a quicker way to search through this data? At the moment I am open to different approaches such as databases, Elasticsearch, etc. If I do go down the database route, I will have approx. 1 billion records.

My only requirements are:

1) The search will happen on my local computer (dual-core CPU and 8 GB RAM)

2) I will be searching for strings (not regex).

3) I will need to see all occurrences of the search string and the file it was within.

asked May 29 '20 by Matt9Atkins




1 Answer

There are a lot of answers already; I just want to add my two cents:

  1. With this much data (1 TB) and only 8 GB of memory, no approach will perform well, whether you use Lucene, Elasticsearch (which uses Lucene internally), or some grep command. The reason is simple: all of these systems keep data in the fastest memory available in order to serve results quickly, and out of 8 GB (of which you should reserve roughly 25% for the OS and at least another 25-50% for other applications) you are left with only a few GB of RAM.
  2. Upgrading to an SSD and increasing the RAM in your system will help, but it is quite cumbersome, and if you hit performance issues again it will be difficult to keep scaling your system vertically.

Suggestions

  1. I know you already mentioned that you want to do this on your system, but as I said it won't give any real benefit and you might end up wasting a lot of time (on infrastructure and on code, given the many approaches mentioned in the various answers). I would therefore suggest the top-down approach described in another answer of mine for determining the right capacity. It will help you quickly identify the correct capacity for whatever approach you choose.
  2. Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it is very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier, and scale up quickly later. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are already familiar with ES. A minimal indexing and search sketch is shown after this list.

  3. To make the search faster, you can split each file into multiple fields (title, body, tags, author, etc.) and index only the important fields, which reduces the size of the inverted index. If you are only looking for exact string matches (no partial or full-text search), you can simply use the keyword field type, which is even faster to index and search (see the mapping in the sketch below).

  4. I could go on about why Elasticsearch is good and how to optimize it, but that is not the crux. The bottom line is that any search needs a significant amount of memory, CPU, and disk, and any one of these becoming a bottleneck will hamper your local search as well as your other applications. Hence I advise you to seriously consider doing this on an external system; Elasticsearch really stands out there, as it is meant for distributed systems and is the most popular open-source search system today.
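For illustration, here is a minimal sketch of what points 2 and 3 could look like against a local Elasticsearch instance. This assumes ES is running on the default localhost:9200, and the index name (medical_records) and field names (file, country, diagnosis) are hypothetical placeholders for whatever structure you extract from your files:

# Create an index whose fields are exact-match "keyword" fields (no full-text analysis)
curl -X PUT "localhost:9200/medical_records" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "file":      { "type": "keyword" },
      "country":   { "type": "keyword" },
      "diagnosis": { "type": "keyword" }
    }
  }
}'

# Index one record, keeping track of the file it came from
curl -X POST "localhost:9200/medical_records/_doc" -H 'Content-Type: application/json' -d'
{ "file": "records_1990_03.txt", "country": "UK", "diagnosis": "asthma" }'

# Exact-match search that returns every matching record along with its source file
curl -X GET "localhost:9200/medical_records/_search" -H 'Content-Type: application/json' -d'
{ "query": { "term": { "diagnosis": "asthma" } } }'

Because the fields are mapped as keyword, the term query does an exact lookup in the inverted index instead of scanning the raw text, which is what makes it so much faster than running grep over 1 TB of files.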
answered Sep 16 '22 by Amit