
Looking up a "key" in an 8GB+ text file

I have some 'small' text files that contain about 500000 entries/rows. Each row also has a 'key' column. I need to find these keys in a big file (8GB, at least 219 million entries). When found, I need to append the 'Value' from the big file to the matching row of the small file, as a new column at the end.

The big file looks like this:

KEY                 VALUE
"WP_000000298.1"    "abc"
"WP_000000304.1"    "xyz"
"WP_000000307.1"    "random"
"WP_000000307.1"    "text"
"WP_000000308.1"    "stuff"
"WP_000000400.1"    "stuffy"

Simply put, I need to look up 'key' in the big file.

Obviously I need to load the whole table into RAM (but this is not a problem; I have 32GB available). The big file seems to be already sorted; I have to check this.
The problem is that I cannot do a fast lookup using something like TDictionary because, as you can see, the key is not unique.

Note: This is probably a one-time computation. I will use the program once, then throw it away. So, it doesn't have to be the BEST algorithm (difficult to implement). It just needs to finish in a decent time (like 1-2 days). PS: I prefer doing this without a DB.

I was thinking of this possible solution: TList.BinarySearch. But it seems that TList is limited to only 134,217,727 (MaxInt div 16) items, so TList won't work.
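For illustration, here is a minimal sketch of that binary-search idea over a plain dynamic array instead of TList (a TArray<TEntry> with an Integer index comfortably covers 219 million items, assuming the strings fit in RAM); the TEntry record and helper names are mine, not part of the question:

    uses
      System.SysUtils, System.Generics.Collections, System.Generics.Defaults;

    type
      TEntry = record
        Key: string;
        Value: string;
      end;

    var
      Big: TArray<TEntry>;  // the 219+ million rows, parsed from the big file elsewhere

    // Sort once by Key (can be skipped if the big file turns out to be sorted already).
    procedure SortBigByKey;
    begin
      TArray.Sort<TEntry>(Big, TComparer<TEntry>.Construct(
        function(const L, R: TEntry): Integer
        begin
          Result := CompareStr(L.Key, R.Key);
        end));
    end;

    // Lower-bound binary search: index of the FIRST entry matching AKey, or -1 if absent.
    function FirstIndexOfKey(const AKey: string): Integer;
    var
      Lo, Hi, Mid: Integer;
    begin
      Result := -1;
      Lo := 0;
      Hi := High(Big);
      while Lo <= Hi do
      begin
        Mid := (Lo + Hi) div 2;
        if CompareStr(Big[Mid].Key, AKey) < 0 then
          Lo := Mid + 1
        else
        begin
          if Big[Mid].Key = AKey then
            Result := Mid;  // remember the hit, keep searching to the left
          Hi := Mid - 1;
        end;
      end;
    end;

Because the search returns the first match, duplicate keys are not a problem: walk forward from the returned index while Big[i].Key still equals the key to collect every Value.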


Conclusion:
I chose Arnaud Bouchez's solution. His TDynArray is impressive! I totally recommend it if you need to process large files.
AlekseyKharlanov provided another nice solution, but TDynArray was already implemented.

asked Sep 25 '16 by Server Overflow

1 Answer

Instead of re-inventing the wheel of binary search or B-Tree, try an existing implementation.

Feed the content into a SQLite3 in-memory DB (with the proper index, and with a transaction every 10,000 INSERTs) and you are done. Ensure you target Win64, to have enough space in RAM. You may even use file-based storage: a bit slower to create, but with indexes, queries by Key will be instant. If you do not have SQLite3 support in your edition of Delphi (via the latest FireDAC), you may use our OpenSource unit and its associated documentation.
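As a rough sketch of that pattern (component setup and table layout here are illustrative only, using FireDAC's SQLite driver rather than the answer's own unit; ReadNextBigFileRow stands for whatever routine parses the 8GB file):

    uses
      System.SysUtils, FireDAC.Comp.Client, FireDAC.Stan.Param;
      // plus the SQLite physical driver unit appropriate to your Delphi version

    var
      Conn: TFDConnection;
      Qry: TFDQuery;
      Key, Value: string;
      i: Integer;
    begin
      Conn := TFDConnection.Create(nil);
      Conn.Params.DriverID := 'SQLite';
      Conn.Params.Database := ':memory:';  // or a file name for file-based storage
      Conn.Open;
      Conn.ExecSQL('CREATE TABLE big (key TEXT, value TEXT)');

      Qry := TFDQuery.Create(nil);
      Qry.Connection := Conn;
      Qry.SQL.Text := 'INSERT INTO big (key, value) VALUES (:k, :v)';

      Conn.StartTransaction;
      i := 0;
      while ReadNextBigFileRow(Key, Value) do  // hypothetical parser of the big file
      begin
        Qry.ParamByName('k').AsString := Key;
        Qry.ParamByName('v').AsString := Value;
        Qry.ExecSQL;
        Inc(i);
        if i mod 10000 = 0 then  // transaction every 10,000 INSERTs
        begin
          Conn.Commit;
          Conn.StartTransaction;
        end;
      end;
      Conn.Commit;

      // Creating the index after the bulk insert is cheaper (see the timings below).
      Conn.ExecSQL('CREATE INDEX idx_big_key ON big (key)');

      // Lookup for one key from the small file (may return several rows):
      Qry.SQL.Text := 'SELECT value FROM big WHERE key = :k';
      Qry.ParamByName('k').AsString := 'WP_000000307.1';
      Qry.Open;
      while not Qry.Eof do
      begin
        // append Qry.FieldByName('value').AsString to the small file's row here
        Qry.Next;
      end;
    end;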

Using SQLite3 will definitely be faster, and use less resources, than a regular client-server SQL database - BTW the "free" edition of MS SQL is not able to handle as much data as you need, AFAIR.

Update: I've written some sample code to illustrate how to use SQLite3, with our ORM layer, for your problem - see this source code file on GitHub.

Here are some benchmark numbers:

  with index defined before insertion:
    INSERT 1000000 rows in 6.71s
    SELECT 1000000 rows per Key index in 1.15s

  with index created after insertion:
    INSERT 1000000 rows in 2.91s
    CREATE INDEX 1000000 in 1.28s
    SELECT 1000000 rows per Key index in 1.15s

  without the index:
    INSERT 1000000 rows in 2.94s
    SELECT 1000000 rows per Key index in 129.27s

So for a huge data set, an index is worth it, and creating the index after the data insertion reduces the resources used! Even if the insertion is slower, the gain of an index is huge when selecting per key. You may try to do the same with MS SQL, or using another ORM, and I guess you will cry. ;)

answered Sep 22 '22 by Arnaud Bouchez