Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# code to perform Binary search in a very big text file

Is there a library that I can use to perform binary search in a very big text file (can be 10GB).

The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.

like image 655
Alex Avatar asked Feb 28 '23 05:02

Alex


2 Answers

I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.

You won't find it in a library, for two reasons:

  1. It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
  2. Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).

On summary - I think your need is too specific and too simple to implement in custom code for someone to bother writing a library :)

like image 149
Virgil Avatar answered Mar 08 '23 08:03

Virgil


As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.

The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.

When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.

You should find that you will reach your target pretty quickly.

like image 39
Neil Moss Avatar answered Mar 08 '23 10:03

Neil Moss