Algorithm for merging large files

I have several log files of events (one event per line). The logs can possibly overlap. The logs are generated on separate client machines, possibly in multiple time zones (but I assume I know each file's time zone). Each event has a timestamp that was normalized into a common time (by instantiating each log parser's Calendar instance with the time zone appropriate to the log file and then using getTimeInMillis to get the UTC time). The logs are already sorted by timestamp. Multiple events can occur at the same time, but they are by no means equal events.
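For illustration, a rough Python equivalent of that normalization step might look like the sketch below (the timestamp format and the function name are assumptions; the actual parser described above uses Java's Calendar):

from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc_millis(timestamp_text, tz_name):
    # Hypothetical sketch of the normalization described above: parse the
    # log's local timestamp (format assumed), attach the time zone known
    # for that log file, and convert to UTC milliseconds, analogous to
    # Calendar.getTimeInMillis() in the Java parser.
    local = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return int(local.replace(tzinfo=ZoneInfo(tz_name)).timestamp() * 1000)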

These files can be relatively large, as in, 500000 events or more in a single log, so reading the entire contents of the logs into a simple Event[] is not feasible.

What I am trying to do is merge the events from each of the logs into a single log. It is kinda like a mergesort task, but each log is already sorted; I just need to bring them together. The second component is that the same event can be witnessed in each of the separate log files, and I want to remove duplicate events in the output log file.

Can this be done "in place", as in, sequentially working over some small buffers of each log file? I can't simply read in all the files into an Event[], sort the list, and then remove duplicates, but so far my limited programming capabilities only enable me to see this as the solution. Is there some more sophisticated approach that I can use to do this as I read events from each of the logs concurrently?

Josh asked Dec 03 '22


2 Answers

  1. Read the first line from each of the log files

  2. LOOP

    a. Find the "earliest" line.

    b. Insert the "earliest" line into the master log file

    c. Read the next line from the file that contained the earliest line

You could check for duplicates between steps b and c, advancing the pointer in each of the files that produced the duplicate. A sketch of this loop, with the duplicate check, follows below.
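Here is a minimal sketch of that loop in Python, using the standard library's heapq.merge to do the "find the earliest line" step; it assumes duplicate events produce identical lines and that key_func extracts the normalized UTC timestamp from a line:

import heapq

def merge_logs(log_files, key_func, output):
    # heapq.merge performs the k-way "pick the earliest line" loop over the
    # already-sorted inputs; open file objects iterate line by line, so the
    # logs are never read fully into memory.
    current_key = None
    seen_at_key = set()
    for line in heapq.merge(*log_files, key=key_func):
        key = key_func(line)
        if key != current_key:
            # The timestamp advanced; forget lines seen at the previous one.
            current_key = key
            seen_at_key = set()
        if line not in seen_at_key:
            # The duplicate check from steps b/c: write an identical line
            # only once per timestamp, even if several logs recorded it.
            seen_at_key.add(line)
            output.write(line)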

Adam Tegen answered Dec 06 '22


Sure: open every log file. Read the first line of each into an array of 'current' lines. Then repeatedly pick the line with the lowest timestamp from the current array. Write it to the output, and read a new line from the appropriate source file to replace it.

Here's an example in Python, but it makes good pseudocode, too:

def merge_files(files, key_func):
    # Prime the 'current' array with the first line from each file,
    # dropping any file that is empty to begin with
    current = [file.readline() for file in files]
    files = [f for f, line in zip(files, current) if line]
    current = [line for line in current if line]
    while len(current) > 0:
        # Find the index of the line with the lowest key according to key_func
        min_idx = min(range(len(files)), key=lambda x: key_func(current[x]))
        yield current[min_idx]
        new_line = files[min_idx].readline()
        if not new_line:
            # EOF, remove this file from consideration
            del current[min_idx]
            del files[min_idx]
        else:
            current[min_idx] = new_line
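
One hypothetical way to drive the generator (the file names and the key function, which assumes each line starts with the normalized UTC-millis timestamp, are made up for the example):

# Hypothetical driver: file names and the timestamp-extracting key function
# are assumptions, not part of the answer above.
files = [open(name) for name in ("client_a.log", "client_b.log", "client_c.log")]
try:
    with open("merged.log", "w") as out:
        for line in merge_files(files, key_func=lambda l: int(l.split()[0])):
            out.write(line)
finally:
    for f in files:
        f.close()

Duplicate removal is not handled here; the per-timestamp set check sketched under the first answer can be applied to the merged stream before writing.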
Nick Johnson answered Dec 06 '22