
Read n lines of a big text file

The smallest file I have has > 850k lines and every line is of unknown length. The goal is to read n lines from this file in the browser. Reading it fully is not going to happen.

Here is the HTML <input type="file" name="file" id="file"> and the JS I have:

var n = 10;
var reader = new FileReader();
reader.onload = function(progressEvent) {
  // Entire file
  console.log(this.result);

  // By lines
  var lines = this.result.split('\n');
  for (var line = 0; line < n; line++) {
    console.log(lines[line]);
  }
};

Obviously, the problem here is that it first tries to read the whole file and only then splits it by newline. So regardless of n, it will try to read the whole file, and will end up reading nothing when the file is big.

How should I do it?

Note: I am willing to delete the whole function and start from scratch, as long as I can console.log() every line that is read.


*"every line is of unknown length" -> means that the file is something like this:

(0, (1, 2))
(1, (4, 5, 6))
(2, (7))
(3, (8))

Edit:

The way to go would be something like the approach in "filereader api on big files", but I can't see how to modify that to read n lines of the file...

Combining it with "Uint8Array to string in Javascript", one can decode each chunk like this:

var view = new Uint8Array(fr.result);
var string = new TextDecoder("utf-8").decode(view);
console.log("Chunk " + string);

but this may not read the last line as a whole, so how are you going to determine the lines later? For example here is what it printed:

((7202), (u'11330875493', u'2554375661'))
((1667), (u'9079074735', u'6883914476',

gsamaras asked Sep 13 '16 21:09


1 Answer

The logic is very similar to what I wrote in my answer to "filereader api on big files", except that you need to keep track of the number of lines processed so far, and also of the last line read so far, because it may not have ended yet. The readSomeLines example below decodes the input as UTF-8 (the TextDecoder default, which also covers ASCII); if you need another encoding, look at the options for the TextDecoder constructor.
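Before the full example, here is a minimal sketch of just the "last line may not have ended yet" part, using two made-up chunks in the format from the question. The helper name splitChunk is hypothetical and only exists for this illustration.

```javascript
// Hypothetical helper, for illustration only: joins the unfinished tail of
// the previous chunk with the next chunk and returns the complete lines.
function splitChunk(carry, chunk) {
    var lines = (carry + chunk).split('\n');
    // The last element is either '' (the chunk ended on a newline) or an
    // unfinished line; keep it for the next round instead of emitting it.
    var rest = lines.pop();
    return { lines: lines, carry: rest };
}

// Two made-up chunks; the first one is cut in the middle of a line.
var first = splitChunk('', '(0, (1, 2))\n(1, (4, ');
var second = splitChunk(first.carry, '5, 6))\n(2, (7))\n');

console.log(first.lines);  // complete lines found in the first chunk
console.log(first.carry);  // unfinished tail, held back for the next chunk
console.log(second.lines); // the held-back line is completed here
```

The line that was cut at the chunk boundary is only reported once the second chunk supplies its ending.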

If you are certain that the input is ASCII (or any other single-byte encoding), then you can also skip the use of TextDecoder and directly read the input as text using the FileReader's readAsText method.
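The single-byte condition matters because a chunk boundary can fall in the middle of a multi-byte character. A small sketch of what happens in that case, assuming UTF-8 input (the sample string and variable names are made up for illustration):

```javascript
// "é" (U+00E9) is two bytes in UTF-8: 0xC3 0xA9. Cut the input between them.
var bytes = new TextEncoder().encode('caf\u00e9\n');
var head = bytes.slice(0, 4); // 'c', 'a', 'f', 0xC3
var tail = bytes.slice(4);    // 0xA9, '\n'

// A fresh decoder per chunk: each half of the character is invalid on its
// own and is replaced by U+FFFD.
var broken = new TextDecoder().decode(head) + new TextDecoder().decode(tail);

// One shared decoder with stream:true: the incomplete byte is buffered
// until the rest of the character arrives.
var decoder = new TextDecoder();
var intact = decoder.decode(head, {stream: true}) +
             decoder.decode(tail, {stream: true});

console.log(broken === 'caf\u00e9\n'); // false
console.log(intact === 'caf\u00e9\n'); // true
```

With single-byte input no character can be split this way, which is why decoding each chunk on its own is safe there.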

// This is just an example of the function below.
document.getElementById('start').onclick = function() {
    var file = document.getElementById('infile').files[0];
    if (!file) {
        console.log('No file selected.');
        return;
    }
    var maxlines = parseInt(document.getElementById('maxlines').value, 10);
    var lineno = 1;
    // readSomeLines is defined below.
    readSomeLines(file, maxlines, function(line) {
        console.log("Line " + (lineno++) + ": " + line);
    }, function onComplete(error) {
        console.log(error ? 'Read failed: ' + error : 'Done reading lines');
    });
};

/**
 * Read up to and including |maxlines| lines from |file|.
 *
 * @param {Blob} file - The file to be read.
 * @param {number} maxlines - The maximum number of lines to read.
 * @param {function(string)} forEachLine - Called for each line.
 * @param {function(error)} onComplete - Called when the end of the file
 *     is reached or when |maxlines| lines have been read.
 */
function readSomeLines(file, maxlines, forEachLine, onComplete) {
    var CHUNK_SIZE = 50000; // 50 kB, arbitrarily chosen.
    var decoder = new TextDecoder(); // Defaults to UTF-8.
    var offset = 0;
    var linecount = 0;
    var results = ''; // Decoded text that is not yet a complete line.
    var fr = new FileReader();
    fr.onload = function() {
        // Use stream:true in case we cut the file
        // in the middle of a multi-byte character
        results += decoder.decode(fr.result, {stream: true});
        var lines = results.split('\n');
        results = lines.pop(); // In case the line did not end yet.
        linecount += lines.length;
    
        if (linecount > maxlines) {
            // Read too many lines? Truncate the results.
            lines.length -= linecount - maxlines;
            linecount = maxlines;
        }
    
        for (var i = 0; i < lines.length; ++i) {
            forEachLine(lines[i] + '\n');
        }
        offset += CHUNK_SIZE;
        seek();
    };
    fr.onerror = function() {
        onComplete(fr.error);
    };
    seek();
    
    function seek() {
        if (linecount === maxlines) {
            // We found enough lines.
            onComplete(); // Done.
            return;
        }
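        if (offset !== 0 && offset >= file.size) {
            // We did not find all lines, but there are no more lines.
            results += decoder.decode(); // Flush bytes still buffered.
            if (results !== '') {
                // Last line without a trailing newline (from lines.pop()).
                forEachLine(results);
            }
            onComplete(); // Done
            return;
        }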
        var slice = file.slice(offset, offset + CHUNK_SIZE);
        fr.readAsArrayBuffer(slice);
    }
}
Read <input type="number" id="maxlines"> lines from
<input type="file" id="infile">.
<input type="button" id="start" value="Print lines to console">
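For environments that support Blob.prototype.arrayBuffer() and async functions, the same chunked approach can be written without FileReader. This is only a sketch under that assumption: readSomeLinesAsync and its chunkSize parameter are hypothetical names, and unlike the callback version above it collects the lines into an array and leaves out the trailing '\n'.

```javascript
// Sketch only: a promise-based variant of the chunked reader.
// Resolves with an array of at most |maxlines| lines (without '\n').
async function readSomeLinesAsync(file, maxlines, chunkSize) {
    chunkSize = chunkSize || 50000; // Same arbitrary default as above.
    var decoder = new TextDecoder(); // UTF-8 by default.
    var lines = [];
    var pending = ''; // Unfinished line carried over between chunks.

    for (var offset = 0;
         offset < file.size && lines.length < maxlines;
         offset += chunkSize) {
        var slice = file.slice(offset, offset + chunkSize);
        var buffer = await slice.arrayBuffer();
        // stream:true keeps a split multi-byte character for the next chunk.
        pending += decoder.decode(buffer, {stream: true});
        var parts = pending.split('\n');
        pending = parts.pop(); // May be an incomplete line.
        for (var i = 0; i < parts.length && lines.length < maxlines; ++i) {
            lines.push(parts[i]);
        }
    }

    if (lines.length < maxlines) {
        // End of file reached: flush the decoder and keep the last line
        // if the file did not end with a newline.
        pending += decoder.decode();
        if (pending !== '') {
            lines.push(pending);
        }
    }
    return lines;
}
```

With the inputs from the HTML above, it could be called as readSomeLinesAsync(document.getElementById('infile').files[0], 10).then(function(lines) { lines.forEach(function(l) { console.log(l); }); });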

Rob W answered Oct 08 '22 19:10