PHP: What is an efficient way to parse a text file containing very long lines?

I'm working on a parser in PHP which is designed to extract MySQL records out of a text file. A particular line might begin with a string corresponding to which table the records (rows) need to be inserted into, followed by the records themselves. The records are delimited by a backslash and the fields (columns) are separated by commas. For the sake of simplicity, let's assume that we have a table representing people in our database, with fields being First Name, Last Name, and Occupation. Thus, one line of the file might be as follows:

[People] = "\Han,Solo,Smuggler\Luke,Skywalker,Jedi..."

Where the ellipses (...) could be additional people. One straightforward approach might be to use fgets() to extract a line from the file, and use preg_match() to extract the table name, records, and fields from that line.
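For concreteness, that straightforward approach might look something like this (the regex is an assumption based on the example line above, not part of any real spec):

```php
<?php
// Naive approach: read the whole line into memory, then regex-match
// the table name and payload, then split the payload twice more.
$line = '[People] = "\Han,Solo,Smuggler\Luke,Skywalker,Jedi"';

if (preg_match('/^\[(\w+)\]\s*=\s*"(.*)"\s*$/', $line, $m)) {
    $table = $m[1];                                  // 'People'
    $records = array_slice(explode('\\', $m[2]), 1); // drop the empty piece before the first '\'
    foreach ($records as $i => $record) {
        $records[$i] = explode(',', $record);        // split each record into fields
    }
}
```

Every character gets scanned at least twice here: once by preg_match() and again by the explode() calls.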

However, let's suppose that we have an awful lot of Star Wars characters to track. So many, in fact, that this line ends up being 200,000+ characters/bytes long. In such a case, taking the above approach to extract the database information seems a bit inefficient. You have to first read hundreds of thousands of characters into memory, then read back over those same characters to find regex matches.

Is there a way, similar to the Java String next(String pattern) method of the Scanner class constructed using a file, that allows you to match patterns in-line while scanning through the file?

The idea is that you don't have to scan through the same text twice (to read it from the file into a string, and then to match patterns) or store the text redundantly in memory (in both the file line string and the matched patterns). Would this even yield a significant increase in performance? It's hard to tell exactly what PHP or Java are doing behind the scenes.

On fgetcsv()
This function makes it very easy to split lines in a file based on some delimiter, and I'm sure it checks for the delimiter character by character as it scans through the file. However, the problem is that there are essentially two delimiters that I'm looking for, and fgetcsv() only accepts one. For example:

I could use ',' as the delimiter. Provided I changed the file format to also have commas with a backslash, I could read the entire line into an array of fields. The problem, then, is that I need to reiterate over all of the fields to determine where records start and end and to prepare the SQL. Similarly, if I use '\' as the delimiter (a single backslash, escaped here), then I'll need to reiterate over all of the records to extract the fields and prepare the SQL.
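For illustration, the comma-delimiter variant might look something like this, assuming the file format were tweaked so a comma precedes each record's backslash (a hypothetical reformatting; the second loop is exactly the re-iteration I'd like to avoid). Note the escape parameter is disabled so the backslashes pass through fgetcsv() untouched (that requires PHP 7.4+):

```php
<?php
// Hypothetical reformatted line: '\Han,Solo,Smuggler,\Luke,Skywalker,Jedi'
$fp = fopen('php://temp', 'r+');
fwrite($fp, '\Han,Solo,Smuggler,\Luke,Skywalker,Jedi' . "\n");
rewind($fp);

$fields = fgetcsv($fp, 0, ',', '"', ''); // escape disabled so '\' survives
fclose($fp);

// Second pass: walk the flat field list to find record boundaries.
$records = [];
$current = null;
foreach ($fields as $f) {
    if ($f !== '' && $f[0] === '\\') {       // leading backslash marks a new record
        if ($current !== null) $records[] = $current;
        $current = [substr($f, 1)];
    } else {
        $current[] = $f;
    }
}
if ($current !== null) $records[] = $current;
```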

What I am trying to do is to check for both commas and backslashes (and perhaps other things, like the [tablename]) in one fell swoop for maximum performance. If fgetcsv() allowed me to specify multiple delimiters (or a regex), or allowed me to change what it considers to be the "end of a line" (from \n or \r\n to just \), then it would work perfectly, but that doesn't seem possible.

Asked Mar 31 '10 by Shaun

1 Answer

You could write a character-by-character accumulation loop that (a) pushes field strings onto an array when it encounters commas and (b) calls a function to save accumulated field strings to a mysql database when it finds the record signifier:

$fields = array();
$accumulator = array();
while(($c = fgetc($fp)) !== false) {
  if($c == ',') {
    $fields[] = implode('', $accumulator);
    $accumulator = array();
  } else if($c == '\\') {
    save_fields_to_mysql($fields);
    $fields = array();
    $accumulator = array();
  } else {
    $accumulator[] = $c;
  }
}

This will probably work for you if you're certain that your fields never contain your field or record separators as data.

If that's a possibility, you'll need to come up with an escape sequence to represent literal occurrences of your field and record separators (and probably of your escape character as well). Let's say that this is the case, and take the % sign as the escape character:

define('ESCAPED',1);
define('NORMAL',0);

$readState = NORMAL;
$fields = array();
$accumulator = array();
while(($c = fgetc($fp)) !== false) {
  if($readState == ESCAPED) {
    $accumulator[] = $c;
    $readState = NORMAL;
  } else if($c == '%') {
    $readState = ESCAPED;
  } else if($c == ',') {
    $fields[] = implode('', $accumulator);
    $accumulator = array();
  } else if($c == '\\') {
    save_fields_to_mysql($fields);
    $fields = array();
    $accumulator = array();
  } else {
    $accumulator[] = $c;
  }
}

i.e., any occurrence of a % sets a state variable indicating that on the next pass through the loop, whatever character we read will be taken as literal data belonging to a field rather than as a signifier. (Note that the loop conditions above test explicitly against false, since fgetc() returning the string '0' would otherwise end the loop early.)

This should keep your memory usage at a minimum.

[Update] What about I/O efficiency?

One commenter correctly pointed out that this illustration is pretty I/O intensive, and since I/O tends to be the most costly operation in terms of time, it's entirely possible it wouldn't be an acceptable solution.

At the other end of the spectrum we have the option of buffering the entire file into memory, which is the memory-intensive approach the asker mentioned but wanted to avoid. The happy medium probably lies somewhere in the middle: we can use the read-limit you can pass as the second argument to fgets() to pull in a somewhat large (but not ridiculously large) number of characters in a single I/O gulp, and then process that buffer character by character instead of the I/O stream, refilling it whenever we burn through it.

This does make the read process a little more code intensive than $c = fgetc($fp), though, because you have to monitor where you are in the buffer and how full the buffer is as well as where you are in the file. You can do this with a series of flags and index variables inside the read loop if you want, but it might be more convenient to have an abstraction something like this:

class StrBufferedChrReader {

    private $_filename;
    private $_fp; 

    private $_bufferIdx;
    private $_bufferMax = 2048;
    private $_buffer;

    function __construct($filename=null,$bufferMax=null) {
        if($bufferMax) $this->_bufferMax = $bufferMax;
        if($filename) $this->open($filename);
    }

    function _refillBuffer() {
        if($this->_fp) {
            $this->_buffer = fgets($this->_fp,$this->_bufferMax + 1);
            $this->_bufferIdx = 0;
            return $this->_buffer;
        }
        return false;
    }

    function open($filename=null) {
        if($filename) $this->_filename = $filename;
        if($this->_fp = fopen($this->_filename, 'r'))
            $this->_refillBuffer();
        return $this->_fp;
    }

    function getc() {
        // refill when the buffer is exhausted; fgets() may return fewer
        // than $_bufferMax characters, so compare against its real length
        if(!is_string($this->_buffer) || $this->_bufferIdx >= strlen($this->_buffer))
            if(!$this->_refillBuffer())
                return false;
        return $this->_buffer[$this->_bufferIdx++];
    }

    function close() {
        $this->_buffer = null;
        $this->_bufferIdx = null;
        return fclose($this->_fp);
    }
}

Which you could use in either loop above like so:

$r = new StrBufferedChrReader($filename,$bufferSize);
while(($c = $r->getc()) !== false) {
    ...

Something like this allows you to stake out a lot of different spots along the continuum between a memory-intensive solution and an I/O intensive solution by changing $bufferSize. Bigger $bufferSize, more memory usage, fewer I/O ops. Smaller $bufferSize, less memory usage, more I/O ops.
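If you want to find your spot on that continuum empirically, a quick timing sketch like the following might help (self-contained and purely illustrative: the temp file, the 8 KB size, and the stubbed-out inner logic are all assumptions, and real numbers will depend on your disk and OS cache):

```php
<?php
// Compare fgetc() (one library call per character) against fread()
// with a buffer (one call per 8 KB, then an in-memory scan).
$file = tempnam(sys_get_temp_dir(), 'sw');
file_put_contents($file, str_repeat('\Han,Solo,Smuggler', 10000));

// Character-at-a-time reads
$fp = fopen($file, 'r');
$t0 = microtime(true);
while (fgetc($fp) !== false) {}
$tChar = microtime(true) - $t0;
fclose($fp);

// Buffered reads: fetch 8 KB at once, then scan it in memory
$fp = fopen($file, 'r');
$t0 = microtime(true);
$n = 0;
while (($buf = fread($fp, 8192)) !== false && $buf !== '') {
    $len = strlen($buf);
    for ($i = 0; $i < $len; $i++) {
        $c = $buf[$i]; // feed $c to the same accumulator logic as above
        $n++;
    }
}
$tBuf = microtime(true) - $t0;
fclose($fp);
unlink($file);
printf("fgetc: %.4fs  fread: %.4fs (%d chars)\n", $tChar, $tBuf, $n);
```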

(Note: don't assume that class is production-ready. It's meant as an illustration of a possible abstraction, may contain off-by-one or other errors. May cause blurred vision, lack of sleep, heart palpitations, or other side effects. Check with a doctor and unit testing before using.)

Answered Oct 27 '22 by Weston C