I was wondering if anyone out there knew how this could be done in PHP. I am running a script that opens a file, takes the first 1000 lines, does some stuff with those lines, and then opens another instance of itself to take the next thousand lines, and so on until it reaches the end of the file. I'm using SplFileObject so that I can seek to a certain line, which lets me break the work up into 1000-line chunks quite well. The biggest problem I'm having is with performance. I'm dealing with files that have upwards of 10,000,000 lines, and while the first 10,000 lines or so go quite fast, there is a huge slowdown after that point, which I think is simply down to having to seek that far into the file.
What I would like to do is read the first thousand lines, then just delete them from the file, so that my script is always reading the first thousand lines. Is there a way to do this without reading the rest of the file into memory? Other solutions I have seen involve reading every line into an array and then dropping the first X entries, but with ten million lines that would eat up far too much memory and time.
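To give a clearer picture, the approach looks roughly like this (simplified; the file name, the processing, and how the next instance gets launched are placeholders):

<?php
// Simplified sketch of the approach described above; the file name, the
// processing step, and how the next instance is launched are placeholders.
$file  = new SplFileObject('huge_file.txt');
$start = isset($argv[1]) ? (int) $argv[1] : 0;

$file->seek($start); // seek() has to walk the file line by line to get here

for ($i = 0; $i < 1000 && !$file->eof(); $i++) {
    $line = $file->current();
    // ... do some stuff with $line ...
    $file->next();
}

// Hand the next 1000 lines off to a fresh instance of this script
// exec('php ' . escapeshellarg(__FILE__) . ' ' . ($start + 1000));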
If anyone has a solution or other suggestions that would speed up the performance, it would be greatly appreciated.
Unfortunately there is no real solution to this, because the files are always loaded fully into main memory before they are read.
Still, I have posted this answer because it is a possible approach, though I suspect it will hardly improve performance. Correct me if I am wrong.
You can use XML to divide the file into units of 1000 lines, and use PHP's DOMDocument class to retrieve and append data. Append a child when you want to add data, retrieve the first child to get the first thousand lines, and delete that node once you are done with it. Like this:
<document>
<part>
. . .
Thousand lines here
. . .
</part>
<part>
. . .
Thousand lines here
. . .
</part>
<part>
. . .
Thousand lines here
. . .
</part>
.
.
.
</document>
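A rough sketch of what that could look like with DOMDocument (the file name and the processing step are just placeholders):

<?php
// Sketch only: load the XML container, take the first <part>, process its
// thousand lines, then delete that node and save the file back.
$doc = new DOMDocument();
$doc->load('data.xml');

$parts = $doc->getElementsByTagName('part');
if ($parts->length > 0) {
    $first = $parts->item(0);
    $lines = explode("\n", trim($first->nodeValue)); // the thousand lines

    // ... do some stuff with $lines ...

    $first->parentNode->removeChild($first); // delete the node
    $doc->save('data.xml');                  // write the file back
}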
ANOTHER WAY:
If you are really sure about breaking the data into sections of exactly 1000 lines, why not save it in a database, with each 1000-line chunk in a different row? That should reduce the file read/write overhead and improve performance.
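Something along these lines, as a sketch (the table, column, and connection details are made up for illustration):

<?php
// Sketch: store each 1000-line chunk as one row, then process rows one at a
// time later. Table/column names and credentials are made up.
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO chunks (lines) VALUES (?)');

$fh    = fopen('huge_file.txt', 'r');
$chunk = array();
while (($line = fgets($fh)) !== false) {
    $chunk[] = $line;
    if (count($chunk) === 1000) {
        $stmt->execute(array(implode('', $chunk)));
        $chunk = array();
    }
}
if ($chunk) {
    $stmt->execute(array(implode('', $chunk))); // flush the final partial chunk
}
fclose($fh);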
It seems to me that the objective is to parse a huge amount of data and insert it into a database. If so, I fail to understand why it's important to work with exactly 1000 lines at a time.
I think I would just approach it by reading a big chunk of data, say 1 MB, into memory at once, and then scanning backwards from the end of the in-memory chunk for the last line ending. Once I have that, I can save the file position and the extra data I have (whatever is left over between that last line ending and the end of the chunk) and prepend it to the next chunk. Alternatively, I can just reset the file pointer with fseek() to the point in the file right after that last line ending, which is easy to work out from strlen($chunk) and the position of the line ending within the chunk.
That way, all I have to do is explode the chunk by running explode("\r\n", $chunk) and I have all the lines I need, in a suitably big block for further processing.
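A minimal sketch of that idea (the 1 MB buffer size, the file name, and the "\r\n" line ending are assumptions):

<?php
// Sketch: read ~1 MB at a time, cut the buffer at the last complete line
// ending, and seek back so the partial trailing line is re-read next time.
$fh = fopen('huge_file.txt', 'r');

while (!feof($fh)) {
    $chunk = fread($fh, 1048576); // ~1 MB
    if ($chunk === '' || $chunk === false) {
        break;
    }

    $pos = strrpos($chunk, "\r\n"); // last complete line ending in the buffer
    if ($pos !== false && !feof($fh)) {
        // Move the file pointer back by the length of the partial last line
        fseek($fh, $pos + 2 - strlen($chunk), SEEK_CUR);
        $chunk = substr($chunk, 0, $pos);
    }

    $lines = explode("\r\n", $chunk);
    // ... process $lines ...
}
fclose($fh);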
Deleting lines from the beginning of the file is not recommended. That's going to shuffle a huge amount of data back and forth to disk.