 

Shortest and fastest way to parse PHP data

Tags: import, php, mysql

I have files I need to convert into a database. These files (over 100k of them) come from an old system (they were generated by a COBOL script). I am now part of the team that migrates data from this system to the new system.

Now, because we have a lot of files to parse (each file is 50 MB to 100 MB), I want to make sure I use the right method to convert them into SQL statements.

Most of the files follow this format:

#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n

The address2 field is optional and can be empty. The other common format is:

#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n

These are the two most common formats (together around 50% of the lines); the remaining lines look similar but carry different fields.

Now, my question is: what is the most efficient way to open and correctly parse these files?

asked Dec 06 '11 by zackaryka


2 Answers

Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can have it emit SQL statements that you can pipe straight into the mysql command-line client.

If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. It returns an array that you can map into your INSERT statement. One of the first user-provided examples in the PHP manual takes CSV input and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
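
For illustration, here is a minimal sketch of that approach (the file name, field counts, and the assumption that the trailing "#" arrives as its own field are mine, not from the question):

    <?php
    // Read tab-delimited lines one at a time; fgetcsv() splits each line
    // into a numeric array, one entry per field.
    $handle = fopen('clients.dat', 'r');
    if ($handle === false) {
        die("Cannot open input file\n");
    }
    while (($row = fgetcsv($handle, 0, "\t")) !== false) {
        // Distinguish the two formats by field count.
        if (count($row) === 9) {        // #id, name, address1 ... country, #
            // build the INSERT for the name/address format
        } elseif (count($row) === 11) { // #id, client, taxid, tagid ... country, #
            // build the INSERT for the client/taxid format
        }
    }
    fclose($handle);

Reading line by line like this keeps memory flat no matter how big the file is, which matters at 100 MB per file.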

If the #id in the first column is unique in your input data, then you should definitely make it the primary key in MySQL, to save you from duplicating data if you have to restart your batch.
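
A hedged sketch of what that could look like with PDO (the table name, columns, and credentials are placeholders, not from the question):

    <?php
    // INSERT IGNORE skips rows whose primary key already exists, so
    // re-running the batch after a crash will not duplicate data.
    $pdo = new PDO('mysql:host=localhost;dbname=migration', 'user', 'pass');
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO clients
             (id, name, address1, address2, city, state, zip, country)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?)'
    );
    // Example row as fgetcsv() would return it; drop the trailing "#" field.
    $row = ['1001', 'ACME Corp', '1 Main St', '', 'Springfield', 'IL', '62701', 'US', '#'];
    $stmt->execute(array_slice($row, 0, 8));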

answered by ghoti


When I worked on a project where we had to parse huge and complex log files (Apache, firewall, SQL), we got a big performance gain from preg_match_all() (it took less than 10% of the time required by the explode / trim / formatting approach).

Huge files (>100 MB) are parsed in 2 or 3 minutes on a Core 2 Duo (the drawback is that memory consumption is very high, since it creates a giant array with all the information ready to be processed).

Regular expressions also let you identify which variant each line is when there are variations within the same file.
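
As a rough sketch of that approach (the pattern, group names, and file name are my assumptions based on the formats in the question):

    <?php
    // Load the whole file at once: fast, but memory-hungry, as noted above.
    $data = file_get_contents('clients.dat');

    // Named groups capture the fields of the first format; to handle the
    // other variants in the same pass, add one alternative per format
    // with "|" and dispatch on which groups matched.
    $pattern = '/^#(?<id>[^\t]*)\t(?<name>[^\t]*)\t(?<addr1>[^\t]*)\t(?<addr2>[^\t]*)\t'
             . '(?<city>[^\t]*)\t(?<state>[^\t]*)\t(?<zip>[^\t]*)\t(?<country>[^\t]*)\t#$/m';

    preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // $m['id'], $m['name'], ... are ready to feed into an INSERT
    }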

But if your files are simple, try ghoti's suggestion (fgetcsv); it will work fine.

answered by Paulo H.