Right way of validating xlsx files before database inserting

While playing with PHPExcel I ran into some questions about how to properly handle validating values and inserting them into a database. I do not need any code, just the general concept of how to do it.

First, I iterate through the first row to check whether the columns match the expected ones (that is, whether the file fits the schema).

In the next step I read the rows, and each one is validated row- and column-wise as it is read. If a value's type doesn't match, I get an error.

While validating a row, I need to take the worker name and convert it to an id with get_worker_id().

Question #1: Is such a solution good practice? It will produce up to 100 queries, one for each row.

Question #2: I also need to validate the rows a second time. I would take the worker_id and the F and G columns and check that such a record isn't already present in the database. I would simply introduce a function similar to get_worker_id(), but it would return true/false depending on whether the entry exists.

But again, is this the proper way of doing it? By a rough calculation my method would produce 100 selects (get_worker_id), 100 selects (check whether the record exists) and 100 inserts (if all is OK).

I'm not sure whether I am doing it properly. Could you give me some advice?

Thanks in advance.

Model for handling the xlsx file.

class Gratyfikant_model extends CI_Model {

    private $_limit = 100;

    const columns = array(
        'A' => "Z",
        'B' => "KS",
        'C' => "G",
        'D' => "S",
        'E' => "Numer",
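        'F' => "Miesiąc", // required (month)
        'G' => "Data wypłaty", // required (pay date)
        'H' => "Pracownik", // required (worker)
        'I' => "Brutto duże", // required (gross pay)
        'J' => "ZUS pracownik", // required (social insurance, employee part)
        'K' => "ZUS pracodawca", // required (social insurance, employer part)
        'L' => "Do wypłaty", // required (net amount to pay out)
        'M' => "Obciążenie", // required (total employer cost)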
        'N' => "FW");
    const validators = array(
        'F' => 'date',
        'G' => 'date',
        'H' => 'string',
        'I' => 'float',
        'J' => 'float',
        'K' => 'float',
        'L' => 'float',
        'M' => 'float',
    );
    const validators_errors = array(
        'float' => "Wartość nie jest liczbą", // "The value is not a number"
        'string' => "Wartość nie jest poprawna", // "The value is not valid"
        'date' => "Wartość nie jest datą" // "The value is not a date"
    );

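    // F and G are marked as required in self::columns and are validated as dates, so their headers are checked too
    protected $_required = array(
        'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'
    );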
    private $_sheet = array();
    private $_sheet_pracownicy = array();
    private $_agregacja = array();
    protected $_invalid_rows = array();

    public function __construct() {
        parent::__construct();
    }

    public function read_data(array $dane) {
        if (count($dane) > $this->_limit) {
            throw new Exception('Limit wierszy to ' . $this->_limit);
        }
        $this->_sheet = $dane;
        return $this;
    }

    public function column_validation() {
        foreach ($this->_required as $r) {
            // isset() already fails for a missing key, so a separate array_key_exists() check is redundant
            if (!isset($this->_sheet[1][$r]) || $this->_sheet[1][$r] != self::columns[$r]) {
                throw new Exception('Kolumna - ' . $r . ' - Wartość nagłówka nie pasuje do szablonu, powinno być ' . self::columns[$r]);
            }
        }

        return $this;
    }

    public function validateDate($date) {
        $d = DateTime::createFromFormat('Y-m-d', $date);
        return $d && $d->format('Y-m-d') === $date;
    }

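    // $k = column letter, $a = row number, $v = validator name, $f = cell value
    private function row_validation($k, $a, $v, $f) {
        // start as invalid so an unknown validator name never leaves $cellval undefined
        $cellval = false;

        switch ($v) {
            case "date":
                $cellval = $this->validateDate(PHPExcel_Style_NumberFormat::toFormattedString($f, PHPExcel_Style_NumberFormat::FORMAT_DATE_YMD));
                break;
            case "float":
                // whole amounts such as 2000 may arrive as int, so accept both numeric types
                $cellval = is_int($f) || is_float($f);
                break;
            case "string":
                $cellval = is_string($f);
                break;
        }
        if (!$cellval) {
            $this->_invalid_rows[$a][$k] = $v;
        }
    }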

    public function get_sheet_data() {
        $dane = $this->_sheet;
        unset($dane[1]); // remove the header row

        $zus_pracownik = 0;
        $zus_pracodawca = 0;
        $zus_lacznie = 0;
        $do_wyplaty = 0;
        $obciazenie = 0;
        $brutto = 0;
        foreach ($dane as $a => $d) {
            foreach (self::validators as $k => $v) {
                $this->row_validation($k, $a, $v, $d[$k]); // returns nothing, so there is nothing to echo
            }
            if (!empty($d["H"])) { // empty() already covers null
                // $this->_sheet_pracownicy[$d["H"]]["numer"] = PHPExcel_Style_NumberFormat::toFormattedString($d["E"], PHPExcel_Style_NumberFormat::FORMAT_DATE_DDMMYYYY);
                $this->_sheet_pracownicy[] = array(
                    "pracownik" => $d["H"],
                    "miesiac" => PHPExcel_Style_NumberFormat::toFormattedString($d["F"], PHPExcel_Style_NumberFormat::FORMAT_DATE_YMD),
                    "data_wyplaty" => PHPExcel_Style_NumberFormat::toFormattedString($d["G"], PHPExcel_Style_NumberFormat::FORMAT_DATE_YMD),
                    "zus_pracownik" => $d["J"],
                    "zus_pracodawca" => $d["K"],
                    // explicit scale of 2: without it bcadd() uses the default scale (0 unless configured) and drops the decimals
                    "zus_lacznie" => bcadd($d["K"], $d["J"], 2),
                    "do_wyplaty" => $d["L"],
                    "obciazenie" => $d["M"],
                    "brutto" => $d["I"],
                    "id_prac" => $this->get_worker_id($d["H"]));

                // running totals, kept at 2 decimal places
                $zus_pracownik = bcadd($zus_pracownik, $d["J"], 2);
                $zus_pracodawca = bcadd($zus_pracodawca, $d["K"], 2);
                $zus_lacznie = bcadd($zus_lacznie, bcadd($d["K"], $d["J"], 2), 2);
                $do_wyplaty = bcadd($do_wyplaty, $d["L"], 2);
                $obciazenie = bcadd($obciazenie, $d["M"], 2);
                $brutto = bcadd($brutto, $d["I"], 2);
            }
        }
        $this->_agregacja = array(
            "zus_pracownik" => $zus_pracownik,
            "zus_pracodawca" => $zus_pracodawca,
            "zus_lacznie" => $zus_lacznie,
            "do_wyplaty" => $do_wyplaty,
            "obciazenie" => $obciazenie,
            "brutto" => $brutto
        );

        return $this;
    }

    public function display_result() {
        if (empty($this->_invalid_rows)) {
            return array(
                "wartosci" => $this->_sheet_pracownicy,
                "agregacja" => $this->_agregacja
            );
        }
    }

    public function display_errors() {
        foreach ($this->_invalid_rows as $k => $a) {
            foreach ($a as $key => $value) {
                throw new Exception('Pole ' . $key . '' . $k . ' ' . self::validators_errors[$value]);
            }
        }
        return $this;
    }

    public function get_worker_id($getAd) {

        $this->db->select('id_pracownika as id')
                ->from('pracownicy')
                ->like('CONCAT( imie,  \' \', nazwisko )', $getAd)
                ->or_like('CONCAT( nazwisko,  \' \', imie )', $getAd);

        $query = $this->db->get();
        $result = $query->result_array();
        if (isset($result[0]["id"])) {
            return $result[0]["id"];
        } else {
            throw new Exception('Nie odnaleziono ' . $getAd . ' w bazie danych, proszę dodać pracownika a następnie ponownie wczytać plik');
        }
    }

}

Display

try {
    $data['s'] = $this->gm
            ->read_data($sheetData)
            ->column_validation()
            ->get_sheet_data()
            ->display_errors()
            ->display_result();
} catch (Exception $e) {
    $data['ex'] = $e->getMessage();
}

XLSX file example

+---+---------------+---+---+------------+---------+--------------+-----------+-------------+---------------+----------------+------------+------------+--------+
| Z |      KS       | G | S |   Numer    | Miesiąc | Data wypłaty | Pracownik | Brutto duże | ZUS pracownik | ZUS pracodawca | Do wypłaty | Obciążenie |   FW   |
+---+---------------+---+---+------------+---------+--------------+-----------+-------------+---------------+----------------+------------+------------+--------+
|   | nieprzekazany | G |   | 03.08.2017 | sie.17  |   08.09.2017 | Worker1   |        2000 |         274,2 |          392,2 |    1459,48 |     2392,2 | (brak) |
|   | nieprzekazany | G |   | 03.08.2017 | sie.17  |   08.09.2017 | Worker2   |        1000 |         137,1 |          171,6 |     768,24 |     1171,6 | (brak) |
|   | nieprzekazany | G |   | 03.08.2017 | sie.17  |   08.09.2017 | Worker3   |        2000 |         274,2 |          392,2 |    1413,88 |     2392,2 | (brak) |
|   | nieprzekazany | G |   | 03.08.2017 | sie.17  |   08.09.2017 | Worker4   |        2000 |         274,2 |          392,2 |    1418,88 |     2392,2 | (brak) |
+---+---------------+---+---+------------+---------+--------------+-----------+-------------+---------------+----------------+------------+------------+--------+
asked Sep 08 '17 10:09 by Kavvson Empcraft


2 Answers

This really depends on the scale of your application and how frequently this Excel file will be imported. For example, if your application receives little to no traffic, then running several queries per row is not the end of the world. If you already have the server and database set up and running, you might as well make use of them. Conversely, if your application is under constant heavy load, then minimizing the number of queries you run may be a good idea.

Option 1

If your application is small and/or doesn't get much traffic then don't worry about the ~300 queries you need to make. MySQL is not fragile and if you have indexed your data well your queries will be very fast.

Option 2

Query the data you need up front and store it in memory so you can perform your logic checks in PHP.

This means for Question 1 you should get all of your workers in one query and then build a lookup array in PHP.

Here is a very rough example:

// Get all workers
SELECT worker_name, worker_id FROM workers;

// Build a lookup array from the query results
$worker_array = array(
   'Worker1' => 1,
   'Worker2' => 2,
   ...
);

// Then, as you loop over each row, check if the worker is in your lookup array
if ( ! isset($worker_array[$excel_row['worker_name']])) {
   // do something
}
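
To make that a bit more concrete, here is a minimal sketch of building the lookup from a single query result. The column names (id_pracownika, imie, nazwisko) are taken from the question's get_worker_id(); the helper names build_worker_lookup() and find_worker_id() are made up for illustration, and the hard-coded $rows array stands in for the query result. Note that this does an exact match on the full name, whereas the question's LIKE query also accepts partial matches, so treat the matching rule as an assumption to adapt.

```php
<?php
// Hypothetical helper: turn worker rows (as one SELECT would return them)
// into a "full name" => id lookup array.
function build_worker_lookup(array $rows): array
{
    $lookup = [];
    foreach ($rows as $row) {
        $first = trim($row['imie']);
        $last  = trim($row['nazwisko']);
        $id    = (int) $row['id_pracownika'];
        // Index both "first last" and "last first", because the question's
        // query accepts the name in either order.
        $lookup["$first $last"] = $id;
        $lookup["$last $first"] = $id;
    }
    return $lookup;
}

// Hypothetical helper: returns the worker id, or null when the name is unknown.
function find_worker_id(array $lookup, string $name): ?int
{
    // Collapse repeated spaces so "Anna  Nowak" still matches.
    $key = preg_replace('/ +/', ' ', trim($name));
    return $lookup[$key] ?? null;
}

// Stand-in for: SELECT id_pracownika, imie, nazwisko FROM pracownicy
$rows = [
    ['id_pracownika' => 1, 'imie' => 'Jan',  'nazwisko' => 'Kowalski'],
    ['id_pracownika' => 2, 'imie' => 'Anna', 'nazwisko' => 'Nowak'],
];
$lookup = build_worker_lookup($rows);
```

With this in place the 100 per-row selects collapse into one query plus an in-memory array access per row, and an unknown name (a null result) can be reported as a validation error for that row.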

Likewise for Question 2 you could get your unique data samples in one query (you don't need the entire record, just the unique fields). However, this may present a problem if you have a lot of unique data samples.
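
A rough sketch of that idea for Question 2: fetch only the unique fields once, turn them into a set keyed by a composite string, and test each spreadsheet row against it. The field names id_prac, miesiac and data_wyplaty come from the array keys in the question; record_key() and is_duplicate() are hypothetical helpers, and $existing_rows stands in for the query result. If the table is large, one way to keep the set small (an assumption about your data, not something I know) would be to restrict the query to the months that actually appear in the file.

```php
<?php
// Hypothetical helper: one string key built from the fields that make a record unique.
function record_key(int $workerId, string $month, string $payDate): string
{
    return $workerId . '|' . $month . '|' . $payDate;
}

// Hypothetical helper: true when the combination is already in the set.
function is_duplicate(array $existing, int $workerId, string $month, string $payDate): bool
{
    return isset($existing[record_key($workerId, $month, $payDate)]);
}

// Stand-in for a single query selecting only id_prac, miesiac, data_wyplaty.
$existing_rows = [
    ['id_prac' => 1, 'miesiac' => '2017-08-01', 'data_wyplaty' => '2017-09-08'],
    ['id_prac' => 2, 'miesiac' => '2017-08-01', 'data_wyplaty' => '2017-09-08'],
];

// Use the keys of an array as a set: isset() lookups are constant time.
$existing = [];
foreach ($existing_rows as $r) {
    $existing[record_key((int) $r['id_prac'], $r['miesiac'], $r['data_wyplaty'])] = true;
}
```

Even with this check in PHP, a UNIQUE index on the same three columns in the database is a sensible safety net, since another import could insert the same record between your check and your insert.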

Option 3

Create a temporary table in MySQL and import your Excel data without performing any logic checks. Then you can perform your logic checks entirely in SQL.

Here is a very rough example without knowing anything about your data structure:

-- Get all records in the Excel data that match unique data samples
SELECT
  *
FROM
  temporary_table tt
JOIN
  workers w
  ON w.worker_name=tt.worker_name
JOIN
  data d
  ON d.worker_id=w.worker_id
  AND d.col_f=tt.col_f
  AND d.col_g=tt.col_g

If there are no issues with the data then you can perform an INSERT from your temporary table into your data table. This limits your queries to the initial insert (you can batch this for better performance as well), the data check and the insert from temp to real data.
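
For the batched initial insert, a minimal sketch in PHP could look like the following. It only builds the SQL string and the flat parameter list; build_batch_insert() is a made-up helper, and the table and column names are the placeholder ones from the example above. Table and column names are interpolated directly, so they must come from your own code and never from the spreadsheet.

```php
<?php
// Hypothetical helper: build one multi-row INSERT with positional placeholders.
// Returns [sql, params] so it can be passed to any prepared-statement API.
function build_batch_insert(string $table, array $columns, array $rows): array
{
    if ($rows === [] || $columns === []) {
        throw new InvalidArgumentException('Need at least one column and one row');
    }
    // One "(?, ?, ?)" group per row.
    $group = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $group))
    );
    // Flatten the row values in the same column order as the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = $row[$column];
        }
    }
    return [$sql, $params];
}

[$sql, $params] = build_batch_insert(
    'temporary_table',
    ['worker_name', 'col_f', 'col_g'],
    [
        ['worker_name' => 'Worker1', 'col_f' => '2017-08-01', 'col_g' => '2017-09-08'],
        ['worker_name' => 'Worker2', 'col_f' => '2017-08-01', 'col_g' => '2017-09-08'],
    ]
);
```

With PDO this would be executed as $pdo->prepare($sql)->execute($params). Since the question uses CodeIgniter, its query builder's insert_batch() should achieve the same thing without hand-built SQL.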

Recap

It all comes down to your application. If you can get away with Option 1 and you've already got it implemented, then that's fine for now. You don't need to over-optimize things if you don't see this application growing like crazy.

However, if you are worried about scale and growth then I'd personally look at implementing Option 3.

answered Oct 05 '22 23:10 by Mike S


There are multiple concerns here:

Split import into stages.

  1. Validate headers. (Break importing if errors found)
  2. Iterate over each row.
  3. Validate row.
  4. Import if valid.
  5. Log error if any.
  6. Stop processing file if all rows were processed or go back to 2.
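
The stages above can be sketched roughly like this. run_import() and its two callbacks are hypothetical names, not part of PHPExcel or CodeIgniter, and the sheet is assumed to be an array keyed by row number with the header in row 1, as in the question. Unlike the question's model, which throws on the first bad cell, this version logs the errors per row and keeps going.

```php
<?php
// Hypothetical pipeline. $validateRow returns a list of error strings
// (empty when the row is fine); $importRow persists one valid row.
function run_import(array $sheet, array $expectedHeaders, callable $validateRow, callable $importRow): array
{
    // Stage 1: validate headers and abort the whole import on a mismatch.
    $headers = $sheet[1] ?? [];
    unset($sheet[1]); // keeps the original row numbers for error reporting
    foreach ($expectedHeaders as $column => $title) {
        if (($headers[$column] ?? null) !== $title) {
            throw new RuntimeException("Column $column should be \"$title\"");
        }
    }

    // Stages 2 to 6: validate each row, import it if valid, log errors otherwise.
    $report = ['imported' => 0, 'errors' => []];
    foreach ($sheet as $rowNumber => $row) {
        $errors = $validateRow($row);
        if ($errors === []) {
            $importRow($row);
            $report['imported']++;
        } else {
            $report['errors'][$rowNumber] = $errors;
        }
    }
    return $report;
}

// Usage with a tiny in-memory sheet; the second data row has a bad amount.
$sheet = [
    1 => ['H' => 'Pracownik', 'I' => 'Brutto duże'],
    2 => ['H' => 'Worker1', 'I' => 2000],
    3 => ['H' => 'Worker2', 'I' => 'abc'],
];
$imported = [];
$report = run_import(
    $sheet,
    ['H' => 'Pracownik', 'I' => 'Brutto duże'],
    function (array $row): array {
        return is_numeric($row['I']) ? [] : ['I: not a number'];
    },
    function (array $row) use (&$imported): void {
        $imported[] = $row['H'];
    }
);
```

Whether valid rows should still be imported when other rows fail is a business decision. If the file must be all-or-nothing, collect the errors first and only run the import stage when the error list is empty.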

Whether you need chunking depends on how much time and memory your script consumes. If you need it, it's as simple as reading X rows into memory and then processing them. At the extreme you can load each record separately. If you do not need it, just load everything into an array.

Chunking: consuming up to X rows in a single iteration, then clearing memory, then consuming the next chunk, and so on.
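
A minimal sketch of chunking with a generator, assuming the rows come from something iterable. in_chunks() is a made-up helper name. It only saves memory when the source itself is lazy (for example a reader that loads a range of rows at a time); if everything is already in one array, array_chunk() gives the same grouping with no memory benefit.

```php
<?php
// Hypothetical helper: yield rows in groups of at most $size,
// keeping only the current group in memory.
function in_chunks(iterable $rows, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }
    $chunk = [];
    foreach ($rows as $rowNumber => $row) {
        $chunk[$rowNumber] = $row; // keep row numbers for error messages
        if (count($chunk) === $size) {
            yield $chunk;
            $chunk = []; // release the processed rows
        }
    }
    if ($chunk !== []) {
        yield $chunk; // last, possibly smaller, chunk
    }
}

// Usage: process three rows two at a time.
$sizes = [];
foreach (in_chunks([2 => 'row2', 3 => 'row3', 4 => 'row4'], 2) as $chunk) {
    $sizes[] = count($chunk);
}
```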

answered Oct 05 '22 23:10 by przemo_li