
Special characters throwing off str_pad in php?

Tags: php

I'm writing a module that is supposed to be able to export transaction records in BankOne format.

Here is the specification of the format

Here is an example file

The fields occupy specific column ranges on each line, and records are separated by newlines. Lots of spaces need to be added to ensure that the fields start and end at specific points in the line.

I wrote a function in PHP for this. It takes the fields as parameters and should return a properly formatted record.

function record4($checknum='', $nameid='', $purpose='', $pledge='', $payment='', 
             $frequency='', $title='', $fname='', $lname='', $suffix='',
             $address='', $postalcode='', $city='', $state='', $greeting='')
{
$fields = array(
    'checknum' => array('length' => 8, 'start' => 37),
    'nameid' => array('length' => 7, 'start' => 45),
    'purpose' => array('length' => 5, 'start' => 52),
    'pledge' => array('length' => 10, 'start' => 57),
    'payment' => array('length' => 10, 'start' => 67),
    'frequency' => array('length' => 1, 'start' => 77),
    'title' => array('length' => 20, 'start' => 78),
    'fname' => array('length' => 40, 'start' => 98),
    'lname' => array('length' => 40, 'start' => 138),
    'suffix' => array('length' => 20, 'start' => 178),
    'address' => array('length' => 35, 'start' => 198),
    'postalcode' => array('length' => 10, 'start' => 233),
    'city' => array('length' => 28, 'start' => 243),
    'state' => array('length' => 5, 'start' => 271),
    'greeting' => array('length' => 40, 'start' => 276)
);

$str = '4';
foreach($fields as $field_name => $field)
{
    if($$field_name)
    {
        $str = str_pad($str, $field['start']-1, ' ');
        $str = $str.substr(trim((string)$$field_name), 0, $field['length']);
    }
}

return $str."\n";
}

It seems to work as intended, but when I looked at the output file I found this (scroll to the end):

4                                                                 1                              David                                   Landrum
4                                                                 3                              Hazel                                   Baker
4                                                                 3                              Jerome                                  Zehnder
4                                                                 1                              Víctor                               Nadales
4                                                                 2                              Philip                                  Nauert
4                                                                 1                              Jana                                    Ortcutter

The file contains 900 records pulled from a database, and all of them are formatted correctly except Víctor Nadales. After that first name, every other field is three spaces to the left of where it is supposed to be. The only anomalous thing about this record appears to be the 'Ã' in the first name.

The function is supposed to pad out the string to the proper length after each and every field it processes, yet it somehow gets fooled on this one line?

Can anyone tell me what is going on here?

EDIT: I just realized that whatever imports files of this format might not even support special UTF-8 characters. Therefore I added this line to my code:

$$field_name = iconv('UTF-8', 'ASCII//TRANSLIT', $$field_name);

The Ã comes out looking like this: ~A-. Not ideal, but at least the file is formatted properly now.
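A minimal sketch of that workaround wrapped in a helper, in case it is useful. The name to_ascii() is hypothetical, and the exact transliteration that iconv produces depends on the platform's iconv implementation and locale, so the sketch only checks that the result is plain ASCII rather than promising a particular spelling. It assumes the iconv and mbstring extensions are loaded.

```php
<?php
// Hypothetical helper (not from the original post): force a field to plain
// ASCII so that byte count and character count are the same.
function to_ascii(string $value): string
{
    // iconv() returns false when the conversion fails; @ silences the warning.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $value);
    if ($ascii === false) {
        // Fallback assumption: replace each run of non-ASCII bytes with "?".
        $ascii = preg_replace('/[^\x00-\x7F]+/', '?', $value);
    }
    return $ascii;
}

// "V\xC3\xADctor" is "Víctor" written with explicit UTF-8 bytes.
$name = to_ascii("V\xC3\xADctor");

echo $name, "\n";
echo strlen($name) === mb_strlen($name, 'UTF-8')
    ? "byte and character counts agree\n"
    : "still multi-byte\n";
```

Once every field is ASCII, str_pad and substr behave as the original function expects, at the cost of losing the accented characters.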

Peronix asked Aug 08 '12 19:08


2 Answers

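This is happening because the 'Ã' in that name is stored as a multi-byte sequence, and str_pad counts bytes rather than logical characters. A lone 'Ã' is 2 bytes in UTF-8, so a shift of three columns suggests the sequence takes 4 bytes for one visible character. That would be consistent with a double-encoded 'í': an 'Ã' followed by an invisible soft hyphen, 2 bytes each, which also matches the two-part "~A-" you got from iconv.

This is why you are missing three spaces: str_pad sees 4 single-byte characters where you see one.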
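The byte/character mismatch can be checked directly. The sketch below assumes the stored name really is the double-encoded form described above ('Ã' plus a soft hyphen); that byte sequence is a guess based on the symptoms, not something taken from the actual database. It also assumes the mbstring extension is available.

```php
<?php
// Assumption: "Ã" (U+00C3, bytes C3 83) followed by a soft hyphen
// (U+00AD, bytes C2 AD), written as escapes so the file encoding is irrelevant.
$name = "V\xC3\x83\xC2\xADctor";

$bytes = strlen($name);              // strlen() counts bytes
$chars = mb_strlen($name, 'UTF-8');  // mb_strlen() counts code points

echo "bytes: $bytes, characters: $chars\n";

// str_pad() pads to a byte length, so the padded field is 40 bytes long
// but holds fewer than 40 characters.
$padded = str_pad($name, 40, ' ');

echo 'padded bytes: ', strlen($padded), "\n";
echo 'padded characters: ', mb_strlen($padded, 'UTF-8'), "\n";
```

Under that assumption the field is 2 characters short, and because the soft hyphen is normally not displayed, it looks 3 columns short on screen.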

Try this function (credit here).

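<?php
// Note: PHP 8.3+ ships a native mb_str_pad() in mbstring, so only define this on older versions.
if (!function_exists('mb_str_pad')) {
    function mb_str_pad($input, $pad_length, $pad_string = ' ', $pad_type = STR_PAD_RIGHT)
    {
        // Extra bytes taken up by multi-byte characters (mb_strlen uses the internal encoding here).
        $diff = strlen($input) - mb_strlen($input);
        // str_pad counts bytes, so widen the target length by that difference.
        return str_pad($input, $pad_length + $diff, $pad_string, $pad_type);
    }
}
?>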
Gordon Bailey answered Sep 18 '22 17:09


Using Gordon's solution, you just have to pass the encoding to mb_strlen so that it counts the characters correctly (at least it worked for me).

Here is the function I used:

function mb_str_pad( $input, $pad_length, $pad_string = ' ', $pad_type = STR_PAD_RIGHT, $encoding="UTF-8") {
    $diff = strlen( $input ) - mb_strlen($input, $encoding);
    return str_pad( $input, $pad_length + $diff, $pad_string, $pad_type );
}

Credit for the idea here
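For illustration, here is a rough sketch of how the record-building loop from the question might use character-aware padding. Everything named here is hypothetical: pad_chars() is the same idea as the function above under a different name (to avoid clashing with the native mb_str_pad() in PHP 8.3+), build_record() takes an array instead of the long parameter list, and only two of the fields are included. It also swaps substr for mb_substr so truncation cannot cut a multi-byte character in half, and it skips only empty strings, which differs slightly from the original truthiness check.

```php
<?php
// Hypothetical helper: pad to a length measured in characters, not bytes.
function pad_chars(string $input, int $length, string $encoding = 'UTF-8'): string
{
    // Bytes beyond one per character; str_pad() must be told about them.
    $diff = strlen($input) - mb_strlen($input, $encoding);
    return str_pad($input, $length + $diff, ' ');
}

// Subset of the question's field layout, for brevity.
$fields = [
    'fname' => ['length' => 40, 'start' => 98],
    'lname' => ['length' => 40, 'start' => 138],
];

// Hypothetical array-based variant of record4().
function build_record(array $fields, array $values, string $encoding = 'UTF-8'): string
{
    $str = '4';
    foreach ($fields as $name => $field) {
        $value = trim((string) ($values[$name] ?? ''));
        if ($value === '') {
            continue; // skip empty fields (note: "0" is kept, unlike the original)
        }
        // Pad up to the column just before the field starts, counting characters.
        $str = pad_chars($str, $field['start'] - 1, $encoding);
        // Truncate by characters as well.
        $str .= mb_substr($value, 0, $field['length'], $encoding);
    }
    return $str . "\n";
}

$plain  = build_record($fields, ['fname' => 'David', 'lname' => 'Landrum']);
$accent = build_record($fields, ['fname' => "V\xC3\xADctor", 'lname' => 'Nadales']);

// Character offset (0-based) at which each surname starts.
echo mb_strpos($plain, 'Landrum', 0, 'UTF-8'), "\n";
echo mb_strpos($accent, 'Nadales', 0, 'UTF-8'), "\n";
```

Both surnames land at the same character offset, while the byte offset of the accented record is one higher. Whether that is what the importing system wants depends on whether it counts columns in characters or in bytes, which the format specification would have to settle.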

Felipe Balduino Cassar answered Sep 20 '22 17:09