Working with files and utf8 in PHP

Let's say I have a file called foo.txt, encoded in UTF-8:

aoeu  
qjkx
ñpyf

I want to get an array containing the lines of that file (one line per index) that consist only of the letters aoeuñpyf; lines containing any other letter should be left out.

I wrote the following code (also encoded as utf8):

$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
    $line=fgets($f);
    foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
        if(!in_array($letter,$allowed_letters)){
            $line="";
        }
    }
    if($line!=""){
        $lines[]=$line;
    }
}
fclose($f);

However, after that, the $lines array just has the aoeu line in it.
This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also, if I print a "ñ" read from the file, a question mark appears, but if I print a literal one with print "ñ"; it works.
How can I make it work?

asked Sep 26 '10 23:09

Gerardo Marset

2 Answers

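If you are running Windows, note that many editors there do not save files as UTF-8 by default but in a legacy code page such as Windows-1252 (cp1252). You need to either save the file as UTF-8 explicitly or pass each line through utf8_encode() before performing your check. (utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2.) For example: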

$line=utf8_encode(fgets($f));
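
If it is unclear which encoding the file actually uses, checking before converting may be safer than converting unconditionally, because running utf8_encode() on text that is already UTF-8 double-encodes it. Here is a minimal sketch, assuming the mbstring extension is available and that any input that is not valid UTF-8 is Windows-1252. The helper name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: return $line as UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $line): string
{
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($line, 'UTF-8', 'Windows-1252');
}

$from_cp1252 = to_utf8("\xF1pyf");     // "ñpyf" as Windows-1252 bytes
$from_utf8   = to_utf8("\xC3\xB1pyf"); // "ñpyf" already in UTF-8
// Both variables now hold the same UTF-8 byte sequence.
```

In the reading loop this would be used as $line=to_utf8(fgets($f));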

If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?

If everything is UTF-8, then this is what you need:

foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
   // ...
}

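(The u modifier makes the regex engine treat the pattern and the input as UTF-8, so the split happens between characters rather than between bytes.)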

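However, let me suggest a faster way to perform your check:

$allowed_letters=array("a","o","e","u","ñ","p","y","f");

$lines=array();
$f=fopen("foo.txt","r");
while(($line=fgets($f))!==false){
    $line=rtrim($line);
    // split into UTF-8 characters, not bytes (str_split would cut "ñ" in two)
    $letters=preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY);
    // keep the line only if it is non-empty and every character is allowed
    if ($letters && count(array_intersect($letters, $allowed_letters)) == count($letters)) {
        $lines[] = $line;
    }
}
fclose($f);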

(To allow spaces as well, add " " to $allowed_letters and replace rtrim($line) with rtrim($line, "\r\n"), so that only the line ending is stripped.)

answered Sep 24 '22 03:09

Yanick Rochon

In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input it splits up the first byte and the second byte into separate array items. Neither the first byte on its own nor the second byte on its own will match both bytes together as found in $allowed_letters, so it'll never match ñ.
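
The byte-versus-character difference described above can be checked directly. This is a small sketch rather than part of the original code; the letter is written as its two UTF-8 bytes so the result does not depend on how the script file itself is saved:

```php
<?php
$letter = "\xC3\xB1"; // "ñ" written as its two UTF-8 bytes

$byte_count  = strlen($letter); // 2: strlen counts bytes, not characters
$byte_pieces = preg_split("//", $letter, -1, PREG_SPLIT_NO_EMPTY);  // splits between bytes
$char_pieces = preg_split("//u", $letter, -1, PREG_SPLIT_NO_EMPTY); // splits between characters

// Without u there are two one-byte pieces, and neither equals "ñ".
$found_without_u = in_array($letter, $byte_pieces, true); // false
// With u there is a single piece, the whole character.
$found_with_u = in_array($letter, $char_pieces, true);    // true
```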

As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.

A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a u regex.

if(preg_match('/^[aoeuñpyf]+$/u', $line))
    $lines[]= $line;
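
Put together, the regex approach might look like the following self-contained sketch. The helper name lines_with_only is hypothetical, and the temporary file merely stands in for foo.txt. The character class is built with preg_quote so that letters with a special meaning inside a regex are escaped:

```php
<?php
// Hypothetical helper: return the lines of $path made up solely of the
// characters in $allowed (a UTF-8 string such as "aoeuñpyf").
function lines_with_only(string $path, string $allowed): array
{
    // The u modifier makes the class match whole UTF-8 characters.
    $pattern = '/^[' . preg_quote($allowed, '/') . ']+$/u';
    $lines = [];
    // FILE_IGNORE_NEW_LINES drops the line ending from each line.
    foreach (file($path, FILE_IGNORE_NEW_LINES) as $line) {
        if (preg_match($pattern, $line) === 1) {
            $lines[] = $line;
        }
    }
    return $lines;
}

// Usage: recreate the sample file in a temporary location.
$path = tempnam(sys_get_temp_dir(), 'foo');
file_put_contents($path, "aoeu\nqjkx\n\xC3\xB1pyf\n");
$result = lines_with_only($path, "aoeu\xC3\xB1pyf");
unlink($path);
// $result holds the two lines "aoeu" and "ñpyf".
```

Unlike the fgets() loop, the lines returned here no longer carry their trailing newline.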

answered Sep 22 '22 03:09

bobince

Tags: php, file-io, unicode, utf-8
