
How to convert a file to UTF-8 in PHP?

Is it possible to convert a file into UTF-8 on my end?

I have access to the file after submission through:

$_FILES['file']['tmp_name']

Note: The user can upload a CSV file in any charset; I usually encounter an unknown 8-bit charset.

I tried:

$row = array();
$datas = file($_FILES['file']['tmp_name']);
foreach($datas as $data) {
    $data = mb_convert_encoding($data, 'UTF-8');
    $row[] = explode(',', $data);
}

The problem is that this code removes special characters such as single quotes.

My first question: does htmlspecialchars remove the value inside the array?

I mention it as additional information. Thanks to anyone who can help!

Julio de Leon asked Oct 20 '17 04:10



2 Answers

Before you can convert it to UTF-8, you need to know which character set it is in. If you can't figure that out, there is no sane way to convert it to UTF-8. An insane way, if the encoding cannot be determined, is to simply strip any bytes that are not valid UTF-8; you might be able to use that as a fallback.

Warning: untested code (I'm suddenly in a hurry), but it may look something like this:

foreach ( $datas as $data ) {
    $encoding = guess_encoding ( $data );
    if (empty ( $encoding )) {
        // The encoding cannot be determined.
        // As a fallback, simply strip any bytes that are not valid UTF-8.
        // Obviously this is not a reliable conversion scheme and could probably be improved.
        // UTF-8 is used as the source so that valid sequences survive and only invalid bytes are dropped.
        $data = iconv ( "UTF-8", "UTF-8//IGNORE", $data );
    } else {
        $data = mb_convert_encoding ( $data, 'UTF-8', $encoding );
    }
    $row [] = explode ( ',', $data );
}
function guess_encoding(string $str): string {
    $blacklist = array (
            'pass',
            'auto',
            'wchar',
            'byte2be',
            'byte2le',
            'byte4be',
            'byte4le',
            'BASE64',
            'UUENCODE',
            'HTML-ENTITIES',
            '7bit',
            '8bit' 
    );
    $encodings = array_flip ( mb_list_encodings () );
    foreach ( $blacklist as $tmp ) {
        unset ( $encodings [$tmp] );
    }
    $encodings = array_keys ( $encodings );
    $detected = mb_detect_encoding ( $str, $encodings, true );
    return ( string ) $detected;
}
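
A note on the detection step, offered as a sketch and not as part of the answer's method: some single-byte encodings such as ISO-8859-1 accept any byte string, so detecting against nearly every encoding mbstring lists can return a match that is valid but wrong. If the uploads can be narrowed to a few likely encodings, trying them in priority order is more predictable. The helper name convert_with_candidates and the candidate list below are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: try candidate encodings in priority order and
// convert from the first one in which the bytes are valid.
function convert_with_candidates(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return mb_convert_encoding($bytes, 'UTF-8', $encoding);
        }
    }
    return null; // none of the candidates matched
}

// Assumed candidate list: UTF-8 first, then two common 8-bit encodings.
$candidates = array('UTF-8', 'Windows-1252', 'ISO-8859-1');
// "caf\xE9" is not valid UTF-8, so it is converted from Windows-1252.
$converted = convert_with_candidates("caf\xE9", $candidates);
```

Putting UTF-8 first matters: valid UTF-8 text is also valid in most 8-bit encodings, so checking those first would double-encode it.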
hanshenrik answered Oct 15 '22 22:10


Try this out.
The example comes from something I was doing in a test environment, so you might need to change the code slightly.

I had a text file containing the following data:

test
café
áÁÁÁááá
žžœš¥±
ÆÆÖÖÖasØØ
ß

Then I had a form that took a file input and ran the following code:

<?php
// Regroup PHP's per-attribute upload arrays into one array per uploaded file.
// Assumes every file input uses array syntax, e.g. name="file[]".
function neatify_files(&$files) {
    $tmp = array();
    foreach ($files as $field => $info) {
        foreach (array_keys($info["name"]) as $j) {
            foreach (array("name", "type", "tmp_name", "error", "size") as $attr) {
                $tmp[$field][$j][$attr] = $info[$attr][$j];
            }
        }
    }
    return $files = $tmp;
}

if (isset($_POST["submit"])) {
    neatify_files($_FILES);
    $file = $_FILES["file"][0];

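    $handle = fopen($file["tmp_name"], "r");
    // Compare against false so that a line containing only "0" does not end the loop.
    while (($line = fgets($handle)) !== false) {
        // Strict check: returns "UTF-8" if the line is valid UTF-8, false otherwise.
        $enc = mb_detect_encoding($line, "UTF-8", true);
        if ($enc === false) {
            // The real source encoding is unknown here; Windows-1252 is an assumed fallback.
            $line = iconv("Windows-1252", "UTF-8", $line);
        }
        // Escape before printing so file contents cannot inject markup.
        echo "<p>" . htmlspecialchars((string) $line, ENT_QUOTES, "UTF-8") . "</p>";
    }
    fclose($handle);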
}
?>
<form action="<?= $_SERVER["PHP_SELF"]; ?>" method="POST" enctype="multipart/form-data">
    <input type="file" name="file[]" />
    <input type="submit" name="submit" value="Submit" />
</form>

The function neatify_files is something I wrote to make the $_FILES array more logical in its layout.

The form is a standard form that simply POSTs the data to the server.
Note: Using $_SERVER["PHP_SELF"] is a security risk, see here for more.
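
As a hedged illustration of that note: one common mitigation is to HTML-escape the value before echoing it into the action attribute. The helper name escape_for_attribute and the sample path are hypothetical and only serve to show the effect.

```php
<?php
// Hypothetical helper: escape a value before placing it in an HTML attribute.
function escape_for_attribute(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// A request path with injected markup, as it might appear in $_SERVER["PHP_SELF"].
$phpSelf = '/upload.php/"><script>alert(1)</script>';
$action = escape_for_attribute($phpSelf);
```

In the form, the attribute would then be written as action="<?= escape_for_attribute($_SERVER["PHP_SELF"]); ?>".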

When the data is posted I store the file in a variable. Obviously, if you are using the multiple attribute your code won't look quite like this.

$handle is a read-only file handle (hence the "r" argument); the file is then read one line at a time with fgets.

$enc uses the mb_detect_encoding function to detect the encoding of each line.
At first I had trouble obtaining the correct encoding. Setting the encoding_list to only UTF-8 and setting strict to true fixed that.

If the encoding is UTF-8 I simply print the line; if it isn't, I convert it to UTF-8 using the iconv function.
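
To tie this back to the CSV case in the question, here is a minimal sketch that converts the whole upload first and then parses it with fgetcsv, so quoted fields containing commas are not split the way explode(',') would split them. The function name csv_rows_utf8 and the Windows-1252 fallback are assumptions; the fallback is a guess about non-UTF-8 input, not a detected value.

```php
<?php
// Hypothetical helper: read a CSV file of unknown encoding into UTF-8 rows.
function csv_rows_utf8(string $path, string $fallback = 'Windows-1252'): array
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Drop a UTF-8 byte order mark if present.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }
    // Convert only when the content is not already valid UTF-8.
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        $bytes = mb_convert_encoding($bytes, 'UTF-8', $fallback);
    }
    // Parse from an in-memory stream so fgetcsv handles quoted fields.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $bytes);
    rewind($stream);
    $rows = array();
    while (($fields = fgetcsv($stream, 0, ',', '"', '\\')) !== false) {
        if ($fields !== array(null)) { // fgetcsv returns array(null) for blank lines
            $rows[] = $fields;
        }
    }
    fclose($stream);
    return $rows;
}
```

With an upload, it could be called as csv_rows_utf8($_FILES['file']['tmp_name']). Checking the whole file at once, instead of line by line, avoids treating pure-ASCII lines and non-ASCII lines of the same file differently.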

JustCarty answered Oct 15 '22 22:10