Is it possible to convert a file to UTF-8 on my end?
I have access to the file after submission via
$_FILES['file']['tmp_name']
Note: The user can upload a CSV file in any charset; I usually encounter an unknown 8-bit charset.
I tried:
$row = array();
$datas = file($_FILES['file']['tmp_name']);
foreach ($datas as $data) {
    $data = mb_convert_encoding($data, 'UTF-8');
    $row[] = explode(',', $data);
}
The problem is that this code removes special characters such as the single quote.
My first question: does htmlspecialchars remove the value inside the array?
I've included that as additional information. Thanks to anyone who can help!
PHP UTF-8 encoding: modifications to your php.ini. The first thing you need to do is modify your php.ini file to use UTF-8 as the default character set: default_charset = "utf-8" (Note: You can subsequently use phpinfo() to verify that this has been set properly.)
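If editing php.ini isn't an option, the same directive can usually be set and checked at runtime. A minimal sketch, assuming the host allows ini_set() for this directive:

```php
<?php
// Set the default charset for this request only (php.ini is left untouched).
ini_set('default_charset', 'UTF-8');

// Read the value back; this is a lighter check than scanning phpinfo() output.
echo ini_get('default_charset'), "\n"; // UTF-8
```

Note that this only affects what PHP assumes by default; it does not convert any data that is already in another encoding.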
The utf8_encode() function is a built-in PHP function that encodes an ISO-8859-1 string to UTF-8; note that it is deprecated as of PHP 8.2. Unicode was developed to describe all possible characters of all languages and assigns one unique number to each symbol/character.
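A minimal sketch of the same ISO-8859-1 to UTF-8 conversion done with mb_convert_encoding(), so it does not rely on utf8_encode(). The helper name latin1_to_utf8 is made up for illustration, and the mbstring extension is assumed to be loaded:

```php
<?php
// Hypothetical helper: converts an ISO-8859-1 (Latin-1) string to UTF-8.
// Assumes the mbstring extension is available.
function latin1_to_utf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";             // "café" with é stored as the single Latin-1 byte 0xE9
$utf8   = latin1_to_utf8($latin1);

echo bin2hex($utf8), "\n";       // 636166c3a9 (é is now the two bytes C3 A9)
```

This only gives a correct result when the input really is ISO-8859-1; for any other 8-bit charset the bytes are converted but the characters come out wrong.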
All PHP string functions work well with UTF-8 encoded strings as long as the strings use only 7-bit ASCII characters (because the encoding of the first 128 characters is identical in ASCII and UTF-8).
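To make that concrete, here is a small sketch comparing a byte-oriented function (strlen) with its multibyte counterpart (mb_strlen) on the same UTF-8 string. The sample strings are just illustrations:

```php
<?php
$word = "caf\xC3\xA9";                      // "café" encoded as UTF-8 (é takes two bytes)

$byteLength = strlen($word);                // strlen counts bytes
$charLength = mb_strlen($word, 'UTF-8');    // mb_strlen counts characters

echo $byteLength, " bytes, ", $charLength, " characters\n"; // 5 bytes, 4 characters

// For pure 7-bit ASCII input both functions agree.
$ascii = "test";
var_dump(strlen($ascii) === mb_strlen($ascii, 'UTF-8'));    // bool(true)
```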
Before you can convert it to UTF-8, you need to know what character set it is in. If you can't figure that out, there is no sane way to convert it to UTF-8. However, an insane way to convert it, if the encoding cannot be determined, is to simply strip any bytes that aren't valid UTF-8; you might be able to use that as a fallback.
Warning: untested code (I'm suddenly in a hurry), but it may look something like this:
foreach ( $datas as $data ) {
    $encoding = guess_encoding ( $data );
    if (empty ( $encoding )) {
        // Encoding cannot be determined.
        // As a fallback, strip any bytes that aren't valid UTF-8.
        // Obviously this isn't a reliable conversion scheme, and how
        // //IGNORE treats bad input depends on the iconv implementation.
        $data = iconv ( "UTF-8", "UTF-8//IGNORE", $data );
    } else {
        $data = mb_convert_encoding ( $data, 'UTF-8', $encoding );
    }
    $row [] = explode ( ',', $data );
}
function guess_encoding(string $str): string {
    $blacklist = array (
        'pass',
        'auto',
        'wchar',
        'byte2be',
        'byte2le',
        'byte4be',
        'byte4le',
        'BASE64',
        'UUENCODE',
        'HTML-ENTITIES',
        '7bit',
        '8bit'
    );
    $encodings = array_flip ( mb_list_encodings () );
    foreach ( $blacklist as $tmp ) {
        unset ( $encodings [$tmp] );
    }
    $encodings = array_keys ( $encodings );
    $detected = mb_detect_encoding ( $str, $encodings, true );
    return ( string ) $detected;
}
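As an alternative for the strip-invalid-bytes fallback described in this answer, here is a sketch that stays within mbstring instead of iconv. The helper name strip_invalid_utf8 is hypothetical, and the approach (temporarily setting the substitute character to "none") is my assumption of a reasonable way to do it, not something the answer above prescribes:

```php
<?php
// Hypothetical helper: returns $text unchanged if it is already valid UTF-8,
// otherwise drops the bytes that cannot be decoded as UTF-8.
function strip_invalid_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    $previous = mb_substitute_character();   // remember the current global setting
    mb_substitute_character('none');         // drop undecodable bytes instead of writing "?"
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting for other code
    return $clean;
}

$dirty = "caf\xFF!";                         // 0xFF can never appear in valid UTF-8
$clean = strip_invalid_utf8($dirty);

var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```

Like the iconv version, this loses data: the stripped bytes were probably real characters in the original charset, so it should only be a last resort.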
Try this out.
The example I have used was something I was doing in a test environment, you might need to change the code slightly.
I had a text file with the following data in:
test
café
áÁÁÁááá
žžœš¥±
ÆÆÖÖÖasØØ
ß
Then I had a form which took a file input in and performed the following code:
function neatify_files(&$files) {
    $tmp = array();
    // Regroup each field from [key][index] into [index][key].
    // Expects inputs named with [] (e.g. name="file[]").
    foreach ($files as $field => $info) {
        foreach (array_keys($info["name"]) as $j) {
            foreach (array("name", "type", "tmp_name", "error", "size") as $key) {
                $tmp[$field][$j][$key] = $info[$key][$j];
            }
        }
    }
    return $files = $tmp;
}
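if (isset($_POST["submit"])) {
    neatify_files($_FILES);
    $file = $_FILES["file"][0];
    $handle = fopen($file["tmp_name"], "r");
    // Compare against false so a line containing only "0" does not end the loop.
    while (($line = fgets($handle)) !== false) {
        // With a single candidate and strict mode this returns "UTF-8" or false.
        $enc = mb_detect_encoding($line, "UTF-8", true);
        if ($enc === false) {
            // Not valid UTF-8. The real source encoding is unknown;
            // ISO-8859-1 is only an assumption, adjust it for your data.
            $line = iconv("ISO-8859-1", "UTF-8", $line);
        }
        // Escape before printing, since the file content is user input.
        echo "<p>" . htmlspecialchars($line, ENT_QUOTES, "UTF-8") . "</p>";
    }
    fclose($handle);
}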
?>
<form action="<?= $_SERVER["PHP_SELF"]; ?>" method="POST" enctype="multipart/form-data">
<input type="file" name="file[]" />
<input type="submit" name="submit" value="Submit" />
</form>
The function neatify_files is something I wrote to make the $_FILES array more logical in its layout.
The form is a standard form that simply POSTs the data to the server.
Note: Using $_SERVER["PHP_SELF"] is a security risk, see here for more.
When the data is posted I store the file in a variable. Obviously, if you are using the multiple attribute your code won't look quite like this.
$handle is a file handle opened for reading only (hence the "r" argument); the loop reads the file from it one line at a time with fgets.
$enc uses the mb_detect_encoding function to detect the encoding (duh).
At first I was having trouble obtaining the correct encoding. Setting the encoding_list to use only UTF-8 and setting strict to true fixed that.
If the encoding is UTF-8 then I simply print the line; if it isn't, I convert it to UTF-8 using the iconv function.
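The per-line approach above can also be wrapped in a small function and combined with CSV parsing, which is what the question is after. This is only a sketch: the helper name line_to_utf8 is hypothetical, the sample lines stand in for an uploaded file, and ISO-8859-1 as the fallback encoding is an assumption you would need to adjust for your users' files.

```php
<?php
// Hypothetical helper: returns $line as UTF-8.
// $fallback is an assumed source encoding for lines that are not valid UTF-8.
function line_to_utf8(string $line, string $fallback = 'ISO-8859-1'): string
{
    // Strict check against UTF-8 only: returns "UTF-8" or false.
    if (mb_detect_encoding($line, 'UTF-8', true) !== false) {
        return $line;
    }
    $converted = iconv($fallback, 'UTF-8', $line);
    if ($converted === false) {
        throw new RuntimeException("Could not convert line from $fallback");
    }
    return $converted;
}

// Sample input standing in for the lines of an uploaded CSV file.
// The second line is Latin-1: 0xE9 is é and 0xFC is ü.
$lines = ["name,city\n", "Ren\xE9,Z\xFCrich\n"];

$rows = [];
foreach ($lines as $line) {
    $utf8   = rtrim(line_to_utf8($line), "\r\n");
    $rows[] = str_getcsv($utf8, ',', '"', '\\'); // handles quoted fields, unlike explode()
}

print_r($rows);
```

Using str_getcsv instead of explode(',') is a design choice on my part: it keeps commas inside quoted fields from splitting a value in two.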