
Character wrongly encoded

I've got files containing the character ⌐ in their names.

These files are well handled under Linux (Apache/php):

$files = scandir($path);
echo json_encode($files);

file1⌐
file2⌐
file3⌐
file4⌐

Under Windows the names seem to be read as Windows-1252 by the file system, so I had to conditionally convert them to UTF-8 so that json_encode could work:

$files = scandir($path);
foreach ($files as $i => $file) {
    $files[$i] = mb_convert_encoding($file, 'UTF-8', 'Windows-1252');
}
echo json_encode($files);

Here is how they come out after the conversion:

file1¬
file2¬
file3¬
file4¬

Why is ⌐ getting converted to ¬, and how can I get the original character back?

Pierre de LESPINAY asked Mar 11 '16 10:03


1 Answer

Please try unpack('C*', $char) on the critical character ⌐ in the filenames returned by scandir(). You will notice that it is already the single byte 0xAC (which is ¬ in Windows-1252) before you convert anything.
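A minimal sketch of that check, in case it helps. The helper name hexBytes is made up for illustration, and the two sample strings are written as byte escapes so the result does not depend on how the script file itself is saved:

```php
<?php
// Hypothetical helper: render every byte of a string as hex.
function hexBytes($s)
{
    $out = array();
    // unpack('C*', ...) yields one unsigned byte value per array element.
    foreach (unpack('C*', $s) as $byte) {
        $out[] = sprintf('0x%02X', $byte);
    }
    return implode(' ', $out);
}

echo hexBytes("\xAC"), "\n";          // ¬ as one Windows-1252 byte: 0xAC
echo hexBytes("\xE2\x8C\x90"), "\n";  // ⌐ (U+2310) as UTF-8: 0xE2 0x8C 0x90
```

Running hexBytes() on a name straight out of scandir() shows which of the two forms you actually received.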

The reason is that scandir() uses an 8-bit ANSI API of Windows, which performs a substitution and returns a "closest matching character" for any character that is not in Windows-1252. You can observe the same behaviour in the text editor Notepad++: set it to ANSI and paste your ⌐ into it. It will show up as ¬ (and, funnily enough, it also changed in the clipboard when I tried it on my system).

What can you do? Well here are some options:

  1. Use shell_exec('dir /b') on Windows (I tested this, you get the original character)
  2. Assume that ¬ means ⌐ for filenames on Windows and just replace it back after the UTF-8 conversion
  3. Change your software system so the character ⌐ is no longer used in filenames
  4. Use some experimental PHP build that has the function stream_encoding and try the code below. (NB: stream_encoding is undefined, even with mbstring loaded, in the following official builds: 5.6.19 and 7.0.4)
$myContext = stream_context_create();
stream_encoding($myContext, 'UTF-8');
$files = scandir('./', SCANDIR_SORT_ASCENDING, $myContext);
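Option 1 could look roughly like the sketch below. The function names listNamesViaDir and parseDirOutput are hypothetical. The chcp 65001 prefix is an assumption on my part: it is meant to switch the console to UTF-8 so that the output of dir /b arrives as UTF-8, and I have not verified it across Windows versions. Without it, the output comes back in the console's own code page and would still need converting.

```php
<?php
// Hypothetical helper: split raw command output into non-empty lines.
function parseDirOutput($output)
{
    $lines = preg_split('/\r\n|\r|\n/', $output);
    $names = array();
    foreach ($lines as $line) {
        if ($line !== '') {
            $names[] = $line;
        }
    }
    return $names;
}

// Hypothetical helper: list names through cmd.exe instead of scandir().
// Windows only. Assumption: code page 65001 makes `dir /b` emit UTF-8.
function listNamesViaDir($path)
{
    $cmd = 'chcp 65001 >NUL & dir /b ' . escapeshellarg($path);
    // shell_exec() returns null on failure or empty output, hence the cast.
    return parseDirOutput((string) shell_exec($cmd));
}
```

Usage would be echo json_encode(listNamesViaDir($path)); in place of the scandir() call. Note that dir /b does not return the . and .. entries that scandir() does.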

While shell_exec should generally be avoided, I think option 1 is your best choice for now. In the long term you should go for option 3 if you can. I would not recommend option 4 (I have not tested it either), and I do not know enough about your scenario to tell whether option 2 is viable.
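If option 2 does turn out to be viable for you, a sketch might look like this. The function name restoreReversedNot is made up, and the whole thing rests on the assumption that no filename legitimately contains ¬, since every ¬ gets turned into ⌐:

```php
<?php
// Hypothetical helper for option 2: convert names to UTF-8, then map ¬ back to ⌐.
function restoreReversedNot($names)
{
    $fixed = array();
    foreach ($names as $name) {
        // Treat the bytes from the ANSI API as Windows-1252 and convert to UTF-8.
        $utf8 = mb_convert_encoding($name, 'UTF-8', 'Windows-1252');
        // Assumption: every ¬ (U+00AC, UTF-8 C2 AC) was originally ⌐ (U+2310, UTF-8 E2 8C 90).
        $fixed[] = str_replace("\xC2\xAC", "\xE2\x8C\x90", $utf8);
    }
    return $fixed;
}

// "file1\xAC" stands in for a name as scandir() returns it on Windows.
echo json_encode(restoreReversedNot(array("file1\xAC")));
```

The replacement has to run after the conversion, because before it the ¬ is still the single byte 0xAC rather than the two-byte UTF-8 sequence.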

Freitags answered Nov 13 '22 15:11
