 

Unzipping with ExtractToDirectory method distorts non-latin symbols

I have several folders with files, and some folder names contain non-Latin characters (Russian in my case). I compress these folders into a ZIP archive, "D:\test.zip", using Windows Explorer. Then I call

    ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result");

and it successfully unzips all the content, but every non-Latin character comes out garbled.

For example, instead of "D:\result\каскады\file.txt" I get "D:\result\Є бЄ ¤л\file.txt".

The default encoding of my system is windows-1251, which I verified by passing Encoding.GetEncoding("windows-1251") as the third parameter of ExtractToDirectory and getting the same result. I also tried UTF-8, but got different artifacts in the path ("D:\result\��᪠��\file.txt"). Trying Unicode gives me a message about an unsupported encoding.

When I create the same archive in code by calling

    ZipFile.CreateFromDirectory(@"D:\zipdata", @"D:\test.zip");

everything then unzips fine with the same line of code as at the top of the question, even without specifying an encoding.

The question is: how do I get the correct encoding from the archive so that I can pass it to ExtractToDirectory, given that in the real task the archive comes from an external source and I cannot rely on whether it was created 'by hand' or programmatically?

Edit
There is a question where non-Latin characters (Chinese) also cause problems, but there this fact was given as the resolution of the question, whereas it is exactly the problem in my situation.

Sam asked Sep 04 '15 16:09





1 Answer

There is no formally standardized ZIP specification. However, the de facto standard is the PKZIP "application note" document, which as of 2006 documents only code page 437 ("OEM United States") and UTF-8 as legal text encodings for file entries in the archive:

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).

In other words, it's a bug in any ZIP authoring tool to use any text encoding other than code page 437 or UTF-8. Based on your experience, it appears Windows Explorer has this bug. :(

Unfortunately, the "general purpose bit 11" is the only official mechanism for indicating the actual text encoding used in the archive, and this allows only for either the original 437 code page or UTF-8. Even this bit was not supported by .NET until .NET 4.5. In any case, even since then it has not been possible for .NET or any other ZIP-aware software to reliably determine a non-standard, unsupported encoding used for the file entry names in the archive.
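
To make the "general purpose bit 11" check concrete, here is a minimal sketch that reads the flag from the first local file header of an archive. It assumes the file starts directly with a local file header (signature 0x04034b50), which is not true for an empty archive or one with data prepended, and it inspects only the first entry even though each entry carries its own flags. The names ZipFlagProbe and FirstEntryDeclaresUtf8 are made up for illustration and are not part of .NET.

```csharp
using System;
using System.IO;
using System.Text;

static class ZipFlagProbe
{
    // Hypothetical helper: returns true when the first local file header
    // has general purpose bit 11 (mask 0x0800) set, i.e. the entry name
    // is declared to be UTF-8.
    public static bool FirstEntryDeclaresUtf8(Stream zip)
    {
        using (var reader = new BinaryReader(zip, Encoding.ASCII, leaveOpen: true))
        {
            // Local file header signature "PK\x03\x04", stored little-endian.
            if (reader.ReadUInt32() != 0x04034b50)
                throw new InvalidDataException("No local file header at the start of the stream.");

            reader.ReadUInt16();                // version needed to extract
            ushort flags = reader.ReadUInt16(); // general purpose bit flag
            return (flags & 0x0800) != 0;
        }
    }

    static void Main()
    {
        // The path is taken from the question.
        using (FileStream file = File.OpenRead(@"D:\test.zip"))
        {
            Console.WriteLine(FirstEntryDeclaresUtf8(file)
                ? "First entry declares UTF-8 (bit 11 set)."
                : "Bit 11 not set: the name is nominally code page 437, in practice possibly something else.");
        }
    }
}
```

If the bit is not set, the check tells you nothing about which legacy code page was really used. That is exactly the limitation described above.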

However, if the source machine used to create the archive is known and available, you can determine that machine's default OEM code page via the CultureInfo class. The following expression returns the OEM code page identifier of the machine where the expression is executed (assuming the process hasn't changed its current culture from the default, of course):

    System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage

This gives you the code page ID that can be passed to Encoding.GetEncoding(Int32) to retrieve an Encoding object. That object can then be passed to the appropriate ZipArchive constructor when opening an existing archive (or to the ZipFile.ExtractToDirectory overload used in the question), to ensure that the file entry names are decoded correctly.
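
As a rough sketch of that approach, the snippet below extracts an archive using an explicitly chosen code page. The value 866 is an assumption: it stands in for whatever OEMCodePage reports on the source machine, and it was picked because the garbled name in the question looks like code page 866 bytes read as windows-1251. ExtractWithCodePage is a hypothetical helper name. On .NET Framework the RegisterProvider line should be dropped (and a reference to System.IO.Compression.FileSystem added), while on .NET Core and .NET 5+ legacy code pages are unavailable without it.

```csharp
using System.IO.Compression;
using System.Text;

static class OemExtract
{
    // Hypothetical helper: extracts zipPath into destination, decoding
    // entry names with the given legacy code page.
    public static void ExtractWithCodePage(string zipPath, string destination, int codePage)
    {
        // Needed on .NET Core / .NET 5+ so that GetEncoding can resolve
        // legacy code pages such as 866.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding entryNameEncoding = Encoding.GetEncoding(codePage);
        ZipFile.ExtractToDirectory(zipPath, destination, entryNameEncoding);
    }

    static void Main()
    {
        // Assumption: 866 is the OEM code page of the machine that created
        // the archive; replace it with the value reported there.
        ExtractWithCodePage(@"D:\test.zip", @"D:\result", 866);
    }
}
```

According to the .NET documentation for this overload, the supplied encoding is applied only to entries whose UTF-8 flag is not set, so passing it should not break archives that were written correctly.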


If you are unable to retrieve the actual text encoding from the machine that is the origin of the archive, then you're stuck enumerating the encodings, trying each one until you find one that reports entry names in a legible format.
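
A minimal sketch of that trial-and-error loop follows. The candidate list is an arbitrary assumption and should be adapted to the locales you expect; EntryNameProbe and ReadEntryNames are hypothetical names. The code only prints the names under each encoding and leaves the judgement of "legible" to a human reader.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;
using System.Text;

static class EntryNameProbe
{
    // Hypothetical helper: lists entry names as decoded with one encoding.
    public static List<string> ReadEntryNames(string zipPath, Encoding encoding)
    {
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Read, encoding))
        {
            return archive.Entries.Select(entry => entry.FullName).ToList();
        }
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+ for legacy code pages; drop on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumption: a few common OEM and ANSI code pages to try.
        int[] candidates = { 437, 850, 866, 1251, 1252 };

        foreach (int codePage in candidates)
        {
            Encoding encoding = Encoding.GetEncoding(codePage);
            Console.WriteLine("--- code page {0} ({1}) ---", codePage, encoding.WebName);
            foreach (string name in ReadEntryNames(@"D:\test.zip", encoding))
                Console.WriteLine(name);
        }
    }
}
```

Depending on the console's own output encoding, correctly decoded names may still display badly, so writing the results to a UTF-8 text file can be a more reliable way to inspect them.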

As I understand it, Windows 8 and later can support the UTF-8 flag in the ZIP archive file. I haven't tried it, but it's possible that such versions of Windows also write archives using that flag. If so, that would (one hopes) mitigate the pain of the earlier Windows bug.


Finally, note that a custom tool could record the encoding in a special file entry placed in the archive itself. Of course, only that tool would be able to recognize the special file and use it to determine the correct encoding (the tool would have to open the archive twice: once to retrieve the file, and then a second time once it has determined the encoding). This is not an ideal solution and of course is no help for archives created by Windows Explorer. I mention it only for the sake of completeness.

Peter Duniho answered Sep 30 '22 13:09
