Creating a zip archive with Unicode filenames using Go's archive/zip

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    var (
        Path = os.Args[1]
        Name = os.Args[2]
    )

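    // Create the output archive before changing the working directory.
    File, err := os.Create(Name)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer File.Close()
    // Work from the parent directory so entry names start at the last path element.
    PS := strings.Split(Path, "\\")
    PathName := strings.Join(PS[:len(PS)-1], "\\")
    os.Chdir(PathName)
    Path = PS[len(PS)-1]
    Zip := zip.NewWriter(File)
    defer Zip.Close()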
    walk := func(Path string, info os.FileInfo, err error) error {
        if err != nil {
            fmt.Println(err)
            return err
        }
        if info.IsDir() {
            return nil
        }
        Src, err := os.Open(Path)
        if err != nil {
            return err
        }
        defer Src.Close()
        fmt.Println(Path)
        // The walked path is used directly as the zip entry name.
        FileName, _ := Zip.Create(Path)
        io.Copy(FileName, Src)
        Zip.Flush()
        return nil
    }
    if err := filepath.Walk(Path, walk); err != nil {
        fmt.Println(err)
    }
}

This is the layout of the directory I am zipping:

-----root
    |---2015-05(dir)
         |---中文.go
    |---package(dir)
    |---你好.go

When I zip this directory with the code above, the Chinese file names come out garbled. How can I fix this?

xichen asked Jan 08 '23 11:01


1 Answer

The problem is that, by default, the ZIP specification only allows entry names in the original IBM PC character set (Code Page 437), which covers ASCII but not Chinese characters. More specifically (source: APPENDIX D of the specification):

APPENDIX D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

Support for Unicode names was added later. It is signaled by a special bit, general purpose bit 11, also called the language encoding flag (EFS):

Section 4.4.4 - General purpose bit flag - Bit 11 - Language encoding flag (EFS). If this bit is set, the filename and comment fields for this file MUST be encoded using UTF-8.

APPENDIX D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).

The general purpose bit flag is exposed in Go as the Flags field of the FileHeader struct. Unfortunately, Go doesn't provide a method to set this bit, and by default it is 0.
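To make the bit arithmetic concrete, here is a minimal sketch of testing and setting bit 11 on a FileHeader's Flags value. The names utf8Flag, hasUTF8Flag and withUTF8Flag are made up for this illustration and are not part of archive/zip; only the Flags field itself comes from the package.

```go
package main

import (
	"archive/zip"
	"fmt"
)

// utf8Flag is general purpose bit 11 (the language encoding flag, EFS).
// The name is hypothetical; 1 << 11 is the same value as 0x800.
const utf8Flag = 1 << 11

// hasUTF8Flag reports whether bit 11 is set in a Flags value.
func hasUTF8Flag(flags uint16) bool {
	return flags&utf8Flag != 0
}

// withUTF8Flag returns flags with bit 11 set, leaving all other bits alone.
func withUTF8Flag(flags uint16) uint16 {
	return flags | utf8Flag
}

func main() {
	h := zip.FileHeader{Name: "你好.go"}
	fmt.Println(hasUTF8Flag(h.Flags)) // false: Flags is 0 in a fresh header

	h.Flags = withUTF8Flag(h.Flags)
	fmt.Printf("%#x %v\n", h.Flags, hasUTF8Flag(h.Flags)) // 0x800 true
}
```

Using a bitwise OR instead of assigning 0x800 outright keeps any other flag bits that were already set on the header.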

So the easiest way to add support for Unicode names is to simply set bit 11 to one. Instead of

FileName, _ := Zip.Create(Path)

Start your zip entry with:

// Flags 0x800 sets general purpose bit 11: the entry name is UTF-8.
h := &zip.FileHeader{Name: Path, Method: zip.Deflate, Flags: 0x800}
FileName, _ := Zip.CreateHeader(h)

The first line creates a FileHeader whose Flags field is set to 0x800 (bit 11), which declares that the file name is encoded as UTF-8. That matches what actually gets written: Go strings conventionally hold UTF-8, and their bytes are written to an io.Writer unchanged.
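As a way to check this without touching the file system, the following sketch writes one entry with a Chinese name into an in-memory archive and reads it back. The helpers createUTF8Entry and roundTrip, and the constant utf8NameFlag, are hypothetical names for this example. The UTF-8 validity check and the filepath.ToSlash call are additions of mine, on the assumption that names come from filepath.Walk on Windows, where they contain backslashes while zip entry names use forward slashes.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"path/filepath"
	"strings"
	"unicode/utf8"
)

// utf8NameFlag is general purpose bit 11 from the ZIP specification.
const utf8NameFlag = 0x800

// createUTF8Entry (hypothetical helper) starts a new entry whose name is
// declared as UTF-8. Names that are not valid UTF-8 are rejected, because
// the flag promises UTF-8 to whoever reads the archive.
func createUTF8Entry(zw *zip.Writer, name string) (io.Writer, error) {
	if !utf8.ValidString(name) {
		return nil, fmt.Errorf("entry name %q is not valid UTF-8", name)
	}
	h := &zip.FileHeader{
		Name:   filepath.ToSlash(name), // OS separators become forward slashes
		Method: zip.Deflate,
		Flags:  utf8NameFlag,
	}
	return zw.CreateHeader(h)
}

// roundTrip writes one entry into an in-memory archive, reads the archive
// back, and returns the stored name, flags, and content of that entry.
func roundTrip(name, content string) (string, uint16, string, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := createUTF8Entry(zw, name)
	if err != nil {
		return "", 0, "", err
	}
	if _, err := io.Copy(w, strings.NewReader(content)); err != nil {
		return "", 0, "", err
	}
	if err := zw.Close(); err != nil {
		return "", 0, "", err
	}

	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return "", 0, "", err
	}
	f := zr.File[0]
	rc, err := f.Open()
	if err != nil {
		return "", 0, "", err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	if err != nil {
		return "", 0, "", err
	}
	return f.Name, f.Flags, string(data), nil
}

func main() {
	name, flags, _, err := roundTrip("2015-05/中文.go", "package main\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s utf8=%v\n", name, flags&utf8NameFlag != 0)
}
```

In the walk function from the question, the same helper would replace the Zip.Create(Path) call. As far as I know, newer Go releases (1.10 and later, going by the release notes) set this bit automatically when a name is valid UTF-8 and contains non-ASCII characters, so setting it explicitly mainly matters on older toolchains.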

Note:

By doing this, UTF-8 file names are preserved in the archive, but not every zip reader or extractor supports the flag. For example, the built-in zip handler in Windows Explorer will not decode the names as UTF-8, whereas a more capable zip tool (e.g. SecureZip) will recognize the UTF-8 file names and extract them properly.

icza answered Jan 15 '23 02:01