
What's the difference between a string in the source code and a string read from a file?

There is a file named "dd.txt" on my disk. Its content is \u5730\u7406

Now, when I run this program:

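import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (FileInputStream fis = new FileInputStream("d:\\dd.txt")) {
            byte[] buffer = new byte[4096];
            int n;
            // Copy only the bytes actually read on each pass.
            while ((n = fis.read(buffer)) != -1) {
                baos.write(buffer, 0, n);
            }
        }
        String s1 = "\u5730\u7406";          // escapes translated by the compiler
        String s2 = baos.toString("utf-8");  // raw file text, nothing translated
        System.out.println("s1:" + s1 + "\n" + "s2:" + s2);
    }
}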

I get two different results:

s1:地理
s2:\u5730\u7406

Can you tell me why? And how can I read that file and get the same result as s1, in Chinese?

Paul Wang asked Jul 14 '15 07:07



2 Answers

When you write \u5730 in Java source code, the compiler interprets it as a single Unicode character (a Unicode escape). When you write the same text to a file, it's just 6 ordinary characters, because nothing is interpreting it. Is there a reason why you're not writing 地理 directly to the file?
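To make the difference concrete, here is a minimal sketch comparing the string the compiler produces from the escapes with the raw text that sits in the file. The class and method names are made up for illustration, and the doubled backslashes in the second literal simply stand in for the file's contents.

```java
class EscapeLengthDemo {
    // Hypothetical helper: what the compiler produces from the two escapes.
    static String fromSource() {
        return "\u5730\u7406";
    }

    // Hypothetical helper: the same text as it sits in the file. The doubled
    // backslashes keep the compiler from translating it.
    static String fromFile() {
        return "\\u5730\\u7406";
    }

    public static void main(String[] args) {
        System.out.println(fromSource().length());        // 2
        System.out.println(fromFile().length());          // 12
        System.out.println(fromFile().charAt(0) == '\\'); // true
    }
}
```

The first string holds two characters; the second holds twelve, starting with a backslash, which is exactly what the question's program prints as s2.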

If you want to read a file that contains Unicode escapes, you'll need to parse them yourself: strip the \u prefix and convert the four hex digits that follow into a character. If you control how the file is created, it's a lot easier to write the actual characters with a suitable encoding (e.g. UTF-8) in the first place, and under normal circumstances you should never come across files containing these escaped sequences.
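As a rough sketch of the "write real characters as UTF-8" route, the following writes the two characters to a file and reads them back with the same encoding. The class name, the `roundTrip` helper, and the temporary file (standing in for d:\dd.txt) are assumptions made for illustration, not part of the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Hypothetical helper: writes text as UTF-8 bytes to a temporary file,
    // reads the bytes back, and decodes them as UTF-8 again.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("dd", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s1 = "\u5730\u7406";
        String s2 = roundTrip(s1);
        System.out.println(s1.equals(s2)); // true
        System.out.println(s2.length());   // 2
    }
}
```

Because the file then holds the encoded characters themselves, no escape parsing is needed on the reading side. Whether the console displays them correctly is a separate matter that depends on its own encoding.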

Kayaman answered Sep 28 '22 16:09



In your Java code, the \uxxxx sequences are interpreted as Unicode escapes, so they are shown as Chinese characters. This happens only because the Java compiler translates them when it reads the source file. Nothing does that for text read from a file at runtime.

To obtain the same result, you have to do some parsing yourself:

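// trim() guards against a trailing newline in the file
String[] hexCodes = s2.trim().split("\\\\u");
for (String hexCode : hexCodes) {
    if (hexCode.isEmpty())
        continue; // split() yields an empty first piece before the first escape
    // Each remaining piece is four hex digits naming one UTF-16 code unit.
    int intValue = Integer.parseInt(hexCode, 16);
    System.out.print((char) intValue);
}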

(Note that this only works if every character in the file is written as a Unicode escape of the form \uxxxx.)
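If the file might mix escapes with ordinary text, one possible variation is to match only the escape sequences with a regular expression and copy everything else through unchanged. This is a sketch under that assumption; the class and method names are hypothetical, and it does not implement the full Java rules (for example, it does not treat a doubled backslash specially).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecoder {
    // A backslash, the letter u, then exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each matched escape with the char it
    // names and leaves all other text as it is.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes stand in for text read from the file.
        String s2 = "\\u5730\\u7406";
        System.out.println(decode(s2).equals("\u5730\u7406")); // true
        System.out.println(decode("abc \\u5730!").length());   // 6
    }
}
```

Characters outside the Basic Multilingual Plane would appear in such a file as two consecutive escapes (a surrogate pair), which this approach decodes into two chars that together form the character.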

Glorfindel answered Sep 28 '22 18:09
