Compare two strings that are lexicographically equivalent but not identical at the byte level

I am looking for a way to compare two Java strings that are lexicographically equivalent but not identical at the byte level.

More precisely, take the file name "baaaé.png". At the byte level, it can be represented in two different ways:

[98, 97, 97, 97, -61, -87, 46, 112, 110, 103] --> the "é" is encoded with 2 bytes

[98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103] --> the "é" is encoded with 3 bytes

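    // requires: import java.nio.charset.StandardCharsets;
    byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};
    byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103};

    // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
    // that new String(bytes, "UTF-8") declares.
    String st = new String(ch, StandardCharsets.UTF_8);
    String st2 = new String(ff, StandardCharsets.UTF_8);
    System.out.println(st);
    System.out.println(st2);
    System.out.println(st.equals(st2));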

Will generate the following output:

    baaaé.png
    baaaé.png
    false

Is there a way to do the comparison so that the equals method returns true?

You can use the Collator class with an appropriate strength to ignore differences such as how accent marks are encoded. This lets you compare the strings successfully.

In this case, a US locale and a TERTIARY strength are enough to make the strings compare as equal:

    // requires: import java.text.Collator; import java.util.Locale;
    Collator usCollator = Collator.getInstance(Locale.US);
    usCollator.setStrength(Collator.TERTIARY);
    System.out.println(usCollator.equals(st, st2));

This outputs:

    true
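A Collator's notion of equality depends on its locale, strength, and decomposition mode, so it can help to set all three explicitly. The following is only a sketch under some assumptions: the class name `CollatorCompare` and the helper `canonicalCollator` are made up for illustration, and turning on `CANONICAL_DECOMPOSITION` is a defensive choice of mine rather than something the snippet above requires. It also shows that a weaker strength treats more strings as equal than you may want for file names.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorCompare {

    // Hypothetical helper (not part of the JDK): builds a US-locale collator
    // with the given strength and canonical decomposition enabled, so that
    // canonically equivalent sequences are decomposed before comparison.
    static Collator canonicalCollator(int strength) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        String composed = "baaa\u00e9.png";    // "é" as a single code point
        String decomposed = "baaae\u0301.png"; // "e" + combining acute accent

        Collator tertiary = canonicalCollator(Collator.TERTIARY);
        System.out.println(tertiary.equals(composed, decomposed)); // true
        System.out.println(tertiary.equals("abc", "ABC"));         // false

        // PRIMARY strength ignores case differences as well,
        // which is probably too loose for comparing file names.
        Collator primary = canonicalCollator(Collator.PRIMARY);
        System.out.println(primary.equals("abc", "ABC"));          // true
    }
}
```

Keep in mind that collation is designed for language-aware sorting, so whether two strings count as equal can vary with the locale you pick.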

You can also use Java's Normalizer class (java.text.Normalizer) to convert between the different Unicode normalization forms. This transforms your strings, but once both are in the same form they will be identical, so you can use the standard string tools, such as equals, to compare them.
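As a rough illustration of the Normalizer approach, here is a minimal sketch that reuses the byte arrays from the question. The class name `NormalizedCompare` and the helper `equalsNormalized` are hypothetical names introduced for this example, and the choice of NFC is an assumption: NFD would work equally well as long as both strings go through the same form.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class NormalizedCompare {

    // Hypothetical helper (not part of the JDK): normalizes both inputs to
    // NFC, then compares the results with the ordinary String.equals.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Same bytes as in the question: precomposed "é" (U+00E9) versus
        // "e" followed by the combining acute accent (U+0301).
        byte[] ch = {98, 97, 97, 97, -61, -87, 46, 112, 110, 103};
        byte[] ff = {98, 97, 97, 97, 101, -52, -127, 46, 112, 110, 103};
        String st = new String(ch, StandardCharsets.UTF_8);
        String st2 = new String(ff, StandardCharsets.UTF_8);

        System.out.println(st.equals(st2));            // false
        System.out.println(equalsNormalized(st, st2)); // true
    }
}
```

Unlike a Collator, this comparison is not locale-sensitive and still distinguishes case and genuinely different accents, which is usually what you want for file names.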

Finally, you might want to take a look at the ICU (International Components for Unicode) project, which provides many tools for working with Unicode strings.

Tags: java, string, utf-8

Davz

Peter Elliott
