
Give an example of using Cyrillic in a Java regex

Tags: java, string, regex

How do I write a regex for a Cyrillic string? I want to use it roughly like this:

String.replaceAll("Кириллица","")

Of course it doesn't work. What should I do to make it work?

OK, I see that the method works in general, but it doesn't work for me. How can I find out why it has no effect in my case?

...

Hm, I tried s1 = s1.replaceAll("[\\p{InCyrillic}]", ""); on the string I receive through a socket. It works great: all Cyrillic characters disappear, including the word "Экзамен". But if I try s1=s1.replaceAll("Экзамен","") nothing happens.

However, s1=s1.replaceAll("Экзамен","") did work in the same program for a static string defined in the program itself. I guess the problem may be a wrong charset, but I still can't understand what I am doing wrong. The charset of the string is windows-1251. I tried experimenting with the charset in the program (it is a JSP now), using these calls:

System.setProperty("file.encoding", "windows-1251");
response.setCharacterEncoding("windows-1251"); 

I also tried converting the string from one charset to another. Nothing changes.

user1956641 asked Jan 15 '13 17:01



2 Answers

It might be clearer if you showed the result you get with the code from @Henry's answer. I suspect the issue is in the characters or the encoding. To check whether the string really consists of Cyrillic characters, you can use this code:

String s1 = "Экзaмен";
s1 = s1.replaceAll("[\\p{InCyrillic}]", "");
System.out.println(s1);

The code removes all Cyrillic characters, so whatever is left over shows you which characters are not Cyrillic or were decoded incorrectly.
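To see exactly which characters are involved, one option is to print every code point together with its Unicode script. The following is only a minimal sketch: the class name `CodePointDump` and the sample input are made up for illustration, and the word is written with Unicode escapes so the source file encoding cannot affect it.

```java
public class CodePointDump {
    // Returns one line per character: "U+XXXX SCRIPT".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb
                .append(String.format("U+%04X", cp))
                .append(' ')
                .append(Character.UnicodeScript.of(cp))
                .append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: the word from the question with a Latin 'a'
        // (U+0061) hidden among the Cyrillic letters.
        String suspicious = "\u042D\u043A\u0437a\u043C\u0435\u043D";
        System.out.print(describe(suspicious));
    }
}
```

Any line reporting LATIN inside a word that looks Cyrillic points to a look-alike character rather than an encoding problem.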

If your result is something like "a", "e", or "ae", your string contains Latin characters that look like Cyrillic ones, so you should replace using this regex:

 s1 = s1.replaceAll("Экз[aa]м[ee]н", "");

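Here each bracketed class contains a Cyrillic letter and its Latin look-alike: the first holds Cyrillic а (U+0430) and Latin a (U+0061), the second holds Cyrillic е (U+0435) and Latin e (U+0065), and so on.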
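The same idea can be written with Unicode escapes, so that the Cyrillic and Latin letters cannot be mixed up in the source file. This is a sketch under assumptions: the class `HomoglyphReplace` and the method `removeExam` are hypothetical names, and only the two look-alike pairs mentioned above are covered.

```java
import java.util.regex.Pattern;

public class HomoglyphReplace {
    // The word from the question, with two tolerant character classes:
    // [U+0430 or Latin a] and [U+0435 or Latin e].
    private static final Pattern EXAM = Pattern.compile(
            "\u042D\u043A\u0437[\u0430a]\u043C[\u0435e]\u043D");

    // Removes the word whether it uses Cyrillic or Latin look-alikes
    // in the two tolerant positions.
    static String removeExam(String s) {
        return EXAM.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input with a Latin 'a' inside the word.
        String mixed = "60,3\u042D\u043A\u0437a\u043C\u0435\u043D";
        System.out.println(removeExam(mixed)); // prints 60,3
    }
}
```

Other letters with Latin twins (for example к, м, н, о, р, с) would need their own classes if they can also appear in the input.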

If your result is still "Экзaмен" (nothing was removed), the issue is in the encoding, and I hope this link will help you:

How to determine if a String contains invalid encoded characters
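If encoding does turn out to be the cause, the usual place to fix it is where the bytes become a String, by naming the charset explicitly. The sketch below is an assumption about the setup, not the asker's actual code: the `wire` array stands in for bytes read from the socket, and `DecodeDemo` and `decode` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decodes raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        String word = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";
        // Stand-in for the windows-1251 bytes arriving over the socket.
        byte[] wire = word.getBytes(CP1251);

        String good = decode(wire, CP1251);
        String bad = decode(wire, StandardCharsets.ISO_8859_1);

        // Correctly decoded text matches the literal and is removed.
        System.out.println(good.replaceAll(word, "").isEmpty()); // prints true
        // Wrongly decoded text holds different characters, so nothing matches.
        System.out.println(bad.replaceAll(word, "").equals(bad)); // prints true
    }
}
```

With a real socket, the equivalent would be wrapping the input stream in an InputStreamReader constructed with the same charset. Setting file.encoding after the JVM has started generally does not change how bytes are decoded.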

Zhandos answered Sep 22 '22 21:09


Just tried this:

String s1 = "Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен";
String s2 = s1.replaceAll("Экзамен", "");
System.out.println(s2);

The output is:

Введение в специальность (Б.3.2.1-ПиКО)60,3

Henry answered Sep 25 '22 21:09