I'm trying to select just names from text like this (Slovak dump of Wikipedia):
|Meno = Hans Joachim
|Plné meno = Aristoteles (???????????)
|Plné meno = Francis Bacon
|Plné meno = Sokrates ({{Cudzojazyčne|grc|????????|pc=n}})
|Meno = Svätý František z Assisi <br /> ''(Giovanni Battista Bernardone)''
|Meno = Friedrich Ludwig Gottlob Frege
|Meno = Adam František Kollár (Kolárik)
|meno = [[J. Edgar Hoover|John Edgar Hoover]]
|meno = [[Benedikt XIV. (1740 – 1758)|Benedikt XIV.]]
|meno = [[Milan Rastislav Štefánik|Milan Rastislav Štefánik]]
|Meno = '''Ján Filc'''
|Meno = Jean le Rond d'Alembert
Output should be like:
Hans Joachim
Aristoteles
Francis Bacon
Sokrates
Svätý František z Assisi
Friedrich Ludwig Gottlob Frege
Adam František Kollár (Kolárik)
J. Edgar Hoover|John Edgar Hoover
Benedikt XIV. (1740 – 1758)|Benedikt XIV.
Milan Rastislav Štefánik|Milan Rastislav Štefánik
Ján Filc
Jean le Rond d'Alembert
When the name is written cleanly, this regular expression works fine: = *(.*?)$
But when the line contains things like "(???????????)", HTML tags, or something between "{{" and "}}", I cannot select the name without the unwanted substring.
I tried a lot of options on this regex tester page (http://regex101.com/r/gS8iQ9/1), but none of them worked.
In my Java code I'm using:
Pattern pattern = Pattern.compile("= *(.*?)$");
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
    String foundSubstring = matcher.group(1);
    ...
}
Thanks for any help or suggestions on how to select the text after "=" without the question marks, HTML code and so on.
Your regex was almost right, but your input is a bit tricky to work with. You can do it in one line:
String name = line.replaceAll(".*?=[\\[ ']*([\\p{L}0-9|'. ()–]+[\\p{L}.)]).*", "$1");
See live demo
I have tested this and it produced your desired output given your sample input.
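If you would rather keep the Pattern/Matcher style from the question, the same expression can be compiled once and reused. The following is only a minimal sketch: the class name NameExtractor and the method extractName are hypothetical, and the en dash is written as \u2013 purely to keep the source file ASCII. One difference worth noting is that replaceAll returns the line unchanged when nothing matches, whereas this version returns an empty Optional.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameExtractor {
    // Same expression as the one-liner above, compiled once and reused.
    //   .*?=          skip everything up to the first "="
    //   [\[ ']*       skip leading spaces, "[[" and wiki bold quotes
    //   ( ... )       group 1: letters, digits, "|", "'", ".", space,
    //                 parentheses and the en dash (\u2013), ending in a
    //                 letter, "." or ")"
    //   .*            swallow whatever follows (tags, "{{...}}", "]]")
    private static final Pattern NAME = Pattern.compile(
        ".*?=[\\[ ']*([\\p{L}0-9|'. ()\u2013]+[\\p{L}.)]).*");

    // Hypothetical helper: returns the name, or empty if the line does not match.
    static Optional<String> extractName(String line) {
        Matcher m = NAME.matcher(line);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "|Meno = Hans Joachim",
            "|Meno = Aristoteles (???????????)",
            "|meno = [[J. Edgar Hoover|John Edgar Hoover]]",
            "|Meno = Jean le Rond d'Alembert",
            "a line without a name field"
        };
        for (String line : samples) {
            System.out.println(extractName(line).orElse("<no match>"));
        }
    }
}
```

Because the character class is a whitelist, any name containing a character outside it (a hyphen or a comma, for example) is cut off at that character, so the class may need extending for the full dump.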