I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.
Example of input:
Hello.World
This.Is.A.Test
The.S.W.A.T.Team
S.W.A.T.
s.w.a.t.
2001.A.Space.Odyssey
Wanted output:
Hello World
This Is A Test
The SWAT Team
SWAT
swat
2001 A Space Odyssey
Essentially, I'd like to create something that's capable of splitting strings by dots, but at the same time handles abbreviations.
My definition of an abbreviation is something that has at least two characters (casing irrelevant) and two dots, i.e. "A.B." or "a.b.". It shouldn't work with digits, i.e. "1.a.".
I've tried all kinds of things with regex, but it isn't exactly my strong suit, so I'm hoping that someone here has any ideas or pointers that I can use.
How about first removing the dots that need to disappear with a regex, and then replacing the remaining dots with spaces? Written as a Java string literal, the regex can look like "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))".
String[] data = {
        "Hello.World",
        "This.Is.A.Test",
        "The.S.W.A.T.Team",
        "S.w.a.T.",
        "S.w.a.T.1",
        "2001.A.Space.Odyssey" };
for (String s : data) {
    // First remove the dots inside abbreviations, then turn the rest into spaces.
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}
Result:
Hello World
This Is A Test
The SWAT Team
SwaT
SwaT 1
2001 A Space Odyssey
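One detail the printed result hides: for an input that ends in a dot, such as "S.w.a.T." or the "S.W.A.T." from the question, the last dot has no letter after it, so the regex leaves it in place and replace('.', ' ') turns it into a trailing space. A minimal sketch of one way to handle that, assuming a trailing space is unwanted, is to wrap both steps in a helper and call trim() at the end. The class name DotSplitter and the method name clean are made up for this illustration.

```java
import java.util.regex.Pattern;

public class DotSplitter {
    // Same regex as in the snippet above, compiled once for reuse.
    private static final Pattern ABBREVIATION_DOT =
            Pattern.compile("(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))");

    // Hypothetical helper: removes the dots inside abbreviations, turns the
    // remaining dots into spaces, then trims the result
    // (assumption: leading/trailing spaces are not wanted).
    static String clean(String input) {
        return ABBREVIATION_DOT.matcher(input)
                .replaceAll("")
                .replace('.', ' ')
                .trim();
    }

    public static void main(String[] args) {
        String[] data = { "The.S.W.A.T.Team", "S.W.A.T.", "s.w.a.t.", "2001.A.Space.Odyssey" };
        for (String s : data) {
            // Brackets make any leftover whitespace visible.
            System.out.println("[" + clean(s) + "]");
        }
    }
}
```

Without the trim() call, "S.W.A.T." would come out as "SWAT " with a space at the end.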
In the regex I needed to escape the special meaning of the dot character. I could do it with \\. but I prefer [.].
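To see why the escaping matters, here is a small sketch comparing an unescaped dot with the two escaped forms. The class and method names are invented for the example.

```java
public class DotEscapeDemo {
    // Unescaped: "." matches any character (except line terminators by default).
    static String removeAnyChar(String s) {
        return s.replaceAll(".", "");
    }

    // Escaped with a backslash; in a Java string literal it must be doubled.
    static String removeDotBackslash(String s) {
        return s.replaceAll("\\.", "");
    }

    // Escaped by putting the dot inside a character class.
    static String removeDotCharClass(String s) {
        return s.replaceAll("[.]", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + removeAnyChar("a.b") + "]");  // prints []
        System.out.println(removeDotBackslash("a.b"));         // prints ab
        System.out.println(removeDotCharClass("a.b"));         // prints ab
    }
}
```

Both escaped forms behave the same here, so choosing between them is a matter of readability.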
So at the center of the regex we have a dot literal. This dot is surrounded by (?<=...) and (?=...). These are parts of the look-around mechanism, called look-behind and look-ahead.
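Look-arounds only check the surrounding text without consuming it, which matters here because neighbouring dots in an abbreviation share the letter between them. The sketch below illustrates this with a deliberately simplified pattern ([a-zA-Z] on each side instead of the full regex). The class and method names are made up for the example.

```java
public class LookAroundDemo {
    // Consuming version: the letters are part of the match, so a letter
    // used by one match is no longer available to the next one.
    static String consuming(String s) {
        return s.replaceAll("([a-zA-Z])[.]([a-zA-Z])", "$1$2");
    }

    // Look-around version: only the dot is matched; the letters are merely
    // checked, so every dot can use its neighbours as context.
    static String lookAround(String s) {
        return s.replaceAll("(?<=[a-zA-Z])[.](?=[a-zA-Z])", "");
    }

    public static void main(String[] args) {
        System.out.println(consuming("S.W.A.T"));   // prints SW.AT
        System.out.println(lookAround("S.W.A.T"));  // prints SWAT
    }
}
```

In the consuming version the matches are "S.W" and "A.T", so the dot between them survives. With look-arounds each dot is matched separately.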
A dot that needs to be removed has a dot (or the start of the data, ^) and then some non-whitespace character (\\S) that is also a non-digit (\\D) before it, so I can test for that with (?<=(^|[.])[\\S&&\\D])[.].
Likewise, a dot that needs to be removed has a non-whitespace, non-digit character followed by another dot (or the end of the data, $) after it, which can be written as [.](?=[\\S&&\\D]([.]|$)).
Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+..., can be replaced with [a-zA-Z] for only English letters, or \\p{IsAlphabetic} for all letters in Unicode.
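As a rough illustration of how the three character classes differ, the sketch below builds the regex from a given "letter" class and runs it on two made-up inputs. The class name, the method name and the sample strings are assumptions for this example only.

```java
public class LetterClassDemo {
    // Hypothetical helper: plugs the given character class into the regex
    // in place of [\\S&&\\D], removes the matched dots, then turns the
    // remaining dots into spaces.
    static String clean(String letterClass, String input) {
        String regex = "(?<=(^|[.])" + letterClass + ")[.](?=" + letterClass + "([.]|$))";
        return input.replaceAll(regex, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        String symbols = "A.-.B";
        // [\\S&&\\D] treats "-" like a letter, so both dots are removed.
        System.out.println(clean("[\\S&&\\D]", symbols));  // prints A-B
        // [a-zA-Z] does not, so both dots become spaces.
        System.out.println(clean("[a-zA-Z]", symbols));    // prints A - B

        String umlaut = "\u00DC.S.A.Team";  // starts with U+00DC, a non-ASCII letter
        // [a-zA-Z] does not count U+00DC as a letter, so the first dot stays.
        System.out.println(clean("[a-zA-Z]", umlaut));
        // \\p{IsAlphabetic} does, so the whole abbreviation is joined.
        System.out.println(clean("\\p{IsAlphabetic}", umlaut));
    }
}
```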