
Split string with "." (dot) while handling abbreviations

Tags: java, regex

I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.

Example of input:

Hello.World

This.Is.A.Test

The.S.W.A.T.Team

S.W.A.T.

s.w.a.t.

2001.A.Space.Odyssey

Wanted output:

Hello World

This Is A Test

The SWAT Team

SWAT

swat

2001 A Space Odyssey

Essentially, I'd like to create something that's capable of splitting strings by dots, but at the same time handles abbreviations.

My definition of an abbreviation is something that has at least two characters (casing irrelevant) and two dots, i.e. "A.B." or "a.b.". It shouldn't work with digits, i.e. "1.a.".

I've tried all kinds of things with regex, but it isn't exactly my strong suit, so I'm hoping that someone here has any ideas or pointers that I can use.

Asked by Michell Bak, Jun 13 '13 23:06


1 Answer

How about first using a regex to remove the dots that should disappear, and then replacing the remaining dots with spaces? The regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)) (written here as a Java string literal, so each backslash is doubled).

String[] data = { 
        "Hello.World", 
        "This.Is.A.Test", 
        "The.S.W.A.T.Team",
        "S.w.a.T.", 
        "S.w.a.T.1", 
        "2001.A.Space.Odyssey" };

for (String s : data) {
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}

Result:

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey
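
Inputs that end in a dot, such as "S.w.a.T.", come out with a trailing space, because the final dot has no character after it and so is turned into a space by replace. The question's wanted output has no trailing space, so one option is to trim the result. This is only a sketch; the class and method names are made up for illustration.

```java
final class TrimmedDotSplit {
    // Same two steps as above, plus trim() to drop the space left by a trailing dot.
    static String splitOnDots(String input) {
        return input
                .replaceAll("(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
                .replace('.', ' ')
                .trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + splitOnDots("S.W.A.T.") + "]");         // [SWAT]
        System.out.println("[" + splitOnDots("s.w.a.t.") + "]");         // [swat]
        System.out.println("[" + splitOnDots("The.S.W.A.T.Team") + "]"); // [The SWAT Team]
    }
}
```

Note that trim() also removes any leading whitespace, which should not matter for inputs like the ones in the question.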

In the regex I needed to escape the special meaning of the dot character. I could do that with \\. but I prefer [.].
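
To illustrate why the dot has to be escaped at all, here is a small sketch comparing the escaped forms with an unescaped dot. The class and method names are hypothetical; Pattern.quote is a third standard way to get a literal dot.

```java
import java.util.regex.Pattern;

final class DotEscapeDemo {
    // Replaces every match of dotRegex in input with a single space.
    static String dotsToSpaces(String input, String dotRegex) {
        return input.replaceAll(dotRegex, " ");
    }

    public static void main(String[] args) {
        String s = "A.B";
        System.out.println(dotsToSpaces(s, "\\."));               // A B
        System.out.println(dotsToSpaces(s, "[.]"));               // A B
        System.out.println(dotsToSpaces(s, Pattern.quote(".")));  // A B
        // An unescaped dot matches any character, so all three characters are replaced.
        System.out.println("[" + dotsToSpaces(s, ".") + "]");     // [   ]
    }
}
```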

So at the center of the regex we have a literal dot. This dot is surrounded by (?<=...) and (?=...), the two parts of the look-around mechanism called look-behind and look-ahead.

  • A dot that needs to be removed is preceded by a dot (or the start of the data, ^) followed by a single character that is both non-whitespace (\\S) and non-digit (\\D). That can be tested with (?<=(^|[.])[\\S&&\\D])[.].

  • A dot that needs to be removed is also followed by a non-whitespace, non-digit character and then another dot (or the end of the data, $), which can be written as [.](?=[\\S&&\\D]([.]|$)).
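
Because look-behind and look-ahead only check the surrounding text without consuming it, each match is just the single dot. A rough sketch that lists where the regex matches (the class and method names are invented for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // The same regex as in the answer, compiled once.
    static final Pattern INNER_DOT =
            Pattern.compile("(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))");

    // Returns the index of every dot that the regex would remove.
    static List<Integer> innerDotPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = INNER_DOT.matcher(input);
        while (m.find()) {
            positions.add(m.start()); // each match spans exactly one character: the dot
        }
        return positions;
    }

    public static void main(String[] args) {
        // Only the dots inside "S.W.A.T" match, not the ones next to "The" and "Team".
        System.out.println(innerDotPositions("The.S.W.A.T.Team")); // [5, 7, 9]
        System.out.println(innerDotPositions("Hello.World"));      // []
    }
}
```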


Besides letters, [\\S&&\\D] also matches characters like !@#$%^&*()-_=+ and so on. Depending on your needs, it can be replaced with [a-zA-Z] to allow only English letters, or with \\p{IsAlphabetic} to allow letters from any script in Unicode.
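
A sketch of how the choice of character class changes the result. The clean helper, the constant names, and the extra sample inputs are assumptions made for this example, not part of the original code.

```java
final class CharClassVariants {
    static final String NON_SPACE_NON_DIGIT = "[\\S&&\\D]";
    static final String ENGLISH_LETTERS = "[a-zA-Z]";
    static final String UNICODE_LETTERS = "\\p{IsAlphabetic}";

    // Builds the regex from the given character class, removes the inner dots,
    // then turns the remaining dots into spaces.
    static String clean(String input, String charClass) {
        String innerDot = "(?<=(^|[.])" + charClass + ")[.](?=" + charClass + "([.]|$))";
        return input.replaceAll(innerDot, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        // Punctuation counts as an abbreviation character only for [\S&&\D].
        System.out.println(clean("#.!.Test", NON_SPACE_NON_DIGIT)); // #! Test
        System.out.println(clean("#.!.Test", ENGLISH_LETTERS));     // # ! Test

        // \u00c9 is an accented capital E: not in [a-zA-Z], but alphabetic in Unicode.
        String accented = "\u00c9.U.Test";
        System.out.println(clean(accented, ENGLISH_LETTERS).equals("\u00c9 U Test")); // true
        System.out.println(clean(accented, UNICODE_LETTERS).equals("\u00c9U Test"));  // true
    }
}
```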

Answered by Pshemo, Oct 14 '22 19:10