Java - Split a String which is alphanumeric

Tags: java, string, split

Example input:

RC23
CC23QQ21HD32
BPOASDf91A5HH123

Example output:

[RC,23]
[CC,23,QQ,21,HD,32]
[BPOASDf,91,A,5,HH,123]

The lengths of the alphabetic and numeric parts are not fixed.

I know how to use split() with a regex like //.' ' ' '([a-z]), but although I checked the split() Java API, I can't find anything that helps me solve this problem.

Is there a way to use split() to do this? Or do I need to use another method to split these strings?

Any help would be appreciated.

railgun210 asked Dec 20 '22 08:12


2 Answers

Try this regex: "((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))"

Here's a running example: http://ideone.com/c02rmM

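import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String someString = "CC23QQ21HD32";
        // Split at every zero-width boundary between a letter and a digit (in either order).
        String regex = "((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))";
        System.out.println(Arrays.asList(someString.split(regex)));
        // outputs [CC, 23, QQ, 21, HD, 32]
    }
}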

The regex uses lookaheads (?=ValueToMatch) and lookbehinds (?<=ValueToMatch).

The first half of it (before the | ) asks: "Is the previous character a letter (?<=[a-zA-Z])? Is the next character a digit (?=[0-9])?" If both are true, the regex matches at that position, i.e. at the gap between the two characters.

The second half does it the other way around. It asks: "Is the previous character a digit (?<=[0-9])? Is the next character a letter (?=[a-zA-Z])?", and again it matches if both are true.
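To make the two halves concrete, here is a minimal sketch that lists the positions where the regex matches. The class name BoundaryPositions and the helper boundaries() are made up for illustration; the regex itself is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Same regex as in the answer; every match is an empty string.
    static final Pattern BOUNDARY =
        Pattern.compile("((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))");

    // Hypothetical helper: returns the index of every zero-width match.
    static List<Integer> boundaries(String s) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(s);
        while (m.find()) {
            positions.add(m.start()); // start() == end() for an empty match
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries fall between C|2, 3|Q, Q|2, 1|H and D|3.
        System.out.println(boundaries("CC23QQ21HD32")); // prints [2, 4, 6, 8, 10]
    }
}
```

Each reported index is a gap between two characters, which is exactly where split() will cut.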

Normally split() removes whatever the regex matches, and that is still true here. However, this regex consists only of lookarounds, so every match is zero-width (an empty string between two characters); split() cuts at that position and no actual characters are removed.
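As a rough illustration of that difference, the sketch below splits the same input once with a delimiter that consumes characters and once with the zero-width regex. The class and method names (ZeroWidthSplit, consuming, zeroWidth) are only illustrative.

```java
import java.util.Arrays;

class ZeroWidthSplit {
    // Consuming delimiter: each run of digits is the separator and is dropped.
    static String[] consuming(String s) {
        return s.split("[0-9]+");
    }

    // Zero-width delimiter (the regex from this answer): nothing is dropped.
    static String[] zeroWidth(String s) {
        return s.split("((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))");
    }

    public static void main(String[] args) {
        String s = "CC23QQ21HD32";
        System.out.println(Arrays.toString(consuming(s))); // prints [CC, QQ, HD]
        System.out.println(Arrays.toString(zeroWidth(s))); // prints [CC, 23, QQ, 21, HD, 32]
    }
}
```

In the first call the digits are lost because they are the delimiter; in the second call the delimiter is empty, so both the letters and the digits survive.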

Check out Adam Paynter's answer for more on lookaheads and lookbehinds: How to split string with some separator but without removing that separator in Java?

Cramps answered Jan 07 '23 12:01


You can match on 1 or more contiguous alpha characters or 1 or more contiguous numerical characters. Once the sequence is interrupted, stop matching, store the sequence, and then start over. Characters that are neither letters nor digits are ignored entirely.
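A minimal sketch of that matching approach, assuming the letter class should cover both cases ([A-Za-z]) so that the lowercase f in BPOASDf is kept. The class name MatchRuns and the helper runs() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchRuns {
    // A run of digits, or a run of letters in either case (assumption: ASCII only).
    static final Pattern RUNS = Pattern.compile("[0-9]+|[A-Za-z]+");

    // Hypothetical helper: collects every run the pattern finds, in order.
    static List<String> runs(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = RUNS.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(runs("BPOASDf91A5HH123")); // prints [BPOASDf, 91, A, 5, HH, 123]
        System.out.println(runs("AB-12"));            // prints [AB, 12]
    }
}
```

Note the difference on input such as AB-12: the matcher skips the hyphen and yields two tokens, whereas a lookaround split on letter/digit boundaries finds no boundary there and leaves AB-12 as a single token.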

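Edit: I created a simple performance test below to compare the speed of String.split() and Pattern.matcher(). In the single run shown, the split version was about 2.5x faster than the matcher+loop version (1,953,349 ns vs 4,879,125 ns). The timings include printing and one-time warm-up, so treat them as a rough indication only.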

Solution

private static String[] splitAlphaNumeric(String str) {
    return str.split("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))");
}

Example

import java.util.*;
import java.util.regex.*;

public class SplitAlphaNumeric {
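    // Note: [A-Z] is case-sensitive here, so lowercase letters (the 'f' in "BPOASDf") are skipped by parse().
    private static final Pattern ALPH_NUM_PAT = Pattern.compile("[0-9]+|[A-Z]+");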

    private static List<String> input = Arrays.asList(
        "RC23",
        "CC23QQ21HD32",
        "BPOASDf91A5HH123"
    );

    public static void main(String[] args) {
        System.out.printf("Execution time: %dns%n", testMatch());
        System.out.printf("Execution time: %dns%n", testSplit());
    }

    public static long testMatch() {
        System.out.println("Begin Test 1...");
        long start = System.nanoTime();
        for (String str : input) {
            System.out.printf("%-16s -> %s%n", str, parse(str));
        }
        long end = System.nanoTime();
        return end - start;
    }

    public static long testSplit() {
        System.out.println("\nBegin Test 2...");
        long start = System.nanoTime();
        for (String str : input) {
            System.out.printf("%-16s -> %s%n", str, parse2(str));
        }
        long end = System.nanoTime();
        return end - start;
    }

    private static List<String> parse(String str) {
        List<String> parts = new LinkedList<String>();
        Matcher matcher = ALPH_NUM_PAT.matcher(str);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    private static List<String> parse2(String str) {
        return Arrays.asList(str.split("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))"));
    }
}

Output

Begin Test 1...
RC23             -> [RC, 23]
CC23QQ21HD32     -> [CC, 23, QQ, 21, HD, 32]
BPOASDf91A5HH123 -> [BPOASD, 91, A, 5, HH, 123]
Execution time: 4879125ns

Begin Test 2...
RC23             -> [RC, 23]
CC23QQ21HD32     -> [CC, 23, QQ, 21, HD, 32]
BPOASDf91A5HH123 -> [BPOASDf, 91, A, 5, HH, 123]
Execution time: 1953349ns
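
The numbers above come from a single pass that also prints inside the timed loop, so they mostly reflect warm-up and I/O. Below is a hedged sketch of a slightly fairer comparison: both patterns are precompiled, nothing is printed while timing, and one warm-up round is discarded. All names here (SplitTimingSketch, viaMatcher, viaSplit, time) are made up, the round counts are arbitrary, and a proper harness such as JMH would still be more reliable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitTimingSketch {
    static final Pattern RUNS = Pattern.compile("[0-9]+|[A-Za-z]+");
    static final Pattern BOUNDARY =
        Pattern.compile("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))");
    static final String[] INPUT = {"RC23", "CC23QQ21HD32", "BPOASDf91A5HH123"};

    static List<String> viaMatcher(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = RUNS.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    static String[] viaSplit(String s) {
        return BOUNDARY.split(s); // precompiled, unlike String.split(regex)
    }

    // Runs both approaches `rounds` times; returns {matcherNanos, splitNanos}.
    static long[] time(int rounds) {
        long sink = 0; // accumulate results so the work cannot be discarded
        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : INPUT) sink += viaMatcher(s).size();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : INPUT) sink += viaSplit(s).length;
        }
        long t2 = System.nanoTime();
        if (sink < 0) throw new IllegalStateException("unreachable");
        return new long[] {t1 - t0, t2 - t1};
    }

    public static void main(String[] args) {
        time(20_000);                 // warm-up round, result discarded
        long[] t = time(200_000);     // measured round
        System.out.printf("matcher: %d ns, split: %d ns%n", t[0], t[1]);
    }
}
```

The results will vary by JVM and machine, so no expected timings are given here.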
Mr. Polywhirl answered Jan 07 '23 11:01