
Multiple simultaneous substring replacements in Java

(I come from the python world, so I apologise if some of the terminology I use jars with the norm.)

I have a String with a List of start/end indices to replace. Without getting too much into detail, consider this basic mockup:

String text = "my email is [email protected] and my number is (213)-XXX-XXXX";
List<Token> findings = SomeModule.someFnc(text);

Token is defined as:

class Token {
    int start, end;
    String type;
}

This List represents start and end positions of sensitive data that I'm trying to redact.

Effectively, the API returns data that I iterate over to get:

[{ "start" : 12, "end" : 22, "type" : "EMAIL_ADDRESS" }, { "start" : 41, "end" : 54, "type" : "PHONE_NUMBER" }]

Using this data, my end goal is to redact the tokens in text specified by these Token objects to get this:

"my email is [EMAIL_ADDRESS] and my number is [PHONE_NUMBER]"

The thing that makes this question non-trivial is that the replacement substrings aren't always the same length as the substrings they're replacing.

My current plan of action is to build a StringBuilder from text, sort the tokens in descending order of start index, and then replace from the right end of the buffer.

But something tells me there should be a better way... is there?

cs95 asked Jun 19 '18 05:06



2 Answers

This approach works, provided the findings are sorted by start index, do not overlap, and use an inclusive end index:

import java.util.ArrayList;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        String text = "my email is [email protected] and my number is (213)-XXX-XXXX";

        List<Token> findings = new ArrayList<>();
        findings.add(new Token(12, 22, "EMAIL_ADDRESS"));
        findings.add(new Token(41, 54, "PHONE_NUMBER"));

        System.out.println(replace(text, findings));
    }

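    public static String replace(String text, List<Token> findings) {
        int position = 0; // index of the first character not yet copied
        StringBuilder result = new StringBuilder();

        for (Token finding : findings) {
            // Copy the untouched text before this token, then the placeholder.
            result.append(text.substring(position, finding.start));
            result.append('[').append(finding.type).append(']');

            // finding.end is inclusive, so resume right after it.
            position = finding.end + 1;
        }

        // Copy whatever follows the last token.
        return result.append(text.substring(position)).toString();
    }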
}

class Token {
    int start, end;
    String type;

    Token(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

Output:

my email is [EMAIL_ADDRESS] and my number is [PHONE_NUMBER]
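
The loop above depends on the findings arriving sorted by start index and never overlapping, and the question does not say whether the API guarantees either. As a minimal sketch under that uncertainty, the variant below sorts a copy of the list first and rejects overlapping tokens. The names `SafeRedactionSketch` and `redact` and the sample text are hypothetical, and `Token` is repeated as a nested class only so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SafeRedactionSketch {
    // Same shape as the Token class above; nested so this snippet stands alone.
    static class Token {
        final int start, end; // end is inclusive, as in the question's data
        final String type;

        Token(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    static String redact(String text, List<Token> findings) {
        // Sort a copy so the caller's list is left untouched.
        List<Token> sorted = new ArrayList<>(findings);
        sorted.sort(Comparator.comparingInt((Token t) -> t.start));

        StringBuilder result = new StringBuilder();
        int position = 0; // index of the first character not yet copied
        for (Token finding : sorted) {
            // A token that starts before the previous one ended cannot be
            // redacted unambiguously, so fail loudly instead of guessing.
            if (finding.start < position) {
                throw new IllegalArgumentException(
                        "Overlapping token at index " + finding.start);
            }
            result.append(text, position, finding.start);
            result.append('[').append(finding.type).append(']');
            position = finding.end + 1;
        }
        return result.append(text, position, text.length()).toString();
    }

    public static void main(String[] args) {
        // Made-up sample; the tokens are deliberately passed out of order.
        String text = "user jdoe, pin 1234";
        List<Token> findings = Arrays.asList(
                new Token(15, 18, "PIN"),
                new Token(5, 8, "USERNAME"));

        System.out.println(redact(text, findings));
        // prints: user [USERNAME], pin [PIN]
    }
}
```

Whether overlapping findings should throw, be merged, or be resolved by priority depends on what the detection API can return, so the exception here is only one possible policy.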
Robby Cornelissen answered Oct 03 '22 23:10



Ensure that all tokens are sorted by start index in ascending order:

List<Token> tokens = new ArrayList<>();
// Token exposes plain fields rather than getters, so compare on the field directly.
tokens.sort(Comparator.comparingInt(token -> token.start));

Now you can replace the tokens starting from the end of the input text, so that each replacement leaves the indices of the tokens before it unchanged:

public String replace(String text, List<Token> tokens) {
    StringBuilder sb = new StringBuilder(text);
    for (int i = tokens.size() - 1; i >= 0; i--) {
        Token token = tokens.get(i);
        sb.replace(token.start, token.end + 1, "[" + token.type + "]");
    }
    return sb.toString();
}
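
To illustrate why the loop runs from the last token to the first, here is a rough sketch that puts the backward version next to a forward one. Going forward, every replacement shifts the text after it, so a running offset has to be carried along; going backward, no bookkeeping is needed. The class and method names (`ReplaceOrderSketch`, `replaceFromEnd`, `replaceFromStart`) and the sample text are made up for this example, and `Token` is repeated as a nested class so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReplaceOrderSketch {
    // Same shape as the question's Token; nested so this snippet stands alone.
    static class Token {
        final int start, end; // end is inclusive
        final String type;

        Token(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Backward: a replacement only moves text to its right,
    // and everything to its right has already been processed.
    static String replaceFromEnd(String text, List<Token> tokens) {
        List<Token> descending = new ArrayList<>(tokens);
        descending.sort(Comparator.comparingInt((Token t) -> t.start).reversed());

        StringBuilder sb = new StringBuilder(text);
        for (Token token : descending) {
            sb.replace(token.start, token.end + 1, "[" + token.type + "]");
        }
        return sb.toString();
    }

    // Forward: each replacement shifts everything after it,
    // so the original indices must be corrected by the accumulated shift.
    static String replaceFromStart(String text, List<Token> tokens) {
        List<Token> ascending = new ArrayList<>(tokens);
        ascending.sort(Comparator.comparingInt((Token t) -> t.start));

        StringBuilder sb = new StringBuilder(text);
        int shift = 0; // how far the not-yet-processed text has moved
        for (Token token : ascending) {
            String replacement = "[" + token.type + "]";
            int originalLength = token.end + 1 - token.start;
            sb.replace(token.start + shift, token.end + 1 + shift, replacement);
            shift += replacement.length() - originalLength;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "user jdoe, pin 1234"; // made-up sample
        List<Token> tokens = Arrays.asList(
                new Token(5, 8, "USERNAME"),
                new Token(15, 18, "PIN"));

        System.out.println(replaceFromEnd(text, tokens));
        System.out.println(replaceFromStart(text, tokens));
        // both lines print: user [USERNAME], pin [PIN]
    }
}
```

Both methods assume non-overlapping tokens. The backward loop is shorter and harder to get wrong, which is a reasonable argument for preferring it when replacing in place.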
Oleksandr Pyrohov answered Oct 03 '22 22:10
