 

Extract attributes from a string

I have to deal with a problem caused by a poor design. I get a list of strings and want to parse attributes out of each one. Unfortunately, I can't change the source where these strings are created.

Example:

String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false"

Now I want to extract the attributes type, languageCode, url, ref, info and deactivated.

The problem here is the field info, whose text is not delimited by quotation marks. Commas may also occur in this field, so I can't use the comma at the end of the field to find out where it ends.

Additionally, these strings do not always contain all attributes: type, info and deactivated are always present; the rest are optional.

Any suggestions on how I can solve this problem?

Euestros asked Jun 03 '15 21:06


3 Answers

One possible solution is to search for = characters in the input and take the single word immediately before each one as the field name. It seems that all your field names are single words (no whitespace). If that's the case, you can then take everything after the = up to the next field name (accounting for the separating ,) as the value.

This assumes that the value cannot contain =.

Edit:

As a possible way to handle an embedded =, you can check whether the word in front of it is one of your known field names. If it is not, you can treat the = as an embedded character rather than an operator. This, however, assumes that you have a fixed set of known fields (some of which may not always appear). This assumption may be eased if you know that the field names are case-sensitive.
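
A minimal Java sketch of this idea follows. The class name KeyScanParser, the method parse and the set KNOWN_KEYS are hypothetical names chosen for illustration. The sketch assumes that fields are separated by ", " and that an = only starts a new field when the word directly before it is a known field name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyScanParser {
    // Hypothetical fixed set of field names, taken from the question.
    static final Set<String> KNOWN_KEYS = new HashSet<>(Arrays.asList(
            "type", "languageCode", "url", "ref", "info", "deactivated"));

    // A candidate key is a single word at the start of the input or after ", ",
    // immediately followed by '='.
    private static final Pattern CANDIDATE = Pattern.compile("(?:^|, )(\\w+)=");

    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = CANDIDATE.matcher(input);
        String currentKey = null;
        int valueStart = 0;
        while (m.find()) {
            if (!KNOWN_KEYS.contains(m.group(1))) {
                continue; // unknown word before '=': leave it inside the current value
            }
            if (currentKey != null) {
                // The previous value ends where this separator begins.
                result.put(currentKey, input.substring(valueStart, m.start()));
            }
            currentKey = m.group(1);
            valueStart = m.end();
        }
        if (currentKey != null) {
            result.put(currentKey, input.substring(valueStart));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "type=INFO, ref=1, info=Text with x=1, and commas., deactivated=false";
        parse(s).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

In this sketch the embedded x=1 stays part of info because x is not a known field name. A value that contains ", " followed by a known field name and = would still be split at that point.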

xxbbcc answered Sep 30 '22 03:09


Assuming that the order of elements is fixed, you could write a solution using a regex like this one:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false";

String regex = //type, info and deactivated are always present
          "type=(?<type>.*?)"
        + "(?:, languageCode=(?<languageCode>.*?))?"//optional group
        + "(?:, url=(?<url>.*?))?"//optional group
        + "(?:, ref=(?<ref>.*?))?"//optional group
        + ", info=(?<info>.*?)"
        + ", deactivated=(?<deactivated>.*?)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
if(m.matches()){//matches() forces the lazy groups to cover the whole string
    System.out.println("type -> "+m.group("type"));
    System.out.println("languageCode -> "+m.group("languageCode"));
    System.out.println("url -> "+m.group("url"));
    System.out.println("ref -> "+m.group("ref"));
    System.out.println("info -> "+m.group("info"));
    System.out.println("deactivated -> "+m.group("deactivated"));
}

Output:

type -> INFO
languageCode -> EN-GB
url -> http://www.stackoverflow.com
ref -> 1
info -> Text, that may contain all kind of chars.
deactivated -> false
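
When an optional attribute is missing, the corresponding named group does not participate in the match and Matcher.group(name) returns null. The following sketch shows one way this might be handled. The class OptionalGroupsDemo and the helper groupOrDefault are hypothetical names, and the group for ref= is named ref here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupsDemo {
    // Fixed-order pattern; only type, info and deactivated are required.
    static final Pattern FIXED_ORDER = Pattern.compile(
              "type=(?<type>.*?)"
            + "(?:, languageCode=(?<languageCode>.*?))?"
            + "(?:, url=(?<url>.*?))?"
            + "(?:, ref=(?<ref>.*?))?"
            + ", info=(?<info>.*?)"
            + ", deactivated=(?<deactivated>.*?)");

    // Returns the group's value, or the fallback if the group did not participate.
    static String groupOrDefault(Matcher m, String name, String fallback) {
        String value = m.group(name);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // Input without any of the optional attributes.
        String s = "type=INFO, info=Text, with commas., deactivated=false";
        Matcher m = FIXED_ORDER.matcher(s);
        if (m.matches()) {
            System.out.println("type -> " + m.group("type"));
            System.out.println("url -> " + groupOrDefault(m, "url", "(missing)"));
            System.out.println("ref -> " + groupOrDefault(m, "ref", "(missing)"));
            System.out.println("info -> " + m.group("info"));
            System.out.println("deactivated -> " + m.group("deactivated"));
        }
    }
}
```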

EDIT: Version 2 uses a regex that searches for oneOfPossibleKeys=value, where the value ends either:

  • right before , oneOfPossibleKeys=
  • or at the end of the string (represented by $).

Code:

String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false";

String[] possibleKeys = {"type","languageCode","url","ref","info","deactivated"};
String keysStrRegex = String.join("|", possibleKeys);
//above will contain type|languageCode|url|ref|info|deactivated

String regex = "(?<key>\\b(?:"+keysStrRegex+")\\b)=(?<value>.*?(?=, (?:"+keysStrRegex+")=|$))";
    // (?<key>\b(?:type|languageCode|url|ref|info|deactivated)\b)
    // =
    // (?<value>.*?(?=, (?:type|languageCode|url|ref|info|deactivated)=|$))
System.out.println(regex);

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);


while(m.find()){
    System.out.println(m.group("key")+" -> "+m.group("value"));
}

Output:

type -> INFO
languageCode -> EN-GB
url -> http://www.stackoverflow.com
ref -> 1
info -> Text, that may contain all kind of chars.
deactivated -> false
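
If the pairs are needed later rather than just printed, they could be collected into a map. This is a sketch only: the class name AttributeParser and the method parse are hypothetical, and a repeated key simply overwrites the earlier value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeParser {
    private static final String KEYS = "type|languageCode|url|ref|info|deactivated";

    // key=value, where the value ends before ", <known key>=" or at the end of the input.
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\b(?:" + KEYS + ")\\b)=(?<value>.*?(?=, (?:" + KEYS + ")=|$))");

    static Map<String, String> parse(String input) {
        Map<String, String> attributes = new LinkedHashMap<>(); // keeps encounter order
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            attributes.put(m.group("key"), m.group("value"));
        }
        return attributes;
    }

    public static void main(String[] args) {
        String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, "
                 + "info=Text, that may contain all kind of chars., deactivated=false";
        Map<String, String> attributes = parse(s);
        System.out.println(attributes.get("info"));
        // Optional attributes that are absent are simply not in the map.
        System.out.println(parse("type=INFO, info=x, deactivated=true").containsKey("url"));
    }
}
```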

Pshemo answered Sep 30 '22 02:09


You could use a regular expression that captures all the "fixed" groups and uses whatever remains for info. This should work even if the info part contains , or = characters. Here's a quick example (using Python, but that should not be a problem...).

>>> import re
>>> p = r"(type=[A-Z]+), (languageCode=[-A-Z]+), (url=[^,]+), (ref=\d+), (info=.+?), (deactivated=(?:true|false))"
>>> s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars, even deactivated=true., deactivated=false"
>>> re.search(p, s).groups()
('type=INFO',
 'languageCode=EN-GB',
 'url=http://www.stackoverflow.com',
 'ref=1',
 'info=Text, that may contain all kind of chars, even deactivated=true.',
 'deactivated=false')

If any of those elements are optional, you can put a ? after those groups and make the comma optional. If the order can be different, then it's more complicated. In this case, instead of using one regex to capture everything at once, use several regexes to capture the individual attributes, and remove each match from the string (replace it with '') before matching the next attribute. Finally, match info against what remains.
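
The strip-and-match idea for arbitrary order might look roughly like this in Java. Everything here is an assumption made for illustration: the class name StripAndMatch, the method extract, the value shapes for each attribute, and the choice to take the first occurrence of each fixed attribute. If info itself contains something like ", deactivated=true, ", this sketch would pick the wrong occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripAndMatch {
    static Map<String, String> extract(String input) {
        // Assumed value shapes for the "fixed" attributes (illustrative only).
        Map<String, String> valuePatterns = new LinkedHashMap<>();
        valuePatterns.put("type", "[A-Z]+");
        valuePatterns.put("languageCode", "[-A-Z]+");
        valuePatterns.put("url", "[^,]+");
        valuePatterns.put("ref", "\\d+");
        valuePatterns.put("deactivated", "true|false");

        Map<String, String> result = new LinkedHashMap<>();
        String rest = input;
        for (Map.Entry<String, String> e : valuePatterns.entrySet()) {
            // Attribute at the start or after ", ", followed by ", " or the end.
            Pattern p = Pattern.compile(
                    "(?:^|, )" + e.getKey() + "=(" + e.getValue() + ")(?=, |$)");
            Matcher m = p.matcher(rest);
            if (m.find()) {
                result.put(e.getKey(), m.group(1));
                // Remove the matched attribute (and its leading separator).
                rest = rest.substring(0, m.start()) + rest.substring(m.end());
            }
        }
        // Whatever is left should be the info attribute.
        Matcher info = Pattern.compile("(?:^|, )info=(.*)$").matcher(rest);
        if (info.find()) {
            result.put("info", info.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "deactivated=false, ref=1, info=Text, with commas., type=INFO";
        extract(s).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```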


On further consideration, given that those attributes could have any order, it may be more promising to capture just everything spanning from one keyword to the next, regardless of its actual content, very similar to Pshemo's solution:

keys = "type|languageCode|url|ref|info|deactivated"
p = r"({0})=(.+?)(?=\, (?:{0})=|$)".format(keys)
matches = re.findall(p, s)

But this, too, might fail in some very obscure cases, e.g. if the info attribute contains something like ', ref=foo', including the comma. However, there seems to be no way around those ambiguities. If you had a string like info=in this string, ref=1, and in another, ref=2, ref=1, does it contain one ref attribute, or three, or none at all?
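
To see what the keyword-to-keyword regex does with such an ambiguous string, here is a small Java sketch. The class name AmbiguityDemo and the method pairs are hypothetical. It only reports what this particular regex finds, which is one of several defensible readings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmbiguityDemo {
    private static final String KEYS = "type|languageCode|url|ref|info|deactivated";
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\b(?:" + KEYS + ")\\b)=(?<value>.*?(?=, (?:" + KEYS + ")=|$))");

    // Returns every key/value pair found, in order, formatted as "key -> value".
    static List<String> pairs(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            found.add(m.group("key") + " -> " + m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "info=in this string, ref=1, and in another, ref=2, ref=1";
        // The regex reports one info and three ref attributes here.
        pairs(s).forEach(System.out::println);
    }
}
```

With this input the regex splits at every ", ref=", so info is cut short after "in this string". Nothing in the string itself says whether that reading is the intended one.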

tobias_k answered Sep 30 '22 03:09