
Parsing a string list with multiple values into JSON

I have about thirty thousand records with a string column that has been stored in the following format, with different keys:

"something: this, this and that, that, other stuff, another: name, another name, last: here"

In Rails, I want to convert it into a hash like this:

{
    something: [ "this", "this and that", "that", "other stuff" ],
    another: [ "name", "another name" ],
    last: [ "here" ]   
}

Is there an elegant way to do this? I was thinking of splitting at each colon, then searching backward from it for the first space to find where the key starts.

michael asked Mar 04 '26 23:03


2 Answers

There are about a hundred ways to solve this. A pretty straightforward one is this:

str = "something: this, this and that, that, other stuff, another: name, another name, last: here"

key = nil
str.scan(/\s*([^,:]+)(:)?\s*/).each_with_object({}) do |(val, colon), hsh|
  if colon
    key = val.to_sym
    hsh[key] = []
  else
    hsh[key] << val
  end
end
# => {
#      something: ["this", "this and that", "that", "other stuff"], 
#      another: ["name", "another name"],
#      last: ["here"]
#    }

It works by scanning the string with the following regular expression:

/
  \s*      # any amount of optional whitespace
  ([^,:]+) # one or more characters that aren't , or : (capture 1)
  (:)?     # an optional trailing : (capture 2)
  \s*      # any amount of optional whitespace
/x

Then it iterates over the matches and puts them into a hash. When a match has a trailing colon (capture 2), a new hash key is created with an empty array for a value. Otherwise the value (capture 1) is added to the array for the most recent key.
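To make the two captures concrete, here is a minimal sketch that shows what scan returns for a short input and wraps the same loop in a method. The method name parse_list and the `elsif key` guard are assumptions added for illustration and are not part of the answer above. Without the guard, a value that appears before any key would raise an error on nil.

```ruby
# Hypothetical wrapper name; the logic mirrors the inline loop above.
def parse_list(str)
  key = nil
  str.scan(/\s*([^,:]+)(:)?\s*/).each_with_object({}) do |(val, colon), hsh|
    if colon
      # A trailing colon marks a new key with an empty list of values.
      key = val.to_sym
      hsh[key] = []
    elsif key
      # Assumption: values seen before any key are silently skipped.
      hsh[key] << val
    end
  end
end

# Each match is a pair of [capture 1, capture 2]; capture 2 is nil for values.
pairs = "a: b, c".scan(/\s*([^,:]+)(:)?\s*/)
# pairs is [["a", ":"], ["b", nil], ["c", nil]]

small = parse_list("a: b, c")
# small is { a: ["b", "c"] }
```

Whether to skip or reject leading values without a key depends on how clean the thirty thousand records are; raising an error instead may be preferable when malformed rows should be noticed.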

Or…

A somewhat less straightforward but cleverer approach is to let the regular expression do more of the work:

MATCH_LIST_ENTRY = /([^:]+):\s*((?:[^,]+(?:,\s*|$))+?)(?=[^:,]+:|$)/

def parse_list2(str)
  str.scan(MATCH_LIST_ENTRY).map do |k, vs|
    [k.to_sym, vs.split(/,\s*/)]
  end.to_h
end

I won't pick apart the regular expression for this one, but it's simpler than it looks. Regexper does a pretty good job of explaining it.
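For readers who prefer to see the pattern spelled out, the following is a sketch of the same expression written in Ruby's extended (/x) mode, with one comment per piece. The constant name MATCH_LIST_ENTRY_X is an assumption used here only to keep it apart from the original; the pattern itself is unchanged.

```ruby
MATCH_LIST_ENTRY_X = /
  ([^:]+)         # capture 1: the key, everything before the next colon
  :\s*            # the colon itself and any whitespace after it
  (               # capture 2: the raw comma-separated values
    (?:
      [^,]+       # one value, which may not contain a comma
      (?:,\s*|$)  # then a comma plus whitespace, or the end of the string
    )+?           # lazily repeated, so it stops at the first valid boundary
  )
  (?=[^:,]+:|$)   # boundary: another key and colon follow, or the string ends
/x

str = "something: this, this and that, that, other stuff, another: name, another name, last: here"

# Same post-processing as parse_list2: symbolize the key, split the values.
result = str.scan(MATCH_LIST_ENTRY_X).map do |k, vs|
  [k.to_sym, vs.split(/,\s*/)]
end.to_h
```

The lazy quantifier together with the lookahead is what keeps "other stuff" under something: the value group grows one item at a time until the text that follows looks like a key and a colon.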

You can see both of these in action on repl.it here: https://repl.it/@jrunning/LongtermMidnightblueAssembler

Jordan Running answered Mar 06 '26 15:03


If str is the string given in the example, the desired hash can be constructed as follows.

str.split(/, *(?=\p{L}+:)/).
    each_with_object({}) do |s,h|
      k, v = s.split(/: +/)
      h[k.to_sym] = v.split(/, */)
    end
  #=> {:something=>["this", "this and that", "that", "other stuff"],
  #    :another=>["name", "another name"],
  #    :last=>["here"]} 

Note:

str.split(/, *(?=\p{L}+:)/)
  #=> ["something: this, this and that, that, other stuff",
  #    "another: name, another name",
  #    "last: here"] 

This regular expression reads, "match a comma followed by zero or more spaces, provided the match is immediately followed by one or more Unicode letters and then a colon". (?=\p{L}+:) is a positive lookahead, so the letters and the colon are checked but are not part of the match.
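A small sketch of why the lookahead matters: split discards whatever the delimiter pattern matches, so if the key is part of the match it is lost. The sample string and variable names below are made up for illustration.

```ruby
sample = "a: one, two, b: three"

# Without a lookahead the key "b:" is part of the delimiter, so split drops it.
consumed = sample.split(/, *\p{L}+:/)
# consumed is ["a: one, two", " three"]

# With the lookahead only the comma and spaces form the delimiter,
# so the key stays at the start of the next piece.
kept = sample.split(/, *(?=\p{L}+:)/)
# kept is ["a: one, two", "b: three"]
```

Note that \p{L}+ matches letters only, so a key containing spaces, digits, or underscores would not be recognized as a split point by this pattern.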

Cary Swoveland answered Mar 06 '26 14:03


