I want to transform the structured text below using a bash script. However, my limited knowledge makes the task very challenging.
Input sample:
"4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano
"A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano
"ALHO-E-OLEO": Mucarela Alho oleo Oregano
"PEITO-DE-PERU-ESPECIAL": Mucarela Peito-de-Peru Catupiry Oregano
Output sample:
"4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],
"A-MODA": ["mucarela", "presunto", "calabresa", "bacon", "tomate", "milho", "oregano"],
"ALHO-E-OLEO": ["mucarela", "alho", "oleo", "oregano"],
"PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]
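As you can see above, we need to lowercase each value, enclose it in double quotes, separate the values with commas, and wrap the whole list in square brackets.
The cherry on top is the comma at the end of each line except the last.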
I decided to do a deep dive on sed, and more specifically to understand @HatLess's sed kung fu in the answer above. I ran the command as posted with --debug and spent a little more time digging into regex and other sed-isms. It's one thing to get an answer that solves the problem with a one-liner; it's another to grok what just happened and what you did to get there. So here is my play-by-play of the above answer, because I am not satisfied just memorizing formulas or patterns!
Lifting the hood to see how the sausage is made is the only way this stuff really sinks in, especially with something like sed. It's like learning the fundamentals of music: once you figure out the patterns, composing your own symphony is not a far stretch.
Let's break this sed command/script down and go step by step:
sed --debug -E ':a;s/([^ ]*) ([^ ]*)/\1"\2",/;ta;s/(:)(.*")/\1 [\L\2]/;$s/,$//;s/,/& /g' input_file
Note the single quotes enclose the set of commands sed will be running, and the semicolons separate the sed commands.
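To follow along, here is a minimal sketch that recreates the sample input and runs the same program without --debug. It assumes GNU sed (the \L escape is a GNU extension) and writes a file named input_file into the current directory; the variable name result is only for illustration.

```shell
# Recreate the sample input from the question (file name as used in this answer).
cat > input_file <<'EOF'
"4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano
"A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano
"ALHO-E-OLEO": Mucarela Alho oleo Oregano
"PEITO-DE-PERU-ESPECIAL": Mucarela Peito-de-Peru Catupiry Oregano
EOF

# Same program as above, minus --debug. Assumes GNU sed because of \L.
result=$(sed -E ':a;s/([^ ]*) ([^ ]*)/\1"\2",/;ta;s/(:)(.*")/\1 [\L\2]/;$s/,$//;s/,/& /g' input_file)
printf '%s\n' "$result"
```

One detail worth knowing: the final s/,/& /g also puts a space after the comma that ends every line except the last, so those lines carry a trailing space.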
Annotate program execution
--debug
Use extended regular expressions
-E
Mark a spot to jump to for iterations
:a
is a label that can be returned to using t a
Add quotes and commas to the values
s/([^ ]*) ([^ ]*)/\1"\2",/
# matches '"4-QUEIJOS": Mucarela' pattern, initially
# group 1 = '"4-QUEIJOS":'
# group 2 = 'Mucarela'
Removing the spaces seems to be a clever "trick" so the next iteration can set capture group 2 to the next word and so on, until all the values are properly formatted... then the space is added back in later.
s # sed substitute command, (i.e.: s/old/new/)
/ # start search pattern from here
( # start a capture group (group 1)
[^ # begin negated set (matches any character not listed)
  # a literal space: the only character in the set
] # end of the set, i.e. "any non-space character"
* # match zero or more of those characters
) # end capture group (group 1)
  # a literal space between the first and second capture groups
( # begin capture group (group 2)
[^ # begin negated set (matches any character not listed)
  # a literal space: the only character in the set
] # end of the set, i.e. "any non-space character"
* # match zero or more of those characters
) # end capture group (group 2)
/ # replace above found items with items below
\1 # represents string in capture group 1
"\2", # surround group 2 w/ quotes and trailing comma
/ # end the replacement
Branch to label 'a'...
t a # if above match was successful, jump back to
# position label 'a' (start from the beginning)
# replacing the second group pattern with "x",
# until there are no more matches.. then go on
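To see what the loop buys us, here is a small sketch that runs the same substitution once and then again inside the :a ... ta loop. The shortened input line and the variable names (line, once, looped) are made up for this illustration.

```shell
line='"4-QUEIJOS": Mucarela Provolone Catupiry'

# One pass only: just the first value gets its quotes and comma.
once=$(printf '%s\n' "$line" | sed -E 's/([^ ]*) ([^ ]*)/\1"\2",/')

# With the label and the conditional branch, the substitution repeats
# until the line has no space left to match.
looped=$(printf '%s\n' "$line" | sed -E ':a;s/([^ ]*) ([^ ]*)/\1"\2",/;ta')

printf '%s\n%s\n' "$once" "$looped"
# "4-QUEIJOS":"Mucarela", Provolone Catupiry
# "4-QUEIJOS":"Mucarela","Provolone","Catupiry",
```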
Wrap the values in brackets, make them lowercase, and restore the space after the colon
s/(:)(.*")/\1 [\L\2]/
# matches ':"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
# group 1 = ':'
# group 2 = '"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
s # sed substitute command, (i.e.: s/old/new/)
/ # start search pattern from here
( # start a capture group (group 1)
: # look for the colon char
) # end capture group (group 1)
( # start a capture group (group 2)
. # match any char, including space
*" # match any number of chars up to last quote
) # end capture group (group 2)
/ # replace above found groups with items below
\1 # represents string in capture group 1
# output a space after the first item(s)
[\L\2] # set group2 lowercase + surround w/ brackets
/ # end the replacement
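This step can be tried in isolation on a line that has already been through the loop. A minimal sketch, assuming GNU sed for \L; the shortened input and the variable names (quoted, bracketed) are just for illustration.

```shell
quoted='"4-QUEIJOS":"Mucarela","Provolone",'

# (:) grabs the first colon, (.*") grabs everything up to the last quote.
# \L lowercases what follows it in the replacement, here group 2.
bracketed=$(printf '%s\n' "$quoted" | sed -E 's/(:)(.*")/\1 [\L\2]/')

printf '%s\n' "$bracketed"
# "4-QUEIJOS": ["mucarela","provolone"],
```

The key stays uppercase because it sits before the matched text, and the trailing comma survives because it sits after the last quote.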
Remove last comma on last line
$ s/,$//
# matches ',' at the end of the last line read from the file and removes it
$ # match on the last line in the input file
s # sed substitute command, (i.e.: s/old/new/)
/ # start search pattern from here
,$ # match comma at end of line
// # replace with nothing (delete)
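Note that $ plays two different roles in this command: in front of the s it is an address meaning "last input line", and inside the pattern it anchors the match to the end of the line. A tiny sketch with made-up three-line input (the variable name trimmed is an assumption for this example):

```shell
# Every line ends in a comma; only the last line should lose it.
trimmed=$(printf '%s\n' 'a,' 'b,' 'c,' | sed '$s/,$//')

printf '%s\n' "$trimmed"
# a,
# b,
# c
```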
Replace all commas with a comma and a space
s/,/& /g
# matches ',' and replaces with `, `
s # sed substitute command, (i.e.: s/old/new/)
/,/ # match on a comma
& / # replace comma with itself (comma) and space
g # do this for all commas on the line
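The effect of & and the g flag is easy to check on a short made-up list. In this sketch the variable names (list, all, first_only) are only for illustration.

```shell
list='["a","b","c"]'

# & stands for whatever the pattern matched (here, the comma itself).
all=$(printf '%s\n' "$list" | sed 's/,/& /g')

# Without g, only the first comma on the line is touched.
first_only=$(printf '%s\n' "$list" | sed 's/,/& /')

printf '%s\n%s\n' "$all" "$first_only"
# ["a", "b", "c"]
# ["a", "b","c"]
```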
And now for the play-by-play (only going to show the first line processed and part of the last line for brevity). For this exercise, I added the original data to a file called "input_file" and ran the sed command on it, just like the above answer provided.
First few lines of the --debug output
This shows the commands as interpreted by sed (described in detail above)
SED PROGRAM:
:a
s/([^ ]*) ([^ ]*)/\1"\2",/
t a
s/(:)(.*")/\1 [\L\2]/
$ s/,$//
s/,/& /g
Read in the first line of data from the input file
INPUT: 'input_file' line 1
The rest is mostly self-explanatory; where it is not, I added "comments" drawn from the man page for sed.
# pattern to be operated on from the input file
PATTERN: "4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano
# :a is a label for 'b' and 't' commands
# 'b a' means to branch to label 'a', unconditionally.
# if 'a' is omitted, branch to end of script.
# 't a' means to branch to label 'a', conditioned on
# a s/// doing a successful substitution since the last
# input line was read and since the last t or T command,
# if label 'a' is omitted, branch to end of script.
COMMAND: :a
# look for this pattern and do the quotes and comma thing...
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
regex[0] = 0-21 '"4-QUEIJOS": Mucarela'
regex[1] = 0-12 '"4-QUEIJOS":'
regex[2] = 13-21 'Mucarela'
Next - do it again
# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela", Provolone Catupiry Ricota Oregano
# because s/// did a successful substitution since the last input line
# was read and since the last t or T command, branch to label 'a'
COMMAND: t a
# starting back at label 'a' for another iteration
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
regex[0] = 0-33 '"4-QUEIJOS":"Mucarela", Provolone'
regex[1] = 0-23 '"4-QUEIJOS":"Mucarela",'
regex[2] = 24-33 'Provolone'
Another iteration... another substitution
# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela","Provolone", Catupiry Ricota Oregano
# branch to label 'a' for another iteration
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
regex[0] = 0-44 '"4-QUEIJOS":"Mucarela","Provolone", Catupiry'
regex[1] = 0-35 '"4-QUEIJOS":"Mucarela","Provolone",'
regex[2] = 36-44 'Catupiry'
...
# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry", Ricota Oregano
# branch to label 'a' for another iteration
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
regex[0] = 0-53 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry", Ricota'
regex[1] = 0-46 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry",'
regex[2] = 47-53 'Ricota'
Finish up that last word..
#... last word of the line to add quotes and a comma to coming right up
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota", Oregano
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
regex[0] = 0-63 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota", Oregano'
regex[1] = 0-55 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota",'
regex[2] = 56-63 'Oregano'
Substitution was made.. branch again...
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota","Oregano",
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
No more substitutions since last branch.. next check will move on
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota","Oregano",
COMMAND: t a
# did not branch back to a... now let's enclose the values in a list/ brackets
COMMAND: s/(:)(.*")/\1 [\L\2]/
MATCHED REGEX REGISTERS
regex[0] = 11-64 ':"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
regex[1] = 11-12 ':'
regex[2] = 12-64 '"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
Moving on
# last command produced this... good job
PATTERN: "4-QUEIJOS": ["mucarela","provolone","catupiry","ricota","oregano"],
# if this is the last line in the file, remove the trailing comma
COMMAND: $ s/,$//
# no match for this one, must not be the last line from file.. moving on
# look for commas and add a space after them
COMMAND: s/,/& /g
MATCHED REGEX REGISTERS
regex[0] = 24-25 ','
# result
PATTERN: "4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],
Would you look at that.. all done on this line!
END-OF-CYCLE:
"4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],
New line read from the file to be iterated on...
INPUT: 'input_file' line 2
PATTERN: "A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano
The cycle repeats until we get to the last part of the last line...
# result from previous operation
PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela","peito-de-peru","catupiry","oregano"],
# are we on the last line in the file? yes? k, remove comma at end of line
COMMAND: $ s/,$//
MATCHED REGEX REGISTERS
regex[0] = 75-76 ','
Nice - the end-of-line comma is gone from the last line - just need spaces added
PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela","peito-de-peru","catupiry","oregano"]
# check for commas, and replace ',' with ', '
COMMAND: s/,/& /g
MATCHED REGEX REGISTERS
regex[0] = 37-38 ','
PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]
And there it is... last line.
END-OF-CYCLE:
"PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]