I'd like to ask for your help.
I have a large piece of data, which looks like this:
a
b : c 901
d : e sda
v
w : x ads
any
abc : def 12132
ghi : jkl dasf
mno : pqr fas
stu : vwx utu
Description: the file begins with a line containing a single word (there may be whitespace before and after the word). It is followed by one or more attribute lines, each holding a key and a value separated by a colon (also possibly surrounded by whitespace). After that comes either another attribute line or a line with a single word, which starts the next section. I can't come up with the right regex to capture it in this form:
{
"a": [["b", "c 901"], ["d", "e sda"]],
"v": [["w", "x ads"]],
"any": [["abc", "def 12132"], ["ghi", "jkl dasf"]],
# etc.
}
Here is what I've tried:
import re

regex = ""
regex += "^(?:(?:\\s*)(.*?)(?:\\s*))$"
regex += "(?:(?:^(?:\\s*)(.*?)(?:\\s*):(?:\\s*)(.*?)(?:\\s*))$)*$"
pattern = re.compile(regex, re.S | re.M)
However, it doesn't find what I need. Could you help me? I know I could process the file without a regex, using a line-by-line iterator and checking for the ":" symbol, but the file is too big to process that way. (If you know how to process it quickly without a regex, that would also be a valid answer, but the first approach that comes to mind is too slow.)
Thanks in advance!
P.S. In its canonical form, the file looks like this:
a
  b : c 901
  d : e sda
Every section begins with a single word, followed by attribute lines. Each attribute line is indented by two spaces and separates its key and value with " : ". After that comes either another attribute line or a line with a single word. No other whitespace is allowed. Perhaps this form is easier to handle.
Are regular expressions really necessary here? Try this pseudocode:
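result = {}
last = None  # name of the section currently being filled
for raw_line in data:  # data: any iterable of lines, e.g. an open file object
    line = raw_line.strip()
    if not line:
        continue  # skip blank lines
    key, sep, value = line.partition(":")  # split on the first colon only
    if not sep:
        # No colon: this line starts a new section.
        last = line
        result.setdefault(last, [])
    elif last is not None:
        # "key : value" attribute line belonging to the current section.
        result[last].append([key.strip(), value.strip()])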
I hope I have understood your data structure correctly.
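If you would still like to try a regex, here is a minimal sketch rather than a definitive solution. It assumes the whole file fits in memory as one string, and the names LINE_RE and parse_sections are made up for this example. Two things in the original attempt are worth knowing: with re.S the dot also matches newlines, so a group can run across several lines, and a capturing group inside a repeated (...)* only keeps its last repetition, so a single match cannot return every attribute of a section. The sketch therefore matches one line at a time with re.finditer and builds the dictionary in Python. Whether it is actually faster than the plain loop is something you would have to measure on your data.

```python
import re

# Each match covers exactly one line (re.MULTILINE makes ^ and $ line anchors).
# [^\S\n] means "any whitespace except a newline", so a match never spans lines.
#   first alternative:  key : value   -> groups "key" and "value"
#   second alternative: single word   -> group "section"
LINE_RE = re.compile(
    r"^[^\S\n]*"
    r"(?:(?P<key>[^:\n]*?)[^\S\n]*:[^\S\n]*(?P<value>.*?)"
    r"|(?P<section>\S+))"
    r"[^\S\n]*$",
    re.MULTILINE,
)


def parse_sections(text):
    """Return {section: [[key, value], ...]} for the layout in the question."""
    result = {}
    current = None  # attribute list of the section being filled
    for m in LINE_RE.finditer(text):
        if m.group("section") is not None:
            # A single word starts (or re-opens) a section.
            current = result.setdefault(m.group("section"), [])
        elif current is not None:
            # Attribute line; ignored if no section has been seen yet.
            current.append([m.group("key"), m.group("value")])
    return result
```

Lines that fit neither shape (blank lines, or several words without a colon) are simply skipped. An attribute line is split at its first colon, so a value may itself contain colons.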