Python: regex to catch data

Tags: python, regex

I'd like to ask for your help.

I have a large piece of data, which looks like this:

     a
  b : c 901
   d : e sda
 v
     w : x ads
  any
   abc : def 12132
   ghi : jkl dasf
  mno : pqr fas
   stu : vwx utu

Description: the file begins with a line containing a single word (it may be preceded and followed by whitespace). Next comes a line of attributes separated by a colon (whitespace is also allowed around them), followed by either another attribute line or another single-word line. I can't come up with the right regex to capture it in this form:

{
  "a": [["b", "c 901"], ["d", "e sda"]],
  "v": [["w", "x ads"]],
  "any": [["abc", "def 12132"], ["ghi", "jkl dasf"]],
  # etc.
}

Here is what I've tried:

import re

regex = str()
regex += "^(?:(?:\\s*)(.*?)(?:\\s*))$"
regex += "(?:(?:^(?:\\s*)(.*?)(?:\\s*):(?:\\s*)(.*?)(?:\\s*))$)*$"
pattern = re.compile(regex, re.S | re.M)

However, it doesn't find what I need. Could you help me? I know I could process the file without regex, using a line-by-line iterator and checking for the ":" symbol, but the file is too big to process that way. (If you know how to process it quickly without regex, that would also be a valid answer, but the first approach that comes to mind is too slow.)

Thanks in advance!

P.S. In canonical form, the file looks like this:

a
  b : c 901
  d : e sda

Every section begins with a single word, followed by attribute lines (each indented by two spaces) in which the attributes are separated by " : ", then either another attribute line or a line with a single word. No other whitespace is allowed. This form will probably be easier to parse.
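
For the canonical form only, a regex along the following lines might work. This is a minimal sketch, not a tested solution for the real file: the names `SECTION`, `ATTR`, and `parse_canonical` are made up for illustration, and it assumes section names contain no whitespace or colons and that every attribute line is indented by exactly two spaces.

```python
import re

# Hypothetical sample in the canonical form described above.
text = """a
  b : c 901
  d : e sda
v
  w : x ads
"""

# One section: a header line with no whitespace and no colon, followed by
# zero or more attribute lines that start with exactly two spaces.
SECTION = re.compile(
    r"^(?P<name>[^\s:]+)$\n?(?P<attrs>(?:  [^\n]* : [^\n]*\n?)*)", re.M
)
# One attribute line: two spaces, key, " : ", value.
ATTR = re.compile(r"^  (?P<key>.*?) : (?P<value>.*)$", re.M)


def parse_canonical(text):
    """Return {section: [[key, value], ...]} for canonical-form text."""
    result = {}
    for section in SECTION.finditer(text):
        pairs = [[m["key"], m["value"]] for m in ATTR.finditer(section["attrs"])]
        # Repeated section names are merged rather than overwritten.
        result.setdefault(section["name"], []).extend(pairs)
    return result


print(parse_canonical(text))
```

Using two small patterns (one per section, one per attribute line) keeps each regex readable; a single pattern cannot return a variable number of captured pairs per section with the standard `re` module, because a repeated group only keeps its last match.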

asked Oct 21 '22 17:10 by ghostmansd


1 Answer

Are regular expressions really necessary here? Try this pseudocode:

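result = {}

last = None
for raw_line in data:  # data: any iterable of lines, e.g. an open file
    line = raw_line.strip()
    if not line:
        continue  # skip blank lines
    # Split on the first colon only, so a value may itself contain ":".
    key, sep, value = line.partition(":")
    if not sep:
        # No colon: this line starts a new section.
        last = line
        result.setdefault(last, [])
    elif last is not None:
        result[last].append([key.strip(), value.strip()])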

I hope I've understood your data structure correctly.
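
To check the idea against the sample from the question, a variant of the loop above can be wrapped in a function and fed any iterable of lines. This is only a sketch: `parse_sections` is a hypothetical name, `io.StringIO` stands in for the real file, and splitting on the first colon only is an assumption about how values containing ":" should be handled.

```python
import io


def parse_sections(lines):
    """Group "key : value" lines under the most recent single-word line."""
    result = {}
    last = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition(":")
        if not sep:
            # No colon: treat the line as a new section header.
            last = line
            result.setdefault(last, [])
        elif last is not None:
            result[last].append([key.strip(), value.strip()])
    return result


# Stand-in for the real file, using lines from the question's sample.
sample = io.StringIO(
    "     a\n"
    "  b : c 901\n"
    "   d : e sda\n"
    " v\n"
    "     w : x ads\n"
)
parsed = parse_sections(sample)
print(parsed)
```

With a real file, passing the open file object itself (for example inside a `with open(...)` block) iterates one line at a time, so the whole file never has to be held in memory at once.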

answered Nov 03 '22 06:11 by freakish