
Regex to capture different types of patterns

Tags:

python

regex

I'm trying to capture data from input like:

...
10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75
           E' a quantidade  de  itens  que o fornecedor consegue suprir
           o cliente para uma determinada data. As casa decimais estao 
           definidas no campo 022 (unid. casas decimais).              

11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81
           Data de entrega/embarque do item. Nos casos em que este cam-
           po nao contiver a data, seu conteudo devera ser ajustado en-
           tre as partes. 
...

My goal is to capture: ('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75') and so on...

My first try was to loop over the lines and capture as follows:

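import re

def parse_line(line):
    # Match a 1-6 digit number or a single word character between two
    # whitespace characters; the description is meant to be skipped.
    pattern = r"\s(\d{1,6}|\w{1})\s"
    if re.search(pattern, line):
        tab_find = re.findall(pattern, line, re.DOTALL|re.UNICODE)
        if len(tab_find) > 6:
            return tab_find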

My second try was to split the text and append the expected result:

def ugly_parsing(line):
    result = [None] * 9 # init list
    tab_r = list(filter(None, re.split(r"\s", line))) # ignore '' 
    keys = [0, 1, -1, -2, -3, -4, -5, -6]
    for i in keys:
        result[i] = tab_r[i]
    result[2] = " ".join(tab_r[2:-6])
    return result

Ignoring the description is OK, but when the description contains a single letter my regex is not working.
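A minimal sketch of that failure, assuming a sample line shaped like the input above (the names `sample` and `found` are illustrative). The pattern picks up the one-letter word A from the description as if it were a field, and in this sample it also misses the first and last fields because they are not surrounded by whitespace on both sides.

```python
import re

# Same pattern as in parse_line: a 1-6 digit number or a single word
# character, with one whitespace character on each side.
pattern = r"\s(\d{1,6}|\w{1})\s"

# Illustrative line modelled on the input above (no trailing whitespace).
sample = "10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75"

found = re.findall(pattern, sample)
print(found)
```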

Ali SAID OMAR asked Dec 09 '25 09:12



1 Answer

Just translate that line into a regex, with all the required numbers and characters, and give whatever remains to the description. You can do this using a non-greedy match: (.+?).

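import re

# `text` is assumed to hold the whole input as a single string.
p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")
for line in text.splitlines():
    m = p.match(line)
    if m:  # description lines do not match and are skipped
        print(m.groups())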

Output is

('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75')
('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81')
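For reference, a self-contained sketch of the same approach. The `text` sample and the helper name `parse_records` are assumptions made for illustration. The indented description lines are skipped because they do not start with a digit, and the sample lines are assumed to carry no trailing whitespace.

```python
import re

# Two leading numbers, a lazily matched description, then six trailing fields.
RECORD = re.compile(
    r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$"
)

def parse_records(text):
    """Return one tuple of nine fields per header line; skip everything else."""
    records = []
    for line in text.splitlines():
        m = RECORD.match(line)
        if m:
            records.append(m.groups())
    return records

# Illustrative input modelled on the question's data.
text = """\
10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75
           E' a quantidade  de  itens  que o fornecedor consegue suprir
11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81
           Data de entrega/embarque do item.
"""

for record in parse_records(text):
    print(record)
```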

Not sure whether that makes it more readable, but you could also construct that large regex from smaller parts, e.g. "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$" or "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$".
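As a quick check that both constructions yield the same pattern as the handwritten one, a small sketch (the variable names are illustrative, and the separator is written as a raw string, r"\s+", so the backslash is not treated as a string escape):

```python
import re

handwritten = r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$"

# Variant 1: repeat the sub-patterns with string multiplication.
by_repeat = "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$"

# Variant 2: list the groups and join them with the separator.
by_join = "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"

print(by_repeat == handwritten, by_join == handwritten)

line = "11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81"
print(re.match(by_join, line).groups())
```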

Or, depending on your input, you could split on something other than single spaces, such as runs of two or more spaces, \s{2,} (as suggested in the comments), or tabs, but this could cause problems if the description contains those, too. Using a fixed number of fields "around" the description might be more reliable.
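A sketch of the split-based alternative, assuming fields are always separated by at least two spaces (the helper name `split_fields` and both sample lines are illustrative). The second call shows the failure mode described above: a double space inside the description produces an extra field.

```python
import re

def split_fields(line):
    """Split a line on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())

good = "10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75"
print(len(split_fields(good)))  # 9 fields

# Hypothetical line with a double space inside the description.
bad = "10   79    QUANT.  DE ITENS A FORNECER       O    N     9    0   67  75"
print(len(split_fields(bad)))   # 10 fields: the description was split in two
```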

tobias_k answered Dec 11 '25 23:12



