
How to fix a regex that attempts to capture a word and an id?

I have a .txt file whose content looks like the string s below. Each line of s consists of word_1 followed by word_2, an id and a number:

word_1 word_2 id number

I would like to create a regex that captures, in a list, all the occurrences of the word "nunca" followed by the id VM_ _ _ _, where each _ stands for a free character of the id string. The constraint is that the "nunca" line and the VM_ _ _ _ line must appear one directly after the other, e.g.:

nunca nunca RG 0.293030
first_word second_word VM223FDS 0.902333
error errpr RG 0.345355667
nunca nunca RG 0.1489098
ninguna ninguno DI0S3DF 0.345344
third fourth VM34SDF 0.7865489

The first two lines are the pattern I would like to extract, since they are placed one after another. This is the desired output, as a list:

[(nunca,RG),(second_word, VM223FDS)]

For example, this would be wrong, since the two lines are not one after another:

nunca nunca RG 0.293030
prendas prenda NCFP000 0.95625
success success VM23434SDF 0.902333

So for the s string:

s = '''Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
con con SPS00 1
la el DA0FS0 0.972269
que que PR0CN000 0.562517
sorprender sorprender VMN0000 1
a a SPS00 0.996023
una uno DI0FS0 0.951575
persona persona NCFS000 0.98773
muy muy RG 1
especial especial AQ0CS0 1
para para SPS00 0.999103
nosotros nosotros PP1MP000 1
, , Fc 1
y y CC 0.999962
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1

Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
no no RN 0.998134
me me PP1CS000 0.89124
equivoco equivocar VMIP1S0 1
. . Fp 1

Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
nada nada RG 0.135196
mal mal RG 0.497537
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
error error NCFSD23 0.234930
error error VMDFG34 0.98763
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
error error DI0S3DF 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
jamas jamas RG 0.343443
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
ver ver VMN0000 0.997382
con con SPS00 1
la el DA0FS0 0.972269
lavadora lavadora NCFS000 0.414738
de de SPS00 0.999984
nunca nunca RG 0.903
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
error error VM9032 0.234323
string string VMWEOO 0.03444
padres padre NCMP000 1
Además además NP00000 1
incluye incluir VMIP3S0 0.994868
la el DA0FS0 0.972269
tecnología tecnología NCFS000 1
error errpr RG2303 1
Textileprotec textileprotec NP00000 1
que que PR0CN000 0.562517
protege proteger VMIP3S0 0.994868
nuestras nuestro DP1FPP 0.994186
ninguna ninguno DI0S3DF 0.345344
falla falla NCFSD23 1
prendas prenda NCFP000 0.95625
más más RG 1
preciadas preciar VMP00PF 1
jamas jamas RG2303 1
string string VM9032 0.234323
nunca nunca RG 0.293030
success success VM23SDF 0.902333
. . Fp 1'''

This is what I tried:

import re
pattern__ = re.findall(r'(?m)^.*?\b(nunca)\s+(\S+)\s+[0-9.]+\n.*?\s(\S+)\s+(VM\S+)\s+[0-9.]+$', s)

print pattern__

The problem with this approach is that it returns an empty list: []. Any idea how to fix this in order to obtain:

[('nunca','RG'),('success','VM23SDF')]

Thanks in advance guys!

john doe asked Apr 22 '15 22:04




3 Answers

Assuming the format is uniform, and if I understand correctly that it is enough to search only for word_2, the regex can be very simple:

import re
# Raw string so that \s, \S and \d reach the regex engine unchanged.
regex = re.compile(r"(nunca)\s(\S+)\s\d\S*\n\S+\s(\S+)\s(VM)", re.MULTILINE)
regex.findall(string)

I'm not a Python user, I tested my regex here

UPDATE: After the correction John made, the new regex is:

regex = re.compile(r"(nunca)\s(\S+)\s\d\S*\n\S+\s(\S+)\s(VM)(\S+)?", re.MULTILINE)
regex.findall(string)

This way you'll be able to catch both VM and the rest of the ID separately. If you want them together, simply change the last two groups to (VM\S+).
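A minimal sketch of the combined (VM\S+) form, assuming Python 3. The `sample` string and the `pattern` and `matches` names are illustrative and not part of the question; the sample only reuses a few lines from the end of s.

```python
import re

# Illustrative sample in the question's "word_1 word_2 id number" layout.
sample = (
    "string string VM9032 0.234323\n"
    "nunca nunca RG 0.293030\n"
    "success success VM23SDF 0.902333\n"
    ". . Fp 1\n"
)

# Same regex as above, with "VM" and the rest of the id in a single group.
pattern = re.compile(r"(nunca)\s(\S+)\s\d\S*\n\S+\s(\S+)\s(VM\S+)")

# findall returns one 4-tuple per match: (nunca, id, word_2, VM id).
matches = pattern.findall(sample)
print(matches)  # [('nunca', 'RG', 'success', 'VM23SDF')]
```

The first "nunca" on the line is skipped because the token after its word_2 is not a number, so the match starts at the second "nunca".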

ColOfAbRiX answered Nov 15 '22 08:11


You could parse the file line by line checking a pair of lines each time:

import csv
with open("in.txt") as f:
    reader = csv.reader(f,delimiter=" ")
    prev = next(reader)
    for row in reader:
        if "VM" in row and "nunca" in prev:
            nun, val, = prev[-3:-1]
            wrd, i = row[-3:-1]
            print(nun, val, wrd, i)
        prev = row

('nunca', 'RG', 'success', 'VM')

Roughly 16 times faster than using a regex in this benchmark (936 µs vs. 59 µs per loop):

In [1]: %%timeit
   ...: with open("test.txt") as f:
   ...:     import re
   ...:     pr= re.findall(ur'.*?\b(nunca)\s+(\S+)\s+[0-9.]+[\r\n]+\S+\s+(\S+)\s+(VM)\s+[0-9.]+',f.read())
   ...: 
1000 loops, best of 3: 936 µs per loop

In [2]: import csv

In [3]: %%timeit
   ...: with open("test.txt") as f:
   ...:     reader = csv.reader(f,delimiter=" ")
   ...:     prev = next(reader)
   ...:     for row in reader:
   ...:         if "VM" in row and "nunca" in prev:
   ...:             nun, val, = prev[-3:-1]
   ...:             wrd, i = row[-3:-1]
   ...: 
10000 loops, best of 3: 59 µs per loop

For your update:

import csv
with open("in.txt") as f:
    reader = csv.reader(f,delimiter=" ")
    prev = next(reader)
    for row in reader:
        if len(row) < 2:
            continue
        if row[-2].startswith("VM") and "nunca" in prev:
            nun, val, = prev[-3:-1]
            wrd, i = row[-3:-1]
            print(nun, val, wrd, i)
        prev = row

('nunca', 'RG', 'success', 'VM23SDF')

Based on your input, it seems VM___ is always the second-to-last element. If it can be anywhere in the row, use:

if any(ele.startswith("VM") for ele in row):
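
A sketch of that variant on an in-memory string, assuming Python 3. The `find_pairs` function and the `sample` string are hypothetical names introduced for illustration, and `str.split` stands in for `csv.reader` to keep the example short. It also assumes that the "nunca" line keeps the four-column layout, so word_2 and the id sit at indexes 1 and 2.

```python
def find_pairs(text):
    """Collect (word_2, id, word_2, VM id) for a 'nunca' line directly followed by a VM line."""
    results = []
    prev = []  # fields of the previous line; a blank line resets it to []
    for line in text.splitlines():
        row = line.split()
        # The VM id may be in any column, so check every field.
        vm_ids = [field for field in row if field.startswith("VM")]
        if vm_ids and len(row) >= 2 and len(prev) >= 3 and "nunca" in prev:
            results.append((prev[1], prev[2], row[1], vm_ids[0]))
        prev = row
    return results


# The first three lines are the "wrong" case from the question (not adjacent);
# the last two lines are an adjacent nunca/VM pair.
sample = """nunca nunca RG 0.293030
prendas prenda NCFP000 0.95625
success success VM23434SDF 0.902333
nunca nunca RG 0.293030
success success VM23SDF 0.902333
"""

print(find_pairs(sample))  # [('nunca', 'RG', 'success', 'VM23SDF')]
```

Unlike the loop above, this sketch treats a blank line as a break between rows, so a "nunca" line and a VM line separated by an empty line are not reported.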
Padraic Cunningham answered Nov 15 '22 10:11


I guess this regex helps:

ur'.*?\b(nunca)\s+(\S+)\s+[0-9.]+[\r\n]+\S+\s+(\S+)\s+(VM\S+)\s+[0-9.]+'

See demo.
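
The question asks for a list of pairs, while `re.findall` with four groups returns one 4-tuple per match. A minimal sketch of the reshaping step, assuming Python 3 (so the pattern uses an `r` prefix instead of `ur`). The `sample`, `pattern` and `pairs` names are illustrative only.

```python
import re

# Illustrative sample built from the last lines of s in the question.
sample = (
    "jamas jamas RG2303 1\n"
    "string string VM9032 0.234323\n"
    "nunca nunca RG 0.293030\n"
    "success success VM23SDF 0.902333\n"
    ". . Fp 1\n"
)

pattern = re.compile(
    r".*?\b(nunca)\s+(\S+)\s+[0-9.]+[\r\n]+\S+\s+(\S+)\s+(VM\S+)\s+[0-9.]+"
)

# Split each 4-tuple into the two (word, id) pairs the question asks for.
pairs = []
for nun, tag, word, vm_id in pattern.findall(sample):
    pairs.extend([(nun, tag), (word, vm_id)])

print(pairs)  # [('nunca', 'RG'), ('success', 'VM23SDF')]
```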

Wiktor Stribiżew answered Nov 15 '22 08:11