
python re.sub newline multiline dotall

I have a CSV file containing the following lines (note the newline, \n):

"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección

I am trying to delete all those commas and move the address up one row. In Python, I am using this:

with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
    text = e.read()
    text = str(text)
    re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
    replace = re.sub('<a/>*D','<a/>",D',text) # fix commas between fields
    replace = str(replace)
    ee.write(replace)
f.close()

As far as I know, re.MULTILINE and re.DOTALL are needed to handle the \n. I am using re.compile because it is the only way I know to pass them, although compiling should obviously not be necessary here.

How can I end up with this text?

"<a>https://google.com</a>",Dirección
Abueesp asked Aug 14 '15 20:08


2 Answers

You don't need the compile statement at all, because you never use its result. re.sub accepts either a compiled pattern or a raw pattern string. You also don't need the MULTILINE flag: it only changes how the ^ and $ metacharacters are interpreted, and your pattern uses neither.

The heart of the problem is that the flags are compiled into a pattern object, but that object is discarded. Since your re.sub call is given the raw pattern string instead, the flags never take effect.
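A minimal sketch of that difference, using an in-memory string modeled on the CSV in the question (the run of commas is shortened, and the replacement includes the comma from the desired output):

```python
import re

# Sample text modeled on the question's CSV; the commas are shortened.
text = '"<a>https://google.com</a>",,,,\n,,Dirección'

# Compiling and throwing the result away does not affect later calls.
re.compile('</a>".*D', re.DOTALL)
unchanged = re.sub('</a>".*D', '</a>",D', text)  # no DOTALL: . stops at \n

# Keeping the compiled pattern carries the flag along with it.
pattern = re.compile('</a>".*D', re.DOTALL)
fixed = pattern.sub('</a>",D', text)

print(unchanged == text)  # the first call found nothing to replace
print(fixed)
```

The first call leaves the text untouched because, without DOTALL, the dot cannot cross the newline between the two rows.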

One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.

Here's what worked for me:

import re

with open('Reutput.csv') as e:
    text = e.read()  # read() already returns a str
    # DOTALL lets . match the newline between the two rows
    s = re.compile('</a>".*D', re.DOTALL)
    replace = re.sub(s, '</a>",D', text)  # fix commas between fields
    print(replace)

If you call re.sub without compiling, pass the flag through the flags keyword (the fourth positional parameter of re.sub is count, not flags):

re.sub('</a>".*D', '</a>",D', text, flags=re.DOTALL)
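As a rough illustration, the uncompiled call can be checked against a shortened sample string. The inline (?s) form is an alternative way to switch on DOTALL that the answer does not mention, and the replacement here keeps the comma so the result matches the output asked for in the question:

```python
import re

# Sample text modeled on the question's CSV; the commas are shortened.
text = '"<a>https://google.com</a>",,,,\n,,Dirección'

# The flag goes in by keyword, since the fourth positional slot is count.
by_keyword = re.sub('</a>".*D', '</a>",D', text, flags=re.DOTALL)

# An inline (?s) at the start of the pattern also enables DOTALL.
inline = re.sub('(?s)</a>".*D', '</a>",D', text)

print(by_keyword)
print(by_keyword == inline)
```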

I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write

replace = ''.join(c for c in text if c not in ',\n')
saulspatz answered Oct 21 '22 20:10


When you use re.compile, you need to keep the returned regular expression object and call sub on it. Your pattern also needs .* to match any run of characters: in '<a/>*D' the * applies only to the > before it, and the closing tag is written </a>, not <a/>. The re.MULTILINE flag only affects the start and end anchors (^ and $), so you do not need it here.

regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)

That should work. You don't need to convert replace to a string since it is already a string.
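To see what each flag actually changes, here is a small sketch on a made-up two-line string (the sample text is an assumption, chosen only to contain a newline):

```python
import re

sample = 'first,\nDirección'  # hypothetical two-line input

# MULTILINE changes ^ and $ so they also match at line boundaries.
start_default = re.findall('^D', sample)
start_multiline = re.findall('^D', sample, flags=re.MULTILINE)

# DOTALL changes . so that it also matches the newline character.
dot_default = re.search('t.*D', sample)
dot_dotall = re.search('t.*D', sample, flags=re.DOTALL)

print(start_default, start_multiline)
print(dot_default, repr(dot_dotall.group()))
```

Only DOTALL lets a match span the line break, which is why it is the flag that matters for this problem.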

Alternatively, you can write a regular expression that doesn't use . at all:

replace = re.sub('"(,|\n)*D','",D',text)
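One reason this form may be preferable is that a greedy .* with DOTALL runs to the last D in the whole file. A sketch with a second record added to the input (the example.com row and "Domicilio" are made up for illustration):

```python
import re

# Two records; the second one is hypothetical, added for illustration.
text = (
    '"<a>https://google.com</a>",,,\n,,Dirección\n'
    '"<a>https://example.com</a>",,,\n,,Domicilio\n'
)

# Greedy .* with DOTALL swallows everything up to the last D.
greedy = re.sub('</a>.*D', '</a>",D', text, flags=re.DOTALL)

# Matching only commas and newlines keeps each record separate.
narrow = re.sub('"(,|\n)*D', '",D', text)

print(greedy)
print(narrow)
```

With the narrow pattern each address is pulled up onto its own row, while the greedy version collapses both records into one line.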
fizzyh2o answered Oct 21 '22 20:10