
Extracting string between quotes split across multiple lines in Python

I have a file containing multiple entries. Each entry is of the following form:

"field1","field2","field3","field4","field5"

All of the fields are guaranteed not to contain any quotes, but they can contain commas. The problem is that field4 can be split across multiple lines, so an example file can look like:

"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"

I want to extract the fields using Python. If the field were not split across multiple lines, this would be simple (see "Extract string from between quotations"). But I can't find a simple way to do this in the presence of multiline fields.

EDIT: There are actually five fields. Sorry about the confusion if any. The question has been edited to reflect this.

Subhasis Das asked Aug 31 '13 22:08


2 Answers

I think the csv module can solve this problem. It correctly handles quoted fields that contain newlines:

import csv

# newline='' lets the csv module handle newlines embedded in quoted fields
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))

It yields:

-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz
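
To make the record structure explicit, here is a minimal sketch that unpacks each row into its five fields. It assumes the sample data from the question; `SAMPLE` and `read_records` are hypothetical names used only for illustration, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Sample data mirroring the question; io.StringIO stands in for the real file.
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)


def read_records(stream):
    """Return one list of fields per logical record, skipping empty rows."""
    return [row for row in csv.reader(stream) if row]


records = read_records(io.StringIO(SAMPLE))

# Each record unpacks into the five fields described in the question.
for field1, field2, field3, field4, field5 in records:
    # Join the pieces of a multiline field4 with spaces for display.
    print(field1, '|', ' '.join(field4.splitlines()), '|', field5)
```

With a real file, the same function could be called on a handle opened with `newline=''`, as in the snippet above.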
Birei answered Sep 20 '22 17:09


The answer from the question you linked worked for me:

import re

with open("test.txt") as f:
    text = f.read()

# Capture everything between each pair of double quotes, newlines included
string_list = re.findall(r'"([^"]*)"', text)

At this point, string_list contains your strings. These strings can still have line breaks in them, but you can use

string_list = [s.replace("\n", " ") for s in string_list]

to clean that up.
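
The regex approach returns one flat list of fields, so the entries still have to be regrouped. A minimal sketch of one way to do that, assuming every field is quoted, no field contains a quote, and each entry has exactly five fields (as the question states); `SAMPLE`, `FIELDS_PER_RECORD`, and `parse_records` are hypothetical names:

```python
import re

# Sample data mirroring the question.
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

FIELDS_PER_RECORD = 5  # assumption taken from the question


def parse_records(text, width=FIELDS_PER_RECORD):
    """Extract quoted fields and group them into records of `width` fields."""
    # [^"]* also matches newlines, so multiline fields are captured whole.
    fields = re.findall(r'"([^"]*)"', text)
    if len(fields) % width:
        raise ValueError('field count is not a multiple of %d' % width)
    # Replace embedded line breaks with spaces.
    cleaned = [field.replace('\n', ' ') for field in fields]
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


records = parse_records(SAMPLE)
for record in records:
    print(record)
```

Unlike the csv module, this sketch does not handle escaped quotes or unquoted fields, so it only fits input that meets the guarantees above.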

Mark R. Wilkins answered Sep 17 '22 17:09