Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

parsing a string in python: how to split newlines while ignoring newline inside quotes

I have a text that i need to parse in python.

It is a string where i would like to split it to a list of lines, however, if the newlines (\n) is inside quotes then we should ignore it.

for example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to it with split('\n'), but i don't know how to ignore the quotes.

Any idea?

Thanks!

like image 860
Yuval Atzmon Avatar asked Jun 03 '14 15:06

Yuval Atzmon


People also ask

Can you split () by a newline Python?

You can use the Python string split() function to split a string (by a delimiter) into a list of strings. To split a string by newline character in Python, pass the newline character "\n" as a delimiter to the split() function.

How do you split a string by space and newline in Python?

Use split() method to split by delimiter. If the argument is omitted, it will be split by whitespace, such as spaces, newlines \n , and tabs \t . Consecutive whitespace is processed together. A list of the words is returned.

How do you separate a string from a new line?

To split a string by newline, call the split() method passing it the following regular expression as parameter - /\r?\ n/ . The split method will split the string on each occurrence of a newline character and return an array containing the substrings. Copied!


Video Answer


4 Answers

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines".

Example:

import re
re.findall('(?:"[^"]*"|.)+', text)

NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z).

The (?!\Z) is a confusing way to say "not the end of a string". The (?! ) is negative lookahead; the \Z is the "end of a string" part.


Tests:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"
like image 136
Veedrac Avatar answered Oct 20 '22 21:10

Veedrac


You can split it, then reduce it to put together the elements that have an odd number of " :

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

Explication:

if x[-1].count('"') % 2 == 1
# If there is an odd number of quotes to the last handled element
x[:-1] + [x[-1] + y]
# Append y to this element
else x + [y]
# Else append the element to the handled list

Can also be written like so:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

As pointed out by @Veedrac, this is O(n^2), but this can be prevented by keeping track of the count of ":

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

(The last empty string is because of the last \n at the end of the input string.)

like image 4
njzk2 Avatar answered Oct 20 '22 19:10

njzk2


Ok, this seems to work (assuming quotes are properly balanced):

rx = r"""(?x)
    \n
    (?!
        [^"]*
        "
        (?=
            [^"]*
            (?:
                " [^"]* "
                [^"]*
            )*
            $
        )
    )
"""

Test:

str = """\
first
second "qqq
     qqq
     qqq
     " line
"third
    line" AND "spam
        ham" AND "more
            quotes"
end \
"""

import re


for x in re.split(rx, str):
    print '[%s]' % x

Result:

[first]
[second "qqq
     qqq
     qqq
     " line]
["third
    line" AND "spam
        ham" AND "more
            quotes"]
[end ]

If the above looks too weird for you, you can also do this in two steps:

str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
lines = [x.replace('\x01', '\n') for x in str.splitlines()]

for line in lines:
    print '[%s]' % line  # same result
like image 1
georg Avatar answered Oct 20 '22 21:10

georg


There are many ways to accomplish that. I came up with a very simple one:

splitted = [""]
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        splitted[-1] += '"{0}"'.format(x)
like image 1
igortg Avatar answered Oct 20 '22 20:10

igortg