Parsing a string in Python: how to split on newlines while ignoring newlines inside quotes

I have some text that I need to parse in Python.

It is a string that I would like to split into a list of lines. However, a newline (\n) that is inside quotes should be ignored.

For example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to do it with split('\n'), but I don't know how to ignore the quotes.

Any ideas?

Thanks!

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines".

Example:

import re
re.findall('(?:"[^"]*"|.)+', text)

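NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z). Note that the null case matches at every unquoted newline, so re.findall also returns an empty string after each non-blank line, not only for blank lines.

The (?!\Z) is a confusing way to say "not the end of the string". The (?! ) is a negative lookahead; the \Z is the "end of the string" part.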
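
As a hedged alternative for keeping blank lines, each match can consume its own terminating newline, so that an empty line yields exactly one empty string. This is only a sketch: the pattern and the name `split_keep_blank` are illustrative assumptions rather than part of the answer above, and a trailing newline does not produce a final empty entry (similar to `str.splitlines()`).

```python
import re

# Hypothetical pattern: capture a (possibly empty) run of quoted chunks or
# non-newline characters, then consume the newline or the end of the string.
# The leading (?!\Z) prevents one extra empty match at the very end.
LINE_RE = re.compile(r'(?!\Z)((?:"[^"]*"|[^\n])*)(?:\n|\Z)')

def split_keep_blank(text):
    """Split on newlines outside double quotes, keeping blank lines."""
    return LINE_RE.findall(text)

print(split_keep_blank('a\n\n"b\nc"\nd\n'))
# ['a', '', '"b\nc"', 'd']
```

Because the pattern has a single capturing group, `findall` returns just the line bodies, without the newline that ended each one.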

Tests:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"

You can split it on every newline, then use reduce to join back together the elements that contain an odd number of " characters:

from functools import reduce  # needed on Python 3, where reduce is not a builtin

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

Explanation:

if x[-1].count('"') % 2 == 1
# If the last handled element contains an odd number of quotes
x[:-1] + [x[-1] + '\n' + y]
# Append y to this element, restoring the newline removed by split
else x + [y]
# Otherwise, add y to the handled list as a new element

Can also be written like so:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

As pointed out by @Veedrac, this is O(n^2) because the quotes in the growing last element are recounted on every iteration. This can be prevented by keeping a running count of the " characters:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

(The last empty string is because of the last \n at the end of the input string.)
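
The running-count idea can also be sketched as a single pass that buffers the physical lines and joins them once per logical line, which avoids repeatedly rebuilding the last string. This is a minimal sketch; the generator name `iter_lines_outside_quotes` is hypothetical, and emitting the leftover text when a quote is never closed is an assumption about the desired behaviour.

```python
def iter_lines_outside_quotes(text):
    """Yield lines, treating newlines inside double quotes as ordinary text."""
    parts = []         # physical lines that make up the current logical line
    in_quotes = False  # True while an opening quote is still unmatched
    for piece in text.split('\n'):
        parts.append(piece)
        # An odd number of quotes in this piece flips the quoting state.
        if piece.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if not in_quotes:
            yield '\n'.join(parts)
            parts = []
    if parts:
        # Unbalanced quote: emit what is left rather than dropping it.
        yield '\n'.join(parts)

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
print(list(iter_lines_outside_quotes(txt)))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```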

Ok, this seems to work (assuming quotes are properly balanced):

rx = r"""(?x)
    \n
    (?!
        [^"]*
        "
        (?=
            [^"]*
            (?:
                " [^"]* "
                [^"]*
            )*
            $
        )
    )
"""

Test:

str = """\
first
second "qqq
     qqq
     qqq
     " line
"third
    line" AND "spam
        ham" AND "more
            quotes"
end \
"""

import re


for x in re.split(rx, str):
    print('[%s]' % x)

Result:

[first]
[second "qqq
     qqq
     qqq
     " line]
["third
    line" AND "spam
        ham" AND "more
            quotes"]
[end ]
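
For readers who find the pattern hard to follow, here is the same lookahead idea with inline comments, applied to the sample input from the question. It is a sketch that relies on the same assumption of balanced quotes; the names `SPLIT_RE` and `sample` are illustrative only.

```python
import re

SPLIT_RE = re.compile(r'''(?x)
    \n                        # split at a newline ...
    (?!                       # ... unless it is followed by
        [^"]* "               #   text up to the next quote, after which
        (?=                   #   the rest of the string contains
            [^"]*
            (?: " [^"]* " [^"]* )*   # only complete quoted pairs
            $
        )
    )                         # i.e. an odd number of quotes remain
''')

sample = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
print(SPLIT_RE.split(sample))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```

An odd number of quotes after a newline means the newline sits inside an open quote, so it is not used as a split point. The trailing empty string comes from the final newline of the input.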

If the above looks too weird for you, you can also do this in two steps:

str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
lines = [x.replace('\x01', '\n') for x in str.splitlines()]

for line in lines:
    print('[%s]' % line)  # same result
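
The two-step approach could be wrapped in a function along these lines. This is a sketch that assumes the placeholder character '\x01' never occurs in the input; the names `PLACEHOLDER` and `split_two_step` are hypothetical.

```python
import re

PLACEHOLDER = '\x01'  # assumption: this character never appears in the input

def split_two_step(text):
    # Step 1: hide the newlines that sit inside a pair of double quotes.
    masked = re.sub(r'"[^"]*"',
                    lambda m: m.group(0).replace('\n', PLACEHOLDER),
                    text)
    # Step 2: split on the remaining line breaks, then restore the hidden ones.
    return [line.replace(PLACEHOLDER, '\n') for line in masked.splitlines()]

print(split_two_step('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890']
```

Since `splitlines()` does not return a trailing empty string for a final newline, this output matches the list asked for in the question.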

There are many ways to accomplish that. I came up with a very simple one:

import re

splitted = [""]
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        splitted[-1] += '"{0}"'.format(x)
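
A sketch of the same idea as a reusable function, using `str.split` instead of `re.split` since the delimiter is a plain character. The name `split_outside_quotes` is hypothetical, and balanced quotes are assumed: with an unmatched opening quote, this approach would append a closing quote that was not in the input.

```python
def split_outside_quotes(text):
    lines = ['']
    for i, chunk in enumerate(text.split('"')):
        if i % 2 == 0:
            # Even-numbered chunks lie outside quotes: newlines end a line.
            head, *rest = chunk.split('\n')
            lines[-1] += head
            lines.extend(rest)
        else:
            # Odd-numbered chunks were inside quotes: keep their newlines.
            lines[-1] += '"' + chunk + '"'
    return lines

print(split_outside_quotes('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```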

Tags: python, regex, parsing

Yuval Atzmon

4 Answers

Veedrac

njzk2

georg

igortg
