I have a text that I need to parse in Python.
It is a string that I would like to split into a list of lines; however, if a newline (\n) is inside quotes, it should be ignored.
For example:
abcd efgh ijk\n1234 567"qqqq\n---" 890\n
should be parsed into a list of the following lines:
abcd efgh ijk
1234 567"qqqq\n---" 890
I've tried to do it with split('\n'), but I don't know how to ignore the quotes.
Any idea?
Thanks!
You can use the Python string split() method to split a string by a delimiter into a list of strings. Passing the newline character "\n" as the delimiter splits at every newline, including newlines inside quotes.
If the argument is omitted, split() splits on whitespace such as spaces, newlines (\n), and tabs (\t). Consecutive whitespace is treated as a single separator, and a list of the words is returned.
Note that str.split() takes a literal delimiter, not a regular expression. To split on a pattern such as \r?\n, use re.split(), which splits the string on each occurrence of the pattern and returns a list of the substrings, again without regard to quotes.
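To make the limitation concrete, here is a small check using the input from the question (the variable names are only for illustration). None of the plain splitting calls know about quotes, so the quoted newline gets split too.

```python
import re

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'

# Splitting on every newline also cuts the quoted section in two.
plain = txt.split('\n')

# No argument: split on runs of whitespace, so words come back, not lines.
words = txt.split()

# re.split accepts a regular expression, but is just as unaware of quotes.
by_regex = re.split(r'\r?\n', txt)

print(plain)
print(words)
print(by_regex)
```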
Here's a much easier solution.
Match groups of (?:"[^"]*"|.)+, namely "things in quotes or things that aren't newlines".
Example:
import re
re.findall('(?:"[^"]*"|.)+', text)
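As a quick sanity check, this is roughly what the pattern does with the input from the question (the variable names `text` and `lines` are just placeholders):

```python
import re

text = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'

# Each match is a run of quoted chunks ("...") or single non-newline
# characters. "." does not match "\n" by default, so only a newline
# outside quotes can end a match.
lines = re.findall(r'(?:"[^"]*"|.)+', text)
print(lines)
```

The quoted newline stays inside the second element, and the trailing newline produces no extra entry.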
NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z).
The (?!\Z) is a confusing way to say "not the end of a string". The (?! ) is negative lookahead; the \Z is the "end of a string" part.
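The individual pieces mentioned in the note can be checked in isolation. The sketch below only looks at the blank-line behaviour of the basic pattern and at what (?!\Z) and \Z mean on their own; the sample strings are made up for illustration.

```python
import re

line_re = re.compile(r'(?:"[^"]*"|.)+')

# The "+" needs at least one character, so an empty line produces no match.
no_blanks = line_re.findall('a\n\nb')

# (?!\Z) succeeds at any position that is not the very end of the string.
not_at_end = re.search(r'a(?!\Z)', 'ab')   # "a" is followed by "b"
at_end = re.search(r'b(?!\Z)', 'ab')       # "b" is the last character

# Unlike "$", \Z does not match just before a trailing newline.
dollar = re.search(r'a$', 'a\n')
slash_z = re.search(r'a\Z', 'a\n')

print(no_blanks, bool(not_at_end), bool(at_end), bool(dollar), bool(slash_z))
```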
Tests:
import re
texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)
line_matcher = re.compile('(?:"[^"]*"|.)+')
for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))
#>>> text → text
#>>> "text" → "text"
#>>> text\ntext → text [LINE] text
#>>> "text\ntext" → "text\ntext"
#>>> text"text\ntext"text → text"text\ntext"text
#>>> text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>> "\n"\ntext"text" → "\n" [LINE] text"text"
#>>> "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"
You can split it, then reduce it to join each element that has an odd number of " with the element that follows it:
from functools import reduce  # reduce is not a builtin in Python 3
txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
Explanation:
if x[-1].count('"') % 2 == 1
# If the last handled element contains an odd number of quotes
x[:-1] + [x[-1] + '\n' + y]
# Append y to this element, restoring the newline that split() removed
else x + [y]
# Else append y to the handled list as a new element
Can also be written like so:
def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
As pointed out by @Veedrac, this is O(n^2), because res[-1].count('"') rescans the whole accumulated element on every iteration. This can be prevented by keeping track of the count of ":
def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
(The last empty string is because of the last \n at the end of the input string.)
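If that trailing empty string is unwanted, one option is to drop a single trailing newline before splitting. The snippet below is only a sketch of that idea on a simplified input without quotes; the same trimming could be applied before calling any of the functions above.

```python
txt = 'abcd efgh ijk\n1234 567\n'

# str.split keeps the empty piece after the final separator...
with_empty = txt.split('\n')

# ...so one option is to drop a single trailing newline first.
trimmed = txt[:-1] if txt.endswith('\n') else txt
without_empty = trimmed.split('\n')

# str.splitlines() never reports that trailing empty piece, but it also
# splits on other line boundaries such as "\r\n".
via_splitlines = txt.splitlines()

print(with_empty, without_empty, via_splitlines)
```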
Ok, this seems to work (assuming quotes are properly balanced):
rx = r"""(?x)
    \n
    (?!
        [^"]*
        "
        (?=
            [^"]*
            (?:
                " [^"]* "
                [^"]*
            )*
            $
        )
    )
"""
Test:
str = """\
first
second "qqq
qqq
qqq
" line
"third
line" AND "spam
ham" AND "more
quotes"
end \
"""
import re
for x in re.split(rx, str):
    print('[%s]' % x)
Result:
[first]
[second "qqq
qqq
qqq
" line]
["third
line" AND "spam
ham" AND "more
quotes"]
[end ]
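For comparison with the other answers, here is the same pattern applied to the input from the question. It is written on one line here (without the verbose flag) purely for compactness; `txt` and `parts` are illustrative names.

```python
import re

# A newline that is NOT followed by a quote which in turn has an even
# number of quotes after it, i.e. a newline that is not inside quotes.
rx = r'\n(?![^"]*"(?=[^"]*(?:"[^"]*"[^"]*)*$))'

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
parts = re.split(rx, txt)
print(parts)
```

As with the split-based answers, the final newline leaves an empty string at the end of the list.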
If the above looks too weird for you, you can also do this in two steps, temporarily replacing quoted newlines with a placeholder character (\x01 here, assumed not to occur in the input):
str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
lines = [x.replace('\x01', '\n') for x in str.splitlines()]
for line in lines:
    print('[%s]' % line)  # same result
There are many ways to accomplish that. I came up with a very simple one:
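import re
splitted = [""]
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        # Outside quotes: newlines here are real line breaks
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        # Inside quotes: keep the text, newlines included, on the current line
        splitted[-1] += '"{0}"'.format(x)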
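Wrapped in a function, the same idea might look like the sketch below. The name `split_outside_quotes` is made up for this example, and it uses str.split instead of re.split since the separator is a literal character.

```python
def split_outside_quotes(text):
    """Split text on newlines that are not inside double quotes."""
    result = [""]
    # Pieces at even indexes lie outside quotes, odd indexes inside.
    for i, piece in enumerate(text.split('"')):
        if i % 2 == 0:
            lines = piece.split('\n')
            result[-1] += lines[0]      # continue the current line
            result.extend(lines[1:])    # every further piece starts a new line
        else:
            result[-1] += '"' + piece + '"'  # keep quoted text, newlines included
    return result

parts = split_outside_quotes('abcd efgh ijk\n1234 567"qqqq\n---" 890\n')
print(parts)

# With an unbalanced quote, a closing quote is added to the output.
unbalanced = split_outside_quotes('a"b\nc')
print(unbalanced)
```

Note the second call: this approach assumes balanced quotes, and an unterminated quote comes back with an extra " appended.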