I have a csv file in the following format:
"4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"
The above data should be read as 3 rows with 4 columns: ID, product name, location, and description. As can be seen, there are newlines within descriptions for each row.
I've searched other related Stack Overflow questions, but none of the solutions seem to solve this issue.
Here is my attempt:
import csv
from io import StringIO
file = StringIO("""4931286","Lotion","New York","Bright color, yellow\n with 5" long 20% nylon""")
for row in csv.reader(file, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True):
    print(row)
And the results look like the following:
['4931286"', 'Lotion', 'New York', 'Bright color, yellow with 5 long']
['20% nylon']
But what I want is:
['4931286"', 'Lotion', 'New York', 'Bright color, yellow with 5 long 20% nylon']
How can I achieve this? Is there a way to do it in Python?
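The data is not valid CSV.
A literal " inside a quoted field must be escaped, either by doubling it ("") or, for a reader configured with a backslash escape character, with \ as in "Bright color, yellow\n with 5\" long 20% nylon".
If " is only used for inches (preceded by a digit), try this: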
import re
data = re.sub(r'([0-9])"(?![,\n])', r'\1\\"', data)
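This regex replaces each " with \" when it is preceded by a digit and not followed by a comma or a newline.
Then parse the data with csv.reader, passing escapechar='\\' so that the reader treats \" as a literal quote.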
Edit: Changed regex because of MaxU's suggestion.
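As a minimal sketch of the escape-then-parse idea on Python 3, assuming the inch mark is the only stray quote and that it always follows a digit: the regex escapes those quotes, and csv.reader is given escapechar='\\' so that it reads \" as a literal quote. The names raw, escaped, and rows, and the final whitespace-collapsing step, are illustrative additions rather than part of the answer above.

```python
import csv
import io
import re

# Sample data from the question; inch marks appear as bare quotes after a digit.
raw = '''"4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"
'''

# Escape a quote that follows a digit unless a comma or newline comes next.
escaped = re.sub(r'([0-9])"(?![,\n])', r'\1\\"', raw)

# escapechar tells the reader that \" is a literal quote inside a field.
reader = csv.reader(io.StringIO(escaped, newline=''), escapechar='\\')

rows = []
for row in reader:
    # Collapse the embedded newline (and any extra spaces) in the description.
    row[-1] = ' '.join(row[-1].split())
    rows.append(row)

for row in rows:
    print(row)
```

Unlike plain parsing of the raw data, this keeps the inch mark in the description. It would misfire on data where a digit followed by a quote legitimately ends a field at the very end of the input without a trailing newline.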
How about iterating over the parsed rows two at a time:
import csv
from io import StringIO

def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    # Zipping one iterator with itself yields consecutive, non-overlapping pairs.
    return zip(a, a)

file = StringIO(""""4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"
""")
reader = csv.reader(file, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
for row, row2 in pairwise(reader):
    # row2 is the continuation line; append it to the description column.
    row[-1] = ' '.join([row[-1], row2[0]])
    print(row)
# Output
['4931286', 'Lotion', 'New York', 'Bright color, yellow with 5 long 20% nylon"']
['931286', 'Shampoo', 'New York', 'Dark, yellow with 10 long 20% nylon"']
['3931286', 'Conditioner', 'LA', 'Bright color, yellow with 5 long 50% nylon"']
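The pairing approach relies on every record spanning exactly two physical lines. If that is not guaranteed, a different technique is to skip csv.reader, join physical lines until one ends with a closing quote, and then split on the "," separators. The sketch below is only an illustration under that assumption: join_records and n_fields are hypothetical names, and it would break if a non-final line of a description happened to end with a quote (for example an inch mark at the end of a line).

```python
def join_records(lines, n_fields=4):
    """Yield one list of fields per logical record (hypothetical helper)."""
    buffer = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        buffer.append(line)
        # Assumption: only the last physical line of a record ends with a quote.
        if line.endswith('"'):
            record = ' '.join(buffer)
            buffer = []
            # Drop the outer quotes, then split on the "," separators.
            # maxsplit keeps any later "," inside the description intact.
            yield record[1:-1].split('","', n_fields - 1)


sample = '''"4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"
'''

records = list(join_records(sample.splitlines()))
for record in records:
    print(record)
```

Because nothing is parsed as a quote character here, the inch mark stays in the description and no trailing quote is left behind.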