
Newlines inside double-quoted fields when reading CSV in Python

Tags: python, pandas, csv

I have a CSV file in the following format:

"4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"

The above data should be read as 3 rows with 4 columns: ID, product name, location, and description. As can be seen, there are newlines within descriptions for each row.

I've searched related Stack Overflow questions, but none of the solutions there seem to solve this issue.

Here is my attempt:

import csv
from io import StringIO

file = StringIO("""4931286","Lotion","New York","Bright color, yellow\n   with 5" long 20% nylon""")

for row in csv.reader(file, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True):
    print(row)

And the result looks like this:

['4931286"', 'Lotion', 'New York', 'Bright color, yellow with 5 long']
   ['20% nylon']

But what I want is:

['4931286"', 'Lotion', 'New York', 'Bright color, yellow with 5 long 20% nylon']

How can I achieve this? Is there a way to do it in Python?

asked Apr 24 '26 19:04 by user4279562


2 Answers

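The data is not valid CSV.

A literal " inside a quoted field has to be escaped. Standard CSV (RFC 4180) doubles it, as in "Bright color, yellow with 5"" long 20% nylon". Python's csv.reader also accepts a backslash escape, as in "Bright color, yellow with 5\" long 20% nylon", but only if you pass escapechar='\\'.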

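For comparison, here is a minimal Python 3 sketch of what well-formed input for this data would look like. It assumes you control how the file is written; the variable names are illustrative only. By default csv.writer escapes an embedded " by doubling it, and csv.reader then reads the field back intact, newline included.

```python
import csv
import io

# One record whose last field contains both a quote (inch mark) and a newline.
row = ['4931286', 'Lotion', 'New York',
       'Bright color, yellow with 5" long\n20% nylon']

# newline='' stops StringIO from translating line endings, as the csv docs advise.
buf = io.StringIO(newline='')
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
text = buf.getvalue()

# The embedded quote is written doubled ("") inside the quoted field.
print(text)

# Reading the text back recovers the original field, newline included.
parsed = next(csv.reader(io.StringIO(text, newline='')))
print(parsed)
```

If the file had been produced this way, csv.reader would need no special options to read it.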
If " appears inside fields only as an inch mark (directly after a digit), try this:

import re
data = re.sub(r'([0-9])"(?![,\n])', r'\1\\"', data)

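This regex replaces every " that directly follows a digit with \", unless that " is followed by a comma or a newline (that is, unless it closes a field).

Then parse the data with csv.reader, passing escapechar='\\' so that the reader treats \" as a literal quote.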

Edit: Changed regex because of MaxU's suggestion.
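Put together, the approach might look like the following Python 3 sketch. It assumes, as above, that an unescaped " inside a field always follows a digit and is never followed by a comma or newline. The names raw, escaped and rows are illustrative, and replacing the embedded newline with a space is an extra step added here to match the single-line output the question asks for.

```python
import csv
import io
import re

raw = (
    '"4931286","Lotion","New York","Bright color, yellow with 5" long\n'
    '20% nylon"\n'
    '"931286","Shampoo","New York","Dark, yellow with 10" long\n'
    '20% nylon"\n'
    '"3931286","Conditioner","LA","Bright color, yellow with 5" long\n'
    '50% nylon"\n'
)

# Backslash-escape a quote that follows a digit, unless the quote is
# followed by a comma or newline (in which case it closes a field).
escaped = re.sub(r'([0-9])"(?![,\n])', r'\1\\"', raw)

# escapechar tells the reader that \" is a literal quote, not a field end.
reader = csv.reader(io.StringIO(escaped, newline=''), escapechar='\\')

# Each description still contains its newline; flatten it to a space.
rows = [[field.replace('\n', ' ') for field in row] for row in reader]

for row in rows:
    print(row)
```

An inch mark that sits directly before a comma inside a description (for example 5", wide) would not be escaped by this pattern, so the sketch only suits data shaped like the sample.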

answered Apr 26 '26 13:04 by Simon Kirsten


How about iterating over the parsed rows two at a time? This assumes every record spans exactly two lines:

import csv
from io import StringIO

def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    # Python 3: the built-in zip is lazy, so it replaces itertools.izip
    return zip(a, a)


file = StringIO(""""4931286","Lotion","New York","Bright color, yellow with 5" long
20% nylon"
"931286","Shampoo","New York","Dark, yellow with 10" long
20% nylon"
"3931286","Conditioner","LA","Bright color, yellow with 5" long
50% nylon"
""")

reader = csv.reader(file,quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True)
for row, row2 in pairwise(reader):
    row[-1] = ' '.join([row[-1], row2[0]])
    print(row)

# Output
['4931286', 'Lotion', 'New York', 'Bright color, yellow with 5 long 20% nylon"']
['931286', 'Shampoo', 'New York', 'Dark, yellow with 10 long 20% nylon"']
['3931286', 'Conditioner', 'LA', 'Bright color, yellow with 5 long 50% nylon"']
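
In the output above the inch mark is dropped and a stray trailing " remains, because csv.reader is left to guess at the malformed quoting. A possible variant, sketched here in Python 3, pairs up the raw text lines first and then splits on the "," separator by hand. It assumes that every record spans exactly two lines, that every field is quoted, and that the sequence "," never occurs inside a field. The names raw, records and rows are illustrative.

```python
raw = (
    '"4931286","Lotion","New York","Bright color, yellow with 5" long\n'
    '20% nylon"\n'
    '"931286","Shampoo","New York","Dark, yellow with 10" long\n'
    '20% nylon"\n'
    '"3931286","Conditioner","LA","Bright color, yellow with 5" long\n'
    '50% nylon"\n'
)

lines = raw.splitlines()

# Join physical lines 0+1, 2+3, 4+5 into one logical record each.
records = [' '.join(pair) for pair in zip(lines[0::2], lines[1::2])]

# Drop the outermost quotes, then split on the quote-comma-quote separator.
rows = [record[1:-1].split('","') for record in records]

for row in rows:
    print(row)
```

This keeps the inch mark in the description, but it bypasses the csv module entirely, so it is only reasonable for input as regular as the sample.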

answered Apr 26 '26 11:04 by SparkAndShine


