
Python 3 reading CSV file with line breaks in rows

Tags: python, csv

I have a large CSV file with one column and line breaks in some of its rows. I want to read the content of each cell and write it to a text file, but the CSV reader splits the cells that contain line breaks into multiple rows and writes each one to a separate text file.

I am using Python 3.6.2 on macOS Sierra.

Here is an example:

"content of row 1"
"content of row 2 
 continues here"
"content of row 3"

And here is how I am reading it:

import csv

with open(csvFileName, 'r') as csvfile:
    lines = csv.reader(csvfile)
    for i, row in enumerate(lines, start=1):
        content = row[0]  # each row is a list; the file has a single column
        with open("output" + str(i) + ".txt", 'w') as outFile:
            outFile.write(content)

This creates 4 files instead of 3 (one per row). Any suggestions on how to ignore the line break in the second row?

Labibah asked Sep 05 '17 18:09


2 Answers

You could define a regular expression pattern to help you iterate over the rows.

Read the entire file contents - if possible.

s = '''"content of row 1"
"content of row 2 
 continues here"
"content of row 3"'''

Pattern - a double quote, followed by anything that isn't a double quote, followed by a double quote (this assumes the cells contain no escaped double quotes):

import re

row_pattern = '''"[^"]*"'''
# [^"] already matches line breaks, so no DOTALL/MULTILINE flags are needed
row = re.compile(row_pattern)

Iterate the rows:

for r in row.finditer(s):
    print(r.group())
    print('******')

>>> 
"content of row 1"
******
"content of row 2 
 continues here"
******
"content of row 3"
******
>>>
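
To get from these matches to the text files the question asks for, something along these lines might work. It is only a sketch: the capturing group, the temporary output directory and the `written` dictionary are illustrative additions, not part of the answer above, and it still assumes there are no escaped double quotes inside a cell.

```python
import re
import tempfile
from pathlib import Path

s = '"content of row 1"\n"content of row 2 \n continues here"\n"content of row 3"'

# Capture the text between the quotes so the files do not contain the quotes.
row = re.compile(r'"([^"]*)"')

written = {}  # file name -> text read back, kept only to inspect the result
with tempfile.TemporaryDirectory() as tmp:  # hypothetical output location
    for i, match in enumerate(row.finditer(s), start=1):
        out_path = Path(tmp) / f"output{i}.txt"
        out_path.write_text(match.group(1))
        written[out_path.name] = out_path.read_text()

print(sorted(written))
```

In real use the temporary directory would be replaced by wherever the output files should live.
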
wwii answered Nov 03 '22 13:11


The file you describe is not a typical CSV (comma-separated values) file. A CSV file is normally a list of records, one per line, where the fields within each record are separated by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).
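
As an aside on quoting: in Python's own csv module, a line break inside a quoted field is treated as part of that field, provided the file is opened with newline='' as the module documentation recommends. So it may be worth trying the plain reader on the sample from the question before building anything more elaborate. The sketch below assumes the cells are quoted exactly as in that sample; the temporary file and the name `cells` are just for illustration.

```python
import csv
import os
import tempfile

sample = '"content of row 1"\n"content of row 2 \n continues here"\n"content of row 3"\n'

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    with open(csv_path, "w", newline="") as f:
        f.write(sample)

    # newline='' leaves line-ending handling to the csv module
    with open(csv_path, "r", newline="") as csvfile:
        cells = [row[0] for row in csv.reader(csvfile)]

print(len(cells))
```

If this still yields too many rows on the real file, the cells are probably not quoted consistently, and some pre-processing is needed.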

I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records, and then pass those to your instance of csv.reader. You could model your class after StringIO from the Python standard library.

The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().

(Done properly, you can even implement the Python context manager protocol. io.StringIO supports this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement, just as you're doing with the file object returned by open() in your example code.)

from io import StringIO
import csv
data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])

That's just a simple example of using io.StringIO in a with statement. Note that io.StringIO requires Unicode text (str in Python 3), while io.BytesIO requires bytes (which is what a plain string is in 2.7.x). Your adapter class can do whatever you like.
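
One possible shape for such an adapter is sketched below. The class name `LineMergingReader` and the quote-counting rule are assumptions made for illustration: a record is considered complete once it contains an even number of quote characters. It relies on csv.reader accepting any iterable that yields strings.

```python
import csv
import io


class LineMergingReader:
    """Hypothetical adapter: yields one complete record per iteration."""

    def __init__(self, fileobj, quotechar='"'):
        self._fileobj = fileobj
        self._quotechar = quotechar

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._fileobj.close()
        return False  # do not swallow exceptions

    def __iter__(self):
        pending = ""
        for line in self._fileobj:
            pending += line
            # An even quote count means every quoted field has been closed.
            if pending.count(self._quotechar) % 2 == 0:
                yield pending
                pending = ""
        if pending:
            yield pending  # unterminated quote: hand over what is left


sample = '"content of row 1"\n"content of row 2 \n continues here"\n"content of row 3"\n'
with LineMergingReader(io.StringIO(sample)) as merged:
    rows = [row[0] for row in csv.reader(merged)]

print(len(rows))
```

Because the adapter only needs an object it can iterate over line by line, the io.StringIO here could be swapped for a file returned by open().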

Jim Dennis answered Nov 03 '22 15:11