How to use square brackets as a quote character in Pandas.read_csv

Tags:

Let's say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!

Update

There were three primary solutions that were offered: 1) Give a long range of names to the data frame to allow all data to be read in and then post-process the data, 2) Find values in square brackets and put quotes around it, or 3) replace the first n number of commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds. The tests were run on a text file containing 18 columns and more than 208,000 rows.

442

asked Mar 11 '16 19:03

tblznbits

Video Answer

2 Answers

We can use simple trick - quote balanced square brackets with double quotes:

import re
import six
import pandas as pd


data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))

pd.set_option('display.expand_frame_repr', False)
print(df)

Output:

----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]

UPDATE: if you are sure that all square brackets are balances, we don't have to use RegEx's:

import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)

answered Sep 29 '22 11:09

MaxU - stop WAR against UA

I think you can replace first 3 occurence of , in each line of file to ; and then use parameter sep=";" in read_csv:

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()
    fo = io.StringIO()
    fo.writelines(u"" + line.replace(',',';', 3) for line in lines)
    fo.seek(0)    

df = pd.read_csv(fo, sep=';')
print df
   Item        Date   Time                            Location
0     1  01/01/2016  13:41                 [45.2344:-78.25453]
1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2     3  01/10/2016  01:27                 [51.2344:-86.24432]

Or can try this complicated approach, because main problem is, separator , between values in lists is same as separator of other column values.

So you need post - processing:

import pandas as pd
import io

temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN   
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN   
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242   
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN   

                 5   6   7   8   9  
0              NaN NaN NaN NaN NaN  
1              NaN NaN NaN NaN NaN  
2  41.2342:-81242] NaN NaN NaN NaN  
3              NaN NaN NaN NaN NaN

#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None

#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)

#subset of desired columns
print df[['Item','Date','Time','Location']]
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]

answered Sep 29 '22 11:09

jezrael

Related questions
                            
                                Transform and remap an equirectangular image with a 90° roll
                            
                                comparing a string 1-d numpy array elementwise
                            
                                matplotlib - pandas - No xlabel and xticks for twinx axes in subploted figures
                            
                                a bytes-like object is required, not 'str' - error
                            
                                Get location specific tweets from Twitter using Tweepy
                            
                                Python unittest branch coverage seems to miss executed generator in zip
                            
                                SQLAlachmey: ORM filter to match all items in a list, not any
                            
                                How to dereference HDF5 references in Python?
                            
                                Display smart pointers in eclipse cdt using gdb pretty printers
                            
                                How can I instal cx_Oracle package to Anaconda 3 to use with python 3.5
                            
                                Recommended way to represent my data with numpy
                            
                                Getting an empty list as attribute when parsing XML with xml.etree.ElementTree
                            
                                Can someone explain this recursive for me?
                            
                                Python multiprocessing pool never finishes
                            
                                PyQT5 OpenGL 4.1 Core profile - invalid frame buffer operation - Mac OS
                            
                                Does Celery Automatically Monkey Patch when running eventlet pool?
                            
                                Why check if cls is the class in __subclasshook__?
                            
                                Wrapping a java function in pyspark
                            
                                Live Video Encoding and Streaming on a Webpage
                            
                                Dynamic settings.py

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to use square brackets as a quote character in Pandas.read_csv

Tags:

python

pandas

csv