Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use square brackets as a quote character in Pandas.read_csv

Tags:

python

pandas

csv

Let's say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!

Update

There were three primary solutions that were offered: 1) Give a long range of names to the data frame to allow all data to be read in and then post-process the data, 2) Find values in square brackets and put quotes around it, or 3) replace the first n number of commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds. The tests were run on a text file containing 18 columns and more than 208,000 rows.

like image 442
tblznbits Avatar asked Mar 11 '16 19:03

tblznbits


People also ask

Why does pandas use square brackets?

The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.

What is delimiter in pandas read_csv?

read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, ....) It reads the content of a csv file at given path, then loads the content to a Dataframe and returns that. It uses comma (,) as default delimiter or separator while parsing a file.

What are square brackets used for in Python?

The indexing operator (Python uses square brackets to enclose the index) selects a single character from a string. The characters are accessed by their position or index value.

What output type does pandas read_csv () return?

Read a CSV File In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data. csv , which you specified with the first argument.


Video Answer


2 Answers

We can use simple trick - quote balanced square brackets with double quotes:

import re
import six
import pandas as pd


data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))

pd.set_option('display.expand_frame_repr', False)
print(df)

Output:

----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]

UPDATE: if you are sure that all square brackets are balances, we don't have to use RegEx's:

import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)
like image 60
MaxU - stop WAR against UA Avatar answered Sep 29 '22 11:09

MaxU - stop WAR against UA


I think you can replace first 3 occurence of , in each line of file to ; and then use parameter sep=";" in read_csv:

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()
    fo = io.StringIO()
    fo.writelines(u"" + line.replace(',',';', 3) for line in lines)
    fo.seek(0)    

df = pd.read_csv(fo, sep=';')
print df
   Item        Date   Time                            Location
0     1  01/01/2016  13:41                 [45.2344:-78.25453]
1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2     3  01/10/2016  01:27                 [51.2344:-86.24432]

Or can try this complicated approach, because main problem is, separator , between values in lists is same as separator of other column values.

So you need post - processing:

import pandas as pd
import io

temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN   
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN   
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242   
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN   

                 5   6   7   8   9  
0              NaN NaN NaN NaN NaN  
1              NaN NaN NaN NaN NaN  
2  41.2342:-81242] NaN NaN NaN NaN  
3              NaN NaN NaN NaN NaN  
#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None

#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)

#subset of desired columns
print df[['Item','Date','Time','Location']]
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]
like image 31
jezrael Avatar answered Sep 29 '22 11:09

jezrael