
Setting column types while reading csv with pandas

I am trying to read a CSV file into a pandas DataFrame, specifying the column types as follows:

dp = pd.read_csv('products.csv', header = 0,  dtype = {'name': str,'review': str,
                                                      'rating': int,'word_count': dict}, engine = 'c')
print dp.shape
for col in dp.columns:
    print 'column', col,':', type(col[0])
print type(dp['rating'][0])
dp.head(3)

This is the output:

(183531, 4)
column name : <type 'str'>
column review : <type 'str'>
column rating : <type 'str'>
column word_count : <type 'str'>
<type 'numpy.int64'>

I can sort of understand that pandas might find it difficult to convert a string representation of a dictionary into a dictionary. But how can the content of the "rating" column be both str and numpy.int64?

By the way, tweaks like not specifying an engine or header do not change anything.

Thanks and regards

user2738815 asked Mar 24 '16 07:03


1 Answer

In your loop you are doing:

for col in dp.columns:
    print 'column', col,':', type(col[0])

and you see str everywhere because col is the column name, which is a string, so col[0] is just the first character of that name, which is also a string.

For example, if you run this loop:

for col in dp.columns:
    print 'column', col,':', col[0]

you will see the first letter of each column name printed out. That is all col[0] is.

Your loop iterates over the column names only, not over the data in each column.

To check the type of each column's data (rather than its header), index the DataFrame with the column name and then take the first value:

for col in dp.columns:
    print 'column', col,':', type(dp[col][0])

This is the same lookup you already used when printing type(dp['rating'][0]) separately, which is why that line reported numpy.int64.
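As a rough sketch of the same check in Python 3 syntax, the snippet below reads a small in-memory CSV instead of products.csv. The two sample rows are made up for illustration, since the real file contents are not shown. It also assumes that each word_count cell is a Python dict literal and parses it with ast.literal_eval through read_csv's converters parameter; this is one possible way to get dictionaries, not something confirmed by the question.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for products.csv (the real data is not shown).
csv_text = """name,review,rating,word_count
Widget,works well,5,"{'works': 1, 'well': 1}"
Gadget,too loud,2,"{'too': 1, 'loud': 1}"
"""

dp = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    dtype={'name': str, 'review': str, 'rating': int},
    # Assumption: every word_count cell is a valid Python dict literal.
    converters={'word_count': ast.literal_eval},
)

# Column-level dtypes, one entry per column.
print(dp.dtypes)

# Type of the first value in each column (the data, not the header).
for col in dp.columns:
    print('column', col, ':', type(dp[col].iloc[0]))
```

Using .iloc[0] picks the first row by position, so the check still works if the DataFrame does not have the default 0-based index.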

Colonel Beauvel answered Oct 20 '22 12:10