
Set data type for specific column when using read_csv from pandas

Tags: python, pandas

I have a large CSV file (~10 GB) with around 4000 columns. I know that most of the data will fit in int8, so I set:

pandas.read_csv('file.dat', sep=',', engine='c', header=None, 
                na_filter=False, dtype=np.int8, low_memory=False)

The problem is that the final column (the 4000th) is int32. Is there a way to tell read_csv to use int8 by default and int32 for the 4000th column?

Thank you

Xitrum asked Jun 01 '18 11:06




1 Answer

If you are certain of the number of columns, you could build the dtype dictionary like this:

dtype = dict(zip(range(4000),['int8' for _ in range(3999)] + ['int32']))

This is based on the following small example, which works:

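import io

import pandas as pd

data = '''\
1,2,3
4,5,6'''

# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
fileobj = io.StringIO(data)
# With header=None the columns are labelled 0, 1, 2, so the dtype dict uses those integers as keys
df = pd.read_csv(fileobj, dtype={0: 'int8', 1: 'int8', 2: 'int32'}, header=None)

print(df.dtypes)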

Returns:

0     int8
1     int8
2    int32
dtype: object
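
The same approach scales to the file described in the question. What follows is only a minimal sketch, assuming the file has no header row and that the last column is the only int32 one. The helper name build_dtypes is hypothetical, and a 4-column in-memory sample stands in for the real 4000-column file:

```python
import io

import pandas as pd


def build_dtypes(n_cols, default="int8", last="int32"):
    """Map every column position to `default`, except the final one.

    Hypothetical helper for illustration; not part of pandas.
    """
    dtypes = {i: default for i in range(n_cols)}
    dtypes[n_cols - 1] = last
    return dtypes


# Small stand-in for the real file: three int8 columns, then one int32 column.
sample = "1,2,3,100000\n4,5,6,200000\n"

dtypes = build_dtypes(4)  # for the real file this would be build_dtypes(4000)
df = pd.read_csv(io.StringIO(sample), header=None, dtype=dtypes, na_filter=False)

print(df.dtypes)
```

For the real file, the io.StringIO object would be replaced by the file path, and the other read_csv arguments from the question (sep, engine, low_memory) can be kept as they are.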

From the docs:

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
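
As a small illustration of the "use str to preserve" part of that passage, here is a sketch with made-up sample data showing a column read as str keeping its leading zeros while another column is still parsed as int8:

```python
import io

import pandas as pd

# Made-up sample: the first column has leading zeros that integer parsing would drop.
sample = "007,1\n042,2\n"

# Column 0 is kept as text; column 1 is parsed as int8.
df = pd.read_csv(io.StringIO(sample), header=None, dtype={0: str, 1: "int8"})

print(df[0].tolist())
print(df.dtypes)
```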

Anton vBR answered Nov 08 '22 14:11