I have a large CSV file (~10 GB) with around 4000 columns. I know that most of the data I expect will be int8, so I set:
import numpy as np
import pandas

pandas.read_csv('file.dat', sep=',', engine='c', header=None,
                na_filter=False, dtype=np.int8, low_memory=False)
Thing is, the final column (the 4000th) is int32. Is there a way I can tell read_csv to use int8 by default, but int32 for the 4000th column?
Thank you
If you are certain of the number of columns, you can build the dtype dictionary programmatically:
# Columns 0..3998 -> int8; column 3999 (the 4000th) -> int32.
dtype = dict(zip(range(4000), ['int8'] * 3999 + ['int32']))
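You can then pass that dict straight into the call from the question. A minimal sketch, reusing the question's own arguments ('file.dat', the separator, and so on):

import pandas as pd

# Built as above: int8 for columns 0..3998, int32 for column 3999.
dtype = dict(zip(range(4000), ['int8'] * 3999 + ['int32']))
df = pd.read_csv('file.dat', sep=',', engine='c', header=None,
                 na_filter=False, dtype=dtype, low_memory=False)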
This relies on read_csv accepting a per-column dtype mapping, as this small example shows:
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas

data = '''\
1,2,3
4,5,6'''

fileobj = StringIO(data)
df = pd.read_csv(fileobj, dtype={0: 'int8', 1: 'int8', 2: 'int32'}, header=None)
print(df.dtypes)
Returns:
0 int8
1 int8
2 int32
dtype: object
From the docs:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
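As an aside, newer pandas releases (1.5.0 and later, per the release notes) also accept a collections.defaultdict for dtype, so you can express "int8 by default, int32 for the last column" without spelling out all 4000 entries. A sketch, again reusing the question's hypothetical 'file.dat':

from collections import defaultdict

import pandas as pd

# Unlisted columns fall back to int8; only column 3999 (the 4000th) is overridden.
dtype = defaultdict(lambda: 'int8', {3999: 'int32'})
df = pd.read_csv('file.dat', sep=',', engine='c', header=None,
                 na_filter=False, dtype=dtype, low_memory=False)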