Suppose I have a CSV file with 400 columns. I cannot load the entire file into a DataFrame (it won't fit in memory). However, I only really want 50 columns, and those will fit in memory. I don't see any built-in pandas way to do this. What do you suggest? I'm open to using the PyTables interface, or pandas.io.sql.
The best-case scenario would be a function like pandas.read_csv(...., columns=['name', 'age', ..., 'income']), i.e. we pass a list of column names (or numbers) that will be loaded.
You can create a new DataFrame with a specific column by using the DataFrame.assign() method. The assign() method assigns new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones.
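A minimal sketch of assign(); the DataFrame contents and the new column name are made up for illustration:

    import pandas as pd

    # Toy data; names and values are hypothetical.
    df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [34, 29]})

    # assign() returns a new DataFrame (a copy) with the extra
    # column added; the original df is left unchanged.
    df2 = df.assign(age_in_months=df["age"] * 12)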
Use pandas.read_csv() to read specific columns from a CSV file: call pd.read_csv(file_name, usecols=cols_list), with file_name as the name of the CSV file and cols_list as the list of columns to read (pass delimiter as well if the file is not comma-separated).
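For example, a short sketch (the file name people.csv and the column list are hypothetical):

    import pandas as pd

    cols_list = ["name", "age", "income"]

    # Only the listed columns are parsed; the other columns in the
    # file are skipped entirely, which keeps memory usage down.
    df = pd.read_csv("people.csv", usecols=cols_list)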
Step 1) To read data from a CSV file with the standard library, use the csv.reader() function to create a reader object. The reader takes each row of the file and turns it into a list of column values; from each row you then pick out the column you want the data for.
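A sketch of that approach with the standard csv module (the file name and column name are hypothetical):

    import csv

    with open("people.csv", newline="") as f:
        reader = csv.reader(f)        # each row becomes a list of column values
        header = next(reader)         # the first row holds the column names
        idx = header.index("income")  # position of the wanted column
        income = [row[idx] for row in reader]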
Ian, I implemented a usecols option which does exactly what you describe. It will be in the upcoming pandas 0.10; a development version will be available soon.
Since 0.10, you can use usecols like
df = pd.read_csv(...., usecols=['name', 'age',..., 'income'])
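As a runnable sketch (wide.csv and the column choices are hypothetical), usecols accepts either column labels or zero-based positions, which matches the "names (or numbers)" wish in the question:

    import pandas as pd

    # Select by label...
    df = pd.read_csv("wide.csv", usecols=["name", "age", "income"])

    # ...or by zero-based column position.
    df = pd.read_csv("wide.csv", usecols=[0, 3, 42])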