
Select specific CSV columns (Filtering) - Python/pandas

Tags: python, pandas, csv

I have a very large CSV file with 100 columns. To illustrate my problem, I will use a very basic example.

Let's suppose that we have a CSV file.

in  value   d     f
0    975   f01    5
1    976   F      4
2    977   d4     1
3    978   B6     0
4    979   2C     0

I want to select specific columns.

import pandas
data = pandas.read_csv("ThisFile.csv")

In order to select the first 2 columns I used

data.ix[:,:2]

What should I do to select non-adjacent columns, such as the 2nd and the 4th?

Another way to solve this would be to rewrite the CSV file, but it is a huge file, so I would like to avoid that.

user3378649 asked Mar 14 '14 01:03




2 Answers

This selects the second and fourth columns (since Python uses 0-based indexing):

In [272]: df.iloc[:, [1, 3]]
Out[272]: 
   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0

[5 rows x 2 columns]

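df.ix could select by location or by label, whereas df.iloc always selects by location. When indexing by location, use df.iloc to signal your intention explicitly. It is also a bit faster, since pandas does not have to check whether your index uses labels. Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so current code should use .iloc (by position) or .loc (by label).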
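To make the position-versus-label distinction concrete, here is a minimal sketch. It assumes a comma-separated version of the sample data from the question (the `csv_text` string is made up for illustration and stands in for the real file) and compares `.iloc` with its label-based counterpart `.loc`.

```python
import io

import pandas as pd

# Hypothetical comma-separated copy of the sample data from the question.
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Select the second and fourth columns by integer position (0-based).
by_position = df.iloc[:, [1, 3]]

# Select the same columns by their labels.
by_label = df.loc[:, ["value", "f"]]

print(by_position)
print(by_position.equals(by_label))
```

Both selections should yield the same two-column DataFrame; which one to prefer depends on whether the column positions or the column names are the more stable part of the file.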


Another possibility is to use the usecols parameter:

data = pandas.read_csv("ThisFile.csv", usecols=[1,3])

This will load only the second and fourth columns into the data DataFrame.
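As a rough sketch of the usecols approach, the snippet below reads an in-memory stand-in for the file (the `csv_text` string is an assumption made for illustration, not the asker's real data) and shows that usecols accepts either integer positions or column names.

```python
import io

import pandas as pd

# Hypothetical comma-separated copy of the sample data from the question.
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

# Keep only the second and fourth columns, chosen by 0-based position.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 3])

# Keep the same columns, chosen by header name instead.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["value", "f"])

print(list(by_name.columns))
print(by_position.equals(by_name))
```

Because the unwanted columns are skipped while parsing, this tends to suit the large-file case in the question. The resulting columns keep their order from the file, regardless of the order given in usecols.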

unutbu answered Sep 24 '22 05:09


If you would rather select columns by name, you can use

data[['value','f']]

   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0
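
One detail worth illustrating is the double bracket: a list of names returns a DataFrame, while a single name returns a Series. The sketch below assumes a comma-separated stand-in for the file (the `csv_text` string is hypothetical).

```python
import io

import pandas as pd

# Hypothetical comma-separated copy of the sample data from the question.
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

data = pd.read_csv(io.StringIO(csv_text))

# A list of names inside the brackets selects several columns as a DataFrame.
subset = data[["value", "f"]]

# A single name selects one column as a Series.
single = data["value"]

print(type(subset).__name__, type(single).__name__)
```

A name that does not exist in the DataFrame raises a KeyError, so it can help to check data.columns first when the headers are uncertain.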
Wai Yip Tung answered Sep 22 '22 05:09