
Removing particular rows from a DataFrame in Python pandas

Tags:

python

pandas

I have a large .txt file with badly formatted data. I would like to remove some rows and convert the rest of the data to floats. Rows containing 'X' or 'XX' should be removed; the remaining values should be converted to float, so that a number like 4;00.1 becomes 4.001. The file looks like this sample:

0,1,10/09/2012,3:01,4;09.1,5,6,7,8,9,10,11
1,-0.581586,11/09/2012,-1:93,0;20.3,739705,,0.892921,5,,6,7
2,XX,10/09/2012,3:04,4;76.0,0.183095,-0.057214,-0.504856,NaN,0.183095,12
3,-0.256051,10/09/2012,9:65,1;54.9,483293,0.504967,0.074442,-1.716287,7,0.504967,0.504967
4,-0.728092,11/09/2012,0:78,1;53.4,232247,4.556,0.328062,1.382914,NaN,4.556,4
5,4,11/09/2012,NaN,NaN,6.0008,NaN,NaN,NaN,6.000800,6.000000,6.000800
6,X,11/09/2012,X,X,5,X,8,2,1,17.000000,33.000000
7,,11/09/2012,,,,,,6.000000,5.000000,2.000000,2.000000
8,4,11/09/2012,7:98,3;04.5,5,6,3,7.000000,3.000000,3.000000,2
9,6,11/09/2012,2:21,4;67.2,5,2,2,7,3,8.000000,4.000000

I read it into a DataFrame and select rows:

from pandas import *
from csv import *
fileName = '~/data.txt'
colName = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
df = DataFrame(read_csv(fileName, names=colName))
print df[df['b'].isin(['X','XX',None,'NaN'])].to_string()

The last line outputs only:

>>> print df[df['b'].isin(['X','XX',None,'NaN'])].to_string()
    b           c     d       e         f          g         h   i         j   k   l
a                                                                                   
2  XX  10/09/2012  3:04  4;76.0  0.183095  -0.057214 -0.504856 NaN  0.183095  12 NaN
6   X  11/09/2012     X       X  5.000000          X  8.000000   2  1.000000  17  33

It does not pick up row 7, and I would like to check the whole df, not just one column (the original file is very large).

At the moment I do the conversion as shown below, but I need to remove the unwanted rows first before I can apply it to the whole df.

convert1 = lambda x : x.replace('.', '')
convert2 = lambda x : float(x.replace(';', '.'))
newNumber = convert2(convert1(df['e'][0])) 

After selecting the rows I would like to remove them from df. I tried df.pop(), but it works only for columns, not for rows. I also tried naming the rows, with no luck. For this particular .txt I should end up with a new df made of rows [0,3,8,9], with column 'c' as a date, 'd' as a time and the rest as floats. I have been trying to figure this out for quite a while and do not know where to go next. Is it possible in pandas (it probably should be), or do I need to switch to an ndarray or something else? Thanks for your advice.

tomasz74 asked Sep 22 '12 22:09



1 Answer

The problem with your original filter is that it checks for the string 'NaN' rather than numpy.nan, which is what empty fields are parsed as by default. If you want to filter across all the columns so that you only keep rows where no element is 'X' or 'XX', do something like this:

In [45]: names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']

In [46]: df = pd.read_csv(StringIO(data), header=None, names=names)

In [47]: mask = df.applymap(lambda x: x in ['X', 'XX', None, np.nan])

In [48]: df[~mask.any(axis=1)]
Out[48]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 9
Data columns:
a    5  non-null values
b    5  non-null values
c    5  non-null values
d    5  non-null values
e    5  non-null values
f    5  non-null values
g    5  non-null values
h    5  non-null values
i    5  non-null values
j    4  non-null values
k    5  non-null values
l    5  non-null values
dtypes: float64(6), int64(1), object(5)
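
Note that column j in the output above still holds a missing value: `x in [..., np.nan]` does not reliably match NaN, because NaN never compares equal to itself. The following is a minimal sketch, not the answerer's code, assuming a recent pandas and Python 3. It combines `isin` for the 'X'/'XX' markers with `isna` for missing values, and uses the sample data from the question; the names `mask` and `clean` are only illustrative.

```python
from io import StringIO

import pandas as pd

# The ten sample rows from the question, verbatim.
data = """0,1,10/09/2012,3:01,4;09.1,5,6,7,8,9,10,11
1,-0.581586,11/09/2012,-1:93,0;20.3,739705,,0.892921,5,,6,7
2,XX,10/09/2012,3:04,4;76.0,0.183095,-0.057214,-0.504856,NaN,0.183095,12
3,-0.256051,10/09/2012,9:65,1;54.9,483293,0.504967,0.074442,-1.716287,7,0.504967,0.504967
4,-0.728092,11/09/2012,0:78,1;53.4,232247,4.556,0.328062,1.382914,NaN,4.556,4
5,4,11/09/2012,NaN,NaN,6.0008,NaN,NaN,NaN,6.000800,6.000000,6.000800
6,X,11/09/2012,X,X,5,X,8,2,1,17.000000,33.000000
7,,11/09/2012,,,,,,6.000000,5.000000,2.000000,2.000000
8,4,11/09/2012,7:98,3;04.5,5,6,3,7.000000,3.000000,3.000000,2
9,6,11/09/2012,2:21,4;67.2,5,2,2,7,3,8.000000,4.000000
"""

names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
df = pd.read_csv(StringIO(data), header=None, names=names)

# True wherever a cell is an 'X'/'XX' marker or a missing value.
# Empty fields and the literal text NaN are both read as missing by default.
mask = df.isin(['X', 'XX']) | df.isna()

# Keep only the rows in which no cell was flagged.
clean = df[~mask.any(axis=1)]
print(clean.index.tolist())
```

On this sample the remaining rows are the ones the question asks for. An alternative with the same effect would be to pass `na_values=['X', 'XX']` to `read_csv` and then call `dropna()`.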
Chang She answered Nov 15 '22 08:11
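
The question also asks for values such as 4;09.1 to become floats and for column 'c' to become a date, which the answer above does not cover. What follows is a hedged sketch of one way to do that once the unwanted rows are gone, assuming a recent pandas. The small frame `clean` is rebuilt by hand from rows 0, 3, 8 and 9 of the sample (columns b to f only) so the block runs on its own. Reading the dates as day-first is an assumption, and column 'd' is left as text because values like 9:65 are not valid clock times.

```python
import pandas as pd

# Rows 0, 3, 8 and 9 of the question's sample, columns b..f only (illustrative).
clean = pd.DataFrame(
    {
        "b": ["1", "-0.256051", "4", "6"],
        "c": ["10/09/2012", "10/09/2012", "11/09/2012", "11/09/2012"],
        "d": ["3:01", "9:65", "7:98", "2:21"],
        "e": ["4;09.1", "1;54.9", "3;04.5", "4;67.2"],
        "f": [5, 483293, 5, 5],
    },
    index=[0, 3, 8, 9],
)

# Same rule as the question's two lambdas, applied to the whole column:
# drop the '.', then turn ';' into the decimal point ("4;09.1" -> "4.091").
clean["e"] = (
    clean["e"]
    .str.replace(".", "", regex=False)
    .str.replace(";", ".", regex=False)
    .astype(float)
)

# Assumption: dates are day/month/year; the sample alone does not settle this.
clean["c"] = pd.to_datetime(clean["c"], format="%d/%m/%Y")

# Every column except the date and the text column 'd' becomes float.
other = clean.columns.difference(["c", "d"])
clean = clean.astype({col: float for col in other})

print(clean.dtypes)
```

Passing `regex=False` matters here: with regular-expression matching a bare `.` would match every character rather than only the literal dot.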