I have a large .txt file with data in bad formats. I would like to remove some rows and convert the rest of the data to float numbers: rows containing 'X' or 'XX' should be removed, and the remaining values converted to float, so that a number like 4;00.1 becomes 4.001.
The file looks like this sample:
0,1,10/09/2012,3:01,4;09.1,5,6,7,8,9,10,11
1,-0.581586,11/09/2012,-1:93,0;20.3,739705,,0.892921,5,,6,7
2,XX,10/09/2012,3:04,4;76.0,0.183095,-0.057214,-0.504856,NaN,0.183095,12
3,-0.256051,10/09/2012,9:65,1;54.9,483293,0.504967,0.074442,-1.716287,7,0.504967,0.504967
4,-0.728092,11/09/2012,0:78,1;53.4,232247,4.556,0.328062,1.382914,NaN,4.556,4
5,4,11/09/2012,NaN,NaN,6.0008,NaN,NaN,NaN,6.000800,6.000000,6.000800
6,X,11/09/2012,X,X,5,X,8,2,1,17.000000,33.000000
7,,11/09/2012,,,,,,6.000000,5.000000,2.000000,2.000000
8,4,11/09/2012,7:98,3;04.5,5,6,3,7.000000,3.000000,3.000000,2
9,6,11/09/2012,2:21,4;67.2,5,2,2,7,3,8.000000,4.000000
I read it into a DataFrame and select rows like this:
import pandas as pd

fileName = '~/data.txt'
colName = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
df = pd.read_csv(fileName, names=colName)
print(df[df['b'].isin(['X', 'XX', None, 'NaN'])].to_string())
The output from the last line gives me only:
>>> print(df[df['b'].isin(['X','XX',None,'NaN'])].to_string())
b c d e f g h i j k l
a
2 XX 10/09/2012 3:04 4;76.0 0.183095 -0.057214 -0.504856 NaN 0.183095 12 NaN
6 X 11/09/2012 X X 5.000000 X 8.000000 2 1.000000 17 33
This does not pick up row 7, and I would like to go through the whole df, not just one column (the original file is very large).
At the moment I use the conversion below, but I need to remove the unwanted rows first so that I can apply it to the whole df.
convert1 = lambda x: x.replace('.', '')          # '4;00.1' -> '4;001'
convert2 = lambda x: float(x.replace(';', '.'))  # '4;001'  -> 4.001
newNumber = convert2(convert1(df['e'][0]))
After selecting those rows I would like to remove them from df. I tried df.pop(), but it works only for columns, not rows. I also tried labelling the rows, with no luck. For this particular .txt I should end up with a new df built from rows [0, 3, 8, 9], with column 'c' in a date format, 'd' in a time format, and the rest as floats. I have been trying to figure this out for quite a while now but do not know where to go next. Is this possible in pandas (it probably should be), or do I need to switch to an ndarray or something else? Thanks for your advice.
To drop a row or column in a DataFrame, use the drop() method (see the pandas documentation for details). By default, rows are labelled with an integer index starting at 0, while columns are labelled with names. To delete rows, pass drop() the index labels. To delete one or more columns, pass the column name(s) and set axis=1; newer versions of pandas also accept a columns parameter, which removes the need for axis. As part of data cleansing, you will often drop rows where a column value matches a static value or the value of another column.
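As a minimal sketch of those options (the toy frame below is made up purely for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

df.drop([0, 2])                   # drop the rows labelled 0 and 2
df.drop('b', axis=1)              # drop column 'b' by name
df.drop(columns=['b'])            # the same, using the newer 'columns' parameter
df.drop(df[df['a'] == 2].index)   # drop rows where a column matches a value

Note that drop() returns a new DataFrame rather than modifying the original, unless you pass inplace=True.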
The problem with your original filter is that it checks for the string 'NaN' rather than numpy.nan, which is what empty fields are parsed as by default.
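A quick illustrative snippet (not from the original post) that shows the difference:

import numpy as np
import pandas as pd

s = pd.Series(['1.5', np.nan, 'NaN'])  # an empty CSV field parses to np.nan, not the string 'NaN'
print(s.isin(['NaN']))   # [False, False, True] -- only the literal string matches
print(s.isnull())        # [False, True, False] -- isnull()/isna() is what detects np.nan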
If you want to filter all the columns so you only get rows where no element is 'X' or 'XX', do something like this:
In [44]: import numpy as np; import pandas as pd; from io import StringIO
In [45]: names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
In [46]: df = pd.read_csv(StringIO(data), header=None, names=names)  # 'data' holds the sample text above
In [47]: mask = df.applymap(lambda x: x in ['X', 'XX', None, np.nan])
In [48]: df[~mask.any(axis=1)]
Out[48]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 9
Data columns:
a 5 non-null values
b 5 non-null values
c 5 non-null values
d 5 non-null values
e 5 non-null values
f 5 non-null values
g 5 non-null values
h 5 non-null values
i 5 non-null values
j 4 non-null values
k 5 non-null values
l 5 non-null values
dtypes: float64(6), int64(1), object(5)
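To finish the conversion the question asks for, here is one possible follow-up sketch (the column names come from the sample above; the extra dropna() step, the day-first date format, and leaving column 'd' as a string are all assumptions):

import pandas as pd

# keep the surviving rows; dropna() additionally removes rows where a field
# was empty and therefore parsed as NaN
clean = df[~mask.any(axis=1)].dropna().copy()

# '4;09.1' -> '4;091' -> '4.091' -> 4.091, mirroring the question's two lambdas
to_float = lambda x: float(str(x).replace('.', '').replace(';', '.'))
clean['e'] = clean['e'].map(to_float)

# parse 'c' as a date (day-first format assumed from the sample)
clean['c'] = pd.to_datetime(clean['c'], format='%d/%m/%Y')

# values in 'd' such as '9:65' are not valid clock times, so 'd' is left as a
# string here; the remaining columns become plain floats
for col in ['b', 'f', 'g', 'h', 'i', 'j', 'k', 'l']:
    clean[col] = clean[col].astype(float)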