Finding bogus data in a pandas dataframe read with read_fwf()

Tags:

I'm trying to analyse the weather records for New York, using the daily data taken from here: http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html

I'm loading the data with:

tf = pandas.read_fwf(io.open('state30_NY.txt'), widths=widths, names=names, na_values=['-9999'])

Where:

>>> widths
[6, 4, 2, 4, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1]
>>> names
['COOP', 'YEAR', 'MONTH', 'ELEMENT', 'VALUE1', 'MFLAG1', 'QFLAG1', 'SFLAG1', 'VALUE2', 'MFLAG2', 'QFLAG2', 'SFLAG2', 'VALUE3', 'MFLAG3', 'QFLAG3', 'SFLAG3', 'VALUE4', 'MFLAG4', 'QFLAG4', 'SFLAG4', 'VALUE5', 'MFLAG5', 'QFLAG5', 'SFLAG5', 'VALUE6', 'MFLAG6', 'QFLAG6', 'SFLAG6', 'VALUE7', 'MFLAG7', 'QFLAG7', 'SFLAG7', 'VALUE8', 'MFLAG8', 'QFLAG8', 'SFLAG8', 'VALUE9', 'MFLAG9', 'QFLAG9', 'SFLAG9', 'VALUE10', 'MFLAG10', 'QFLAG10', 'SFLAG10', 'VALUE11', 'MFLAG11', 'QFLAG11', 'SFLAG11', 'VALUE12', 'MFLAG12', 'QFLAG12', 'SFLAG12', 'VALUE13', 'MFLAG13', 'QFLAG13', 'SFLAG13', 'VALUE14', 'MFLAG14', 'QFLAG14', 'SFLAG14', 'VALUE15', 'MFLAG15', 'QFLAG15', 'SFLAG15', 'VALUE16', 'MFLAG16', 'QFLAG16', 'SFLAG16', 'VALUE17', 'MFLAG17', 'QFLAG17', 'SFLAG17', 'VALUE18', 'MFLAG18', 'QFLAG18', 'SFLAG18', 'VALUE19', 'MFLAG19', 'QFLAG19', 'SFLAG19', 'VALUE20', 'MFLAG20', 'QFLAG20', 'SFLAG20', 'VALUE21', 'MFLAG21', 'QFLAG21', 'SFLAG21', 'VALUE22', 'MFLAG22', 'QFLAG22', 'SFLAG22', 'VALUE23', 'MFLAG23', 'QFLAG23', 'SFLAG23', 'VALUE24', 'MFLAG24', 'QFLAG24', 'SFLAG24', 'VALUE25', 'MFLAG25', 'QFLAG25', 'SFLAG25', 'VALUE26', 'MFLAG26', 'QFLAG26', 'SFLAG26', 'VALUE27', 'MFLAG27', 'QFLAG27', 'SFLAG27', 'VALUE28', 'MFLAG28', 'QFLAG28', 'SFLAG28', 'VALUE29', 'MFLAG29', 'QFLAG29', 'SFLAG29', 'VALUE30', 'MFLAG30', 'QFLAG30', 'SFLAG30', 'VALUE31', 'MFLAG31', 'QFLAG31', 'SFLAG31']

Now, the issue I have is that when reading in the data, there seem to be a lot of inf values, and those shouldn't be in the source data (the nearest thing in the data are -9999 values, which represent invalid data).

Normally, if I were using lists or the like, I would print out the whole thing to find alignment errors, and work out which rows are affected, then look at the source file to see what's happening. I'd like to know how to do the equivalent in pandas so I can figure out where these inf values are coming from.

Here's the code which shows me infs:

>>> tf[tf['ELEMENT']=='TMIN'].min()
COOP       300023
YEAR         1876
MONTH           1
ELEMENT      TMIN
VALUE1        -38
MFLAG1        inf
QFLAG1        inf
SFLAG1        inf
VALUE2        -34
MFLAG2        inf
QFLAG2        inf
SFLAG2        inf
VALUE3        -38
MFLAG3        inf
QFLAG3        inf
...
MFLAG28    inf
QFLAG28    inf
SFLAG28    inf
VALUE29    -46
MFLAG29    inf
QFLAG29    inf
SFLAG29    inf
VALUE30    -57
MFLAG30    inf
QFLAG30    inf
SFLAG30    inf
VALUE31    -40
MFLAG31    inf
QFLAG31    inf
SFLAG31    inf
Length: 128, dtype: object

Edit: Corrected column widths. Problem still remains.

518

asked Jan 05 '14 15:01

Marcin

1 Answers

First, let's mock up some data:

import numpy as np
import pandas

df = pandas.DataFrame(
    np.random.normal(size=(5,5)), 
    index='rA,rB,rC,rD,rE'.split(','),
    columns='cA,cB,cC,cD,cE'.split(',')
)
df[df > 1] = np.inf
df

That, for examples, should give something like this:

          cA        cB        cC        cD        cE
rA -1.202383 -0.625521       inf -0.888086 -0.215671
rB  0.537521 -1.149731  0.841687  0.190505       inf
rC -1.447124 -0.607486 -1.268923       inf  0.438190
rD -0.275085  0.793483  0.276376 -0.095727 -0.050957
rE -0.095414  0.048926  0.591899  0.298865 -0.308620

So now I can use fancy indexing to isolate all the infs.

print(df[np.isinf(df)].to_string())

    cA  cB   cC   cD   cE
rA NaN NaN  inf  NaN  NaN
rB NaN NaN  NaN  NaN  inf
rC NaN NaN  NaN  inf  NaN
rD NaN NaN  NaN  NaN  NaN
rE NaN NaN  NaN  NaN  NaN

But that's not really useful. So on top of finding the infs, we should stack the column index into the rows (unpivot, if you will) then drop all the NaN values. This will give us a nice summary of the rows/columns with infs.

df[np.isinf(df)].stack().dropna()

rA  cC    inf
rB  cE    inf
rC  cD    inf
dtype: float64

178

answered Oct 21 '22 11:10

Paul H

Related questions
                            
                                Using scipy's kmeans2 function in python
                            
                                How to to determine the number of ways a number can be broken down into sums of smaller numbers
                            
                                Pytest on Python Tools for visual studio
                            
                                Indexing a nested list in python
                            
                                Python inheritance - how to inherit class function?
                            
                                How to provide periodic table information to Python module
                            
                                Most pythonic way to import all objects in a module as their name in the module
                            
                                Qt horizontalSlider send float values
                            
                                Missing 1 required positional argument - Why?
                            
                                Difference between decode and unicode?
                            
                                SQLAlchemy update PostgreSQL array using merge not work
                            
                                Efficiently partition a string at arbitrary index
                            
                                JSON Schema validating ip-address is not working
                            
                                POST data to Python CGI script via jQuery AJAX
                            
                                How To Add An Image To A Tweet With TwitterAPI?
                            
                                How to convert a CP949 RTF to a UTF-8 encoded RTF?
                            
                                Hourly frequency count with Python
                            
                                Using Scipy.optimize method='SLSQP' returns initial guess
                            
                                Converting string file into json format file
                            
                                How to execute SELECT * LIKE statement with a placeholder in sqlite?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Finding bogus data in a pandas dataframe read with read_fwf()

Tags:

python

file-io

pandas

fixed-width

read.fwf

Marcin

People also ask

1 Answers

Paul H

Recent Activity

Donate For Us