I am new to pandas and still amazed by what it can do, although sometimes also by how things are done ;-)
I managed to write a little script which will report on the number of missing values encountered in a timeseries, either in each month or in each year of the series. Below is the code which uses some dummy data for demonstration.
If I print the returned result (print cnty or print cntm), everything looks fine, except that I would like to format the datetime value of the index according to the resolution of my data, i.e. I would like to have 2000 1000 10 15 instead of 2000-12-31 1000 10 15 for the annual output, and 2000-01 744 10 15 for the monthly output. Is there an easy way to do this in pandas, or do I have to go through some loops and convert things into "plain" Python before printing? Note: I do not know in advance how many data columns I have, so anything with fixed format strings per row wouldn't work for me.
import numpy as np
import pandas as pd
import datetime as dt

def make_data():
    """Make up some bogus data where we know the number of missing values"""
    time = np.array([dt.datetime(2000,1,1)+dt.timedelta(hours=i)
                     for i in range(1000)])
    wd = np.arange(0.,1000.,1.)
    ws = wd*0.2
    wd[[2,3,4,8,9,22,25,33,99,324]] = -99.9   # 10 missing values
    ws[[2,3,4,10,11,12,565,644,645,646,647,648,666,667,669]] =-99.9 # 15 missing values
    data = np.array(zip(time,wd,ws), dtype=[('time', dt.datetime),
                                            ('wd', 'f4'), ('ws', 'f4')])
    return data

def count_miss(data):
    time = data['time']
    dff = pd.DataFrame(data, index=time)
    # two options for setting missing values:
    # 1) replace everything less or equal -99
    for c in dff.columns:
        ser = pd.Series(dff[c])
        ser[ser <= -99.] = np.nan
        dff[c] = ser
    # 2) alternative: if you know the exact value to be replaced
    #    you can use the DataFrame replace method:
    ## dff.replace(-99.9, np.nan, inplace=True)

    # add the time variable as data column
    dff['time'] = time
    # count missing values
    # the print expressions will print date labels and the total number of values
    # in the time column plus the number of missing values for all other columns
    # annually:
    cnty = dff.resample('A', how='count', closed='right', label='right')
    for c in cnty.columns:
        if c != 'time':
            cnty[c] = cnty['time']-cnty[c]
    # monthly:
    cntm = dff.resample('M', how='count', closed='right', label='right')
    for c in cntm.columns:
        if c != 'time':
            cntm[c] = cntm['time']-cntm[c]
    return cnty, cntm

if __name__ == "__main__":
    data = make_data()
    cnty, cntm = count_miss(data)
Final note: there is a format method on DatetimeIndex, but unfortunately no explanation of how to use it.
The format method of DatetimeIndex performs similarly to the strftime of a datetime.datetime object.
What that means is that you can use the format strings found here: http://www.tutorialspoint.com/python/time_strftime.htm
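As a small illustration of those format strings, the sketch below applies the same directives to a plain datetime.datetime object and to a pandas index through its strftime method. The sample dates are made up for the demonstration and are not taken from the question's data.

```python
import datetime
import pandas as pd

# A plain datetime object formatted with strftime directives
stamp = datetime.datetime(2000, 12, 31)
year_label = stamp.strftime('%Y')        # year only
month_label = stamp.strftime('%Y-%m')    # year and month

# The same directives applied element-wise to a DatetimeIndex
# (illustrative dates, chosen only for this example)
idx = pd.to_datetime(['2000-12-31', '2000-01-31'])
index_years = list(idx.strftime('%Y'))
index_months = list(idx.strftime('%Y-%m'))

print(year_label, month_label)
print(index_years, index_months)
```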
The trick is that you have to pass a function as the formatter kwarg of the format method. That looks like this (just as an example, somewhat unrelated to your code):
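import pandas
# date_range builds the index; DatetimeIndex(start=..., periods=...) only works in old pandas versions
dt = pandas.date_range(start='2014-02-01', periods=10, freq='10min')
dt.format(formatter=lambda x: x.strftime('%Y %m %d %H:%M.%S'))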
Output:
['2014 02 01 00:00.00',
'2014 02 01 00:10.00',
'2014 02 01 00:20.00',
'2014 02 01 00:30.00',
'2014 02 01 00:40.00',
'2014 02 01 00:50.00',
'2014 02 01 01:00.00',
'2014 02 01 01:10.00',
'2014 02 01 01:20.00',
'2014 02 01 01:30.00']
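To connect this back to the question, here is a minimal sketch of one way to get year-only and year-month labels for any number of data columns. It is not the asker's code: it assumes missing values are already NaN, it groups on string labels produced by the index's strftime method instead of resampling, and the helper name count_missing and the column name total are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's dummy data: 1000 hourly values from 2000-01-01,
# with missing values stored as NaN rather than -99.9 (an assumption)
index = pd.Timestamp('2000-01-01') + pd.to_timedelta(np.arange(1000), unit='h')
wd = np.arange(1000, dtype=float)
ws = wd * 0.2
wd[[2, 3, 4, 8, 9, 22, 25, 33, 99, 324]] = np.nan            # 10 missing
ws[[2, 3, 4, 10, 11, 12, 565, 644, 645,
    646, 647, 648, 666, 667, 669]] = np.nan                  # 15 missing
df = pd.DataFrame({'wd': wd, 'ws': ws}, index=index)

def count_missing(frame, fmt):
    """Hypothetical helper: count rows and NaNs per strftime label."""
    labels = frame.index.strftime(fmt)           # e.g. '2000' or '2000-01'
    out = frame.isna().groupby(labels).sum()     # NaN count per column
    out.insert(0, 'total', frame.groupby(labels).size())  # rows per label
    return out

yearly = count_missing(df, '%Y')
monthly = count_missing(df, '%Y-%m')
print(yearly)
print(monthly)
```

Because the labels are plain strings, the printed index shows only the year or the year and month, and the helper never needs to know the column names in advance.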