I'm new to pandas and I'm trying to read a strangely formatted file into a DataFrame. The original file looks like this:
; No Time Date MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 11:38:17 11.07.2012 11.37 48.20 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89.0 89.0 89.0 88.0
2 11:38:18 11.07.2012 11.44 48.20 5.13 88.88 2 346.22 12.08 11.83 -1.00 -1.00 89.0 89.0 -1.0 -1.0
3 11:38:19 11.07.2012 11.10 48.20 4.96 89.00 3 337.84 11.83 11.59 10.62 -1.00 89.0 89.0 89.0 -1.0
4 11:38:19 11.07.2012 11.82 48.20 5.54 88.60 3 355.92 11.10 13.54 12.32 -1.00 89.0 88.0 88.0 -1.0
I managed to get an equally structured DataFrame with:
In [42]: date_spec = {'FetchTime': [1, 2]}
In [43]: df = pd.read_csv('MeasureCK32450-20120711114050.mck', header=7, sep='\s\s+',
parse_dates=date_spec, na_values=['-1.0', '-1.00'])
In [44]: df
Out[52]:
FetchTime ; No MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
0 2012-11-07 11:38:17 1 11.37 48.2 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89 89 89 88
1 2012-11-07 11:38:18 2 11.44 48.2 5.13 88.88 2 346.22 12.08 11.83 NaN NaN 89 89 NaN NaN
2 2012-11-07 11:38:19 3 11.10 48.2 4.96 89.00 3 337.84 11.83 11.59 10.62 NaN 89 89 89 NaN
3 2012-11-07 11:38:19 4 11.82 48.2 5.54 88.60 3 355.92 11.10 13.54 12.32 NaN 89 88 88 NaN
But now I have to expand each line of this DataFrame
.... Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 .... 11.84 11.35 11.59 15.25 89 89 89 88
2 .... 12.08 11.83 NaN NaN 89 89 NaN NaN
into four lines (with three indexes No, FetchTime, and MeasureNo):
.... Moist TDR
No FetchTime MeasureNo
0 2012-11-07 11:38:17 1 .... 11.84 89 # from line 1, Moist1 and TDR1
1 2 .... 11.35 89 # from line 1, Moist2 and TDR2
2 3 .... 11.59 89 # from line 1, Moist3 and TDR3
3 4 .... 15.25 88 # from line 1, Moist4 and TDR4
4 2012-11-07 11:38:18 1 .... 12.08 89 # from line 2, Moist1 and TDR1
5 2 .... 11.83 89 # from line 2, Moist2 and TDR2
6 3 .... NaN NaN # from line 2, Moist3 and TDR3
7 4 .... NaN NaN # from line 2, Moist4 and TDR4
while preserving the other columns and, MOST importantly, preserving the order of the entries. I know I can iterate through each line with for row in df.iterrows(): ... but I read that this is not very fast. My first approach was this:
In [54]: data = []
In [55]: for d in range(1,5):
....: temp = df.ix[:, ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'Moist%d' % d, 'TDR%d' % d]]
....: temp.columns = ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'RawMoist', 'RawTDR']
....: temp['MeasureNo'] = d
....: data.append(temp)
....:
In [56]: test = pd.concat(data, ignore_index=True)
In [62]: test.head()
Out[62]:
FetchTime MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve RawMoist RawTDR MeasureNo
0 2012-11-07 11:38:17 11.37 48.2 5.15 88.87 15 344.50 11.84 89 1
1 2012-11-07 11:38:18 11.44 48.2 5.13 88.88 2 346.22 12.08 89 1
2 2012-11-07 11:38:19 11.10 48.2 4.96 89.00 3 337.84 11.83 89 1
3 2012-11-07 11:38:19 11.82 48.2 5.54 88.60 3 355.92 11.10 89 1
4 2012-11-07 11:38:20 12.61 48.2 5.87 88.38 3 375.72 12.80 89 1
But I don't see a way to influence the concatenation to get the order I need ... Is there another way to get the resulting DataFrame I need?
Here is a solution based on numpy's repeat and array indexing to build de-stacked values, and pandas' merge to output the concatenated result.
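The index pairing behind this approach can be seen on a toy array first. This is only a sketch: the array values and the column positions in col_idx are made up for illustration and are not taken from the question's data. The row indexes are repeated and the column indexes are tiled so that both arrays have the same length and line up pairwise.

```python
import numpy as np

# Toy 3x4 array standing in for df.values (values are made up for illustration).
values = np.arange(12).reshape(3, 4)
col_idx = np.array([1, 3])  # hypothetical positions of the columns to unpivot

n_rows, n_cols = values.shape[0], len(col_idx)
rows = np.repeat(np.arange(n_rows), n_cols)  # [0 0 1 1 2 2]
cols = np.tile(col_idx, n_rows)              # [1 3 1 3 1 3]

# Paired integer indexing picks values[rows[k], cols[k]] for each k,
# so the result walks each row left to right before moving to the next row.
flat = values[rows, cols]
print(flat)  # [ 1  3  5  7  9 11]
```

Because the row index changes slowest, the original row order is kept, which is the property the question asks for.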
First load a sample of your data into a DataFrame (slightly changed read_csv's arguments).
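from io import StringIO  # Python 3 replacement for cStringIO
import numpy as np
import pandas as pd

# Fields are separated by two or more spaces; the header field "; No" contains a single space.
data = """; No  Time  Date  MoistAve  MatTemp  TDRConduct  TDRAve  DeltaCount  tpAve  Moist1  Moist2  Moist3  Moist4  TDR1  TDR2  TDR3  TDR4
1  11:38:17  11.07.2012  11.37  48.20  5.15  88.87  15  344.50  11.84  11.35  11.59  15.25  89.0  89.0  89.0  88.0
2  11:38:18  11.07.2012  11.44  48.20  5.13  88.88  2  346.22  12.08  11.83  -1.00  -1.00  89.0  89.0  -1.0  -1.0
3  11:38:19  11.07.2012  11.10  48.20  4.96  89.00  3  337.84  11.83  11.59  10.62  -1.00  89.0  89.0  89.0  -1.0
4  11:38:19  11.07.2012  11.82  48.20  5.54  88.60  3  355.92  11.10  13.54  12.32  -1.00  89.0  88.0  88.0  -1.0
"""
# A regex separator needs the python engine; the raw string avoids an invalid-escape warning.
df = pd.read_csv(StringIO(data), header=0, sep=r'\s\s+', engine='python', na_values=['-1.0', '-1.00'])
# The dict form of parse_dates is deprecated in recent pandas, so combine the two columns explicitly.
# Add dayfirst=True if 11.07.2012 means 11 July.
df['FetchTime'] = pd.to_datetime(df.pop('Date') + ' ' + df.pop('Time'))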
Then build a de-stacked vector of TDRs and merge it with the original data frame:
stacked_col_names = ['TDR1', 'TDR2', 'TDR3', 'TDR4']
n_stacked = len(stacked_col_names)
# Each row index repeated once per stacked column: 0,0,0,0,1,1,1,1,...
repeated_row_indexes = np.repeat(np.arange(df.shape[0]), n_stacked)
# Column positions cycled once per row, so both index arrays have the same length
repeated_col_indexes = np.tile([df.columns.get_loc(c) for c in stacked_col_names], df.shape[0])
destacked_tdrs = pd.DataFrame(data=df.values[repeated_row_indexes, repeated_col_indexes], index=df.index[repeated_row_indexes], columns=['TDR'])
output = pd.merge(left=df, right=destacked_tdrs, left_index=True, right_index=True)
With the desired output:
output.loc[:, ['TDR1', 'TDR2', 'TDR3', 'TDR4', 'TDR']]
TDR1 TDR2 TDR3 TDR4 TDR
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 88
1 89 89 NaN NaN 89
1 89 89 NaN NaN 89
1 89 89 NaN NaN NaN
1 89 89 NaN NaN NaN
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN NaN
3 89 88 88 NaN 89
3 89 88 88 NaN 88
3 89 88 88 NaN 88
3 89 88 88 NaN NaN
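The same reshaping, covering both the Moist and TDR groups at once and adding a MeasureNo level, can probably also be done with pandas' wide_to_long. The following is a minimal sketch and not part of the approach above: the small frame is made up to mimic the column pattern in the question, and it assumes a column named No that uniquely identifies each row (in the real file that column is called "; No" and would need renaming).

```python
import pandas as pd

# Small made-up frame with the same column pattern as the question.
df = pd.DataFrame({
    "No": [1, 2],
    "MatTemp": [48.2, 48.2],
    "Moist1": [11.84, 12.08], "Moist2": [11.35, 11.83],
    "TDR1": [89.0, 89.0], "TDR2": [89.0, None],
})

long_df = (
    # Columns named <stub><number> are folded into one column per stub;
    # the numeric suffix becomes the new MeasureNo index level.
    pd.wide_to_long(df, stubnames=["Moist", "TDR"], i="No", j="MeasureNo")
      .sort_index()    # order by No first, then MeasureNo
      .reset_index()
)
print(long_df)
```

Sorting on the (No, MeasureNo) index gives the row-by-row order asked for in the question, while columns that do not match a stub name, such as MatTemp here, are carried along unchanged.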