I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.
Basically, what I have is something like this:
Table 1:
timestamp val_a val_b val_c
Table 2:
timestamp val_d val_e val_f val_g
Table 3:
timestamp val_h val_i
All timestamps are UNIX millisecond timestamps as numpy.uint64.
And what I want is:
timestamp val_a val_b val_c val_d val_e val_f val_g val_h val_i
...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.
So far, I have tried using pandas to convert the data like so:
df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)
and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?
I think that you can simply
timestamp
as the index
of each DataFrame
(use of set_index
)join
to merge them with the 'outer'
methodtimestamp
to datetime
Here is what it looks like.
# generating some test data
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
{'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# building a different index
timestamp = timestamp * np.random.randn(abs(1))
df2 = pd.DataFrame(
{'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
'g': ['val_g', 'val2_g']}, index=index)
# keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']}, index=index)
# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)
# You can convert timestamps to dates but it's not mandatory I think
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')
# Just perform a join and that's it
result = df1.join(df2, how='outer').join(df3, how='outer')
result
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With