
Merging time series data by timestamp using numpy/pandas

I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps as numpy.uint64.

And what I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?

vind asked Aug 25 '15 22:08

1 Answer

I think you can simply:

  • Set the timestamp as the index of each DataFrame (using set_index)
  • Join the DataFrames with how='outer', so that rows missing from one source are filled with NaN
  • Optionally convert the timestamps to datetimes

Here is what it looks like.

import numpy as np
import pandas as pd

# generating some test data
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# building a different index by scaling the timestamps with a random positive factor
timestamp = (np.array(timestamp) * np.abs(np.random.randn(1))).astype('int64')
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# Optional: convert the timestamps to datetimes. unit='s' matches this test data;
# use unit='ms' for millisecond timestamps like the ones in the question.
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# Just perform a join and that's it
result = df1.join(df2, how='outer').join(df3, how='outer')
result
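
For the case described in the question (millisecond timestamps stored as numpy.uint64, missing values written as N/A in one combined CSV), the same outer join could be sketched roughly as follows. The sensor values and the three timestamp arrays here are made up for illustration, and writing the CSV with to_csv(na_rep='N/A') is my assumption about the desired output format rather than something stated in the question.

```python
import numpy as np
import pandas as pd

# Made-up sample data: UNIX millisecond timestamps as uint64, one frame per sensor.
ts1 = np.array([1440540000000, 1440540001000, 1440540003000], dtype=np.uint64)
ts2 = np.array([1440540000000, 1440540002000], dtype=np.uint64)
ts3 = np.array([1440540001000, 1440540004000], dtype=np.uint64)

df1 = pd.DataFrame({'timestamp': ts1, 'val_a': [0.1, 0.2, 0.3],
                    'val_b': [1.1, 1.2, 1.3], 'val_c': [2.1, 2.2, 2.3]})
df2 = pd.DataFrame({'timestamp': ts2, 'val_d': [3.1, 3.2], 'val_e': [4.1, 4.2],
                    'val_f': [5.1, 5.2], 'val_g': [6.1, 6.2]})
df3 = pd.DataFrame({'timestamp': ts3, 'val_h': [7.1, 7.2], 'val_i': [8.1, 8.2]})

# Index every frame by timestamp so the join aligns rows on it.
frames = [df.set_index('timestamp') for df in (df1, df2, df3)]

# Outer joins keep every timestamp seen in any source; gaps become NaN.
merged = frames[0].join(frames[1], how='outer').join(frames[2], how='outer')
merged = merged.sort_index()

# With no path argument, to_csv returns the CSV text; na_rep controls how NaN is written.
csv_text = merged.to_csv(na_rep='N/A')
print(csv_text)
```

Passing a file name to to_csv instead of leaving the path out would write the combined file directly.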

Romain answered Oct 21 '22 09:10
