
How to join many fragmented time series in one regular Pandas DataFrame in Python

I have to work with time series data imported from some CSVs which may look like this:

import pandas as pd

csv_a = [["Sensor_1", '2019-05-25 10:00', 25, 60],
         ["Sensor_2", '2019-05-25 10:00', 30, 45],
         ["Sensor_1", '2019-05-25 10:05', 26, None],
         ["Sensor_2", '2019-05-25 10:05', 30, 46],
         ["Sensor_1", '2019-05-25 10:10', 27, 63],
         ["Sensor_1", '2019-05-25 10:20', 28, 62]]

df_a = pd.DataFrame(csv_a, columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = (pd.to_datetime(df_a["Timestamp"]))

csv_b = [["Sensor_1", '2019-05-25 10:05', 1020],
         ["Sensor_2", '2019-05-25 10:05', 956],
         ["Sensor_3", '2019-05-25 10:05', 990],
         ["Sensor_1", '2019-05-25 10:10', 1021],
         ["Sensor_2", '2019-05-25 10:10', 957],
         ["Sensor_3", '2019-05-25 10:10', 992],
         ["Sensor_1", '2019-05-25 10:15', 1019]]

df_b = pd.DataFrame(csv_b, columns=["Sensor", "Timestamp", "Pressure"])
df_b["Timestamp"] = (pd.to_datetime(df_b["Timestamp"]))

As you can see, we have 3 sensors. Each sensor has its own time series with measurements of temperature, humidity and pressure. However, the data is split across two CSVs and can contain many gaps.

The objective is to join all the data into one ordered and regular DataFrame like this:

              Timestamp    Sensor  Temperature  Humidity  Pressure
0   2019-05-25 10:00:00  Sensor_1         25.0      60.0       NaN
1   2019-05-25 10:00:00  Sensor_2         30.0      45.0       NaN
2   2019-05-25 10:00:00  Sensor_3          NaN       NaN       NaN
3   2019-05-25 10:05:00  Sensor_1         26.0       NaN    1020.0
4   2019-05-25 10:05:00  Sensor_2         30.0      46.0     956.0
5   2019-05-25 10:05:00  Sensor_3          NaN       NaN     990.0
6   2019-05-25 10:10:00  Sensor_1         27.0      63.0    1021.0
7   2019-05-25 10:10:00  Sensor_2          NaN       NaN     957.0
8   2019-05-25 10:10:00  Sensor_3          NaN       NaN     992.0
9   2019-05-25 10:15:00  Sensor_1          NaN       NaN    1019.0
10  2019-05-25 10:15:00  Sensor_2          NaN       NaN       NaN
11  2019-05-25 10:15:00  Sensor_3          NaN       NaN       NaN
12  2019-05-25 10:20:00  Sensor_1         28.0      62.0       NaN
13  2019-05-25 10:20:00  Sensor_2          NaN       NaN       NaN
14  2019-05-25 10:20:00  Sensor_3          NaN       NaN       NaN

The logic behind this is that, taken as a whole, the data in the CSVs starts at 10:00 and ends at 10:20, and that there are 3 possible variables for 3 different sensors. So I want the first two columns (Timestamp and Sensor) to be regular, ordered and without gaps. The remaining columns (Temperature, Humidity and Pressure) should be filled with the data from the CSVs where it exists and left as NaN otherwise.

I have tried to perform this using the pandas merge function in many different ways but I can't obtain the result I want. I hope someone more experienced can help me.

eliteA92 asked Aug 01 '19 08:08



1 Answer

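First combine both DataFrames with concat, after calling DataFrame.set_index so that each one is indexed by Timestamp and Sensor. Because the same (Timestamp, Sensor) pair can then appear more than once, aggregate with sum so that the resulting MultiIndex is unique.

Then add the missing rows with DataFrame.reindex, using a MultiIndex.from_product built from a date_range between the minimal and maximal timestamps and the unique sensors: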

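# Stack both frames on a shared (Timestamp, Sensor) index; sort=True orders the columns.
df = (pd.concat([df_a.set_index(['Timestamp', 'Sensor']),
                 df_b.set_index(['Timestamp', 'Sensor'])], sort=True)
        # Collapse repeated (Timestamp, Sensor) pairs; min_count=1 keeps NaN
        # where a pair has no valid value instead of returning 0.
        .groupby(level=[0, 1]).sum(min_count=1))

# Full grid: every 5-minute step between the first and last timestamp, for every sensor.
d = df.index.get_level_values(0)
mux = pd.MultiIndex.from_product([pd.date_range(d.min(), d.max(), freq='5min'),
                                  df.index.get_level_values(1).unique()], names=df.index.names)
# Reindexing adds the missing (Timestamp, Sensor) rows, filled with NaN.
df = df.reindex(mux).reset_index()
print(df)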

             Timestamp    Sensor  Humidity  Pressure  Temperature
0  2019-05-25 10:00:00  Sensor_1      60.0       NaN         25.0
1  2019-05-25 10:00:00  Sensor_2      45.0       NaN         30.0
2  2019-05-25 10:00:00  Sensor_3       NaN       NaN          NaN
3  2019-05-25 10:05:00  Sensor_1       NaN    1020.0         26.0
4  2019-05-25 10:05:00  Sensor_2      46.0     956.0         30.0
5  2019-05-25 10:05:00  Sensor_3       NaN     990.0          NaN
6  2019-05-25 10:10:00  Sensor_1      63.0    1021.0         27.0
7  2019-05-25 10:10:00  Sensor_2       NaN     957.0          NaN
8  2019-05-25 10:10:00  Sensor_3       NaN     992.0          NaN
9  2019-05-25 10:15:00  Sensor_1       NaN    1019.0          NaN
10 2019-05-25 10:15:00  Sensor_2       NaN       NaN          NaN
11 2019-05-25 10:15:00  Sensor_3       NaN       NaN          NaN
12 2019-05-25 10:20:00  Sensor_1      62.0       NaN         28.0
13 2019-05-25 10:20:00  Sensor_2       NaN       NaN          NaN
14 2019-05-25 10:20:00  Sensor_3       NaN       NaN          NaN
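
Since the question mentions trying merge, here is a minimal sketch of the same idea built on an outer merge instead of concat plus sum. It is only an illustration under one assumption: each (Timestamp, Sensor) pair appears at most once per input frame, as in the sample data (with duplicates, reindex would raise). The names keys, merged, grid and result are introduced here for illustration, and the final column selection simply restores the column order shown in the question.

```python
import pandas as pd

# Sample data copied from the question.
df_a = pd.DataFrame(
    [["Sensor_1", "2019-05-25 10:00", 25, 60],
     ["Sensor_2", "2019-05-25 10:00", 30, 45],
     ["Sensor_1", "2019-05-25 10:05", 26, None],
     ["Sensor_2", "2019-05-25 10:05", 30, 46],
     ["Sensor_1", "2019-05-25 10:10", 27, 63],
     ["Sensor_1", "2019-05-25 10:20", 28, 62]],
    columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_b = pd.DataFrame(
    [["Sensor_1", "2019-05-25 10:05", 1020],
     ["Sensor_2", "2019-05-25 10:05", 956],
     ["Sensor_3", "2019-05-25 10:05", 990],
     ["Sensor_1", "2019-05-25 10:10", 1021],
     ["Sensor_2", "2019-05-25 10:10", 957],
     ["Sensor_3", "2019-05-25 10:10", 992],
     ["Sensor_1", "2019-05-25 10:15", 1019]],
    columns=["Sensor", "Timestamp", "Pressure"])
for frame in (df_a, df_b):
    frame["Timestamp"] = pd.to_datetime(frame["Timestamp"])

keys = ["Timestamp", "Sensor"]

# Outer merge keeps every (Timestamp, Sensor) pair found in either frame.
# Assumption: the pairs are unique within each frame.
merged = pd.merge(df_a, df_b, on=keys, how="outer")

# Regular grid: every 5-minute step between the first and last timestamp,
# crossed with every sensor seen in the data.
grid = pd.MultiIndex.from_product(
    [pd.date_range(merged["Timestamp"].min(), merged["Timestamp"].max(), freq="5min"),
     sorted(merged["Sensor"].unique())],
    names=keys)

# Reindex onto the grid (missing rows become NaN), then restore the
# column order requested in the question.
result = (merged.set_index(keys)
                .reindex(grid)
                .reset_index()[keys + ["Temperature", "Humidity", "Pressure"]])
print(result)
```

If duplicates are possible, the concat plus sum approach above (or another aggregation such as mean or first) is the safer choice, because it makes the index unique before reindexing.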
jezrael answered Oct 25 '22 02:10