How to join many fragmented time series in one regular Pandas DataFrame in Python

Tags: python, pandas, dataframe, time-series

I have to work with time series data imported from some CSVs which may look like this:

import pandas as pd

csv_a = [["Sensor_1", '2019-05-25 10:00', 25, 60],
         ["Sensor_2", '2019-05-25 10:00', 30, 45],
         ["Sensor_1", '2019-05-25 10:05', 26, None],
         ["Sensor_2", '2019-05-25 10:05', 30, 46],
         ["Sensor_1", '2019-05-25 10:10', 27, 63],
         ["Sensor_1", '2019-05-25 10:20', 28, 62]]

df_a = pd.DataFrame(csv_a, columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = (pd.to_datetime(df_a["Timestamp"]))

csv_b = [["Sensor_1", '2019-05-25 10:05', 1020],
         ["Sensor_2", '2019-05-25 10:05', 956],
         ["Sensor_3", '2019-05-25 10:05', 990],
         ["Sensor_1", '2019-05-25 10:10', 1021],
         ["Sensor_2", '2019-05-25 10:10', 957],
         ["Sensor_3", '2019-05-25 10:10', 992],
         ["Sensor_1", '2019-05-25 10:15', 1019]]

df_b = pd.DataFrame(csv_b, columns=["Sensor", "Timestamp", "Pressure"])
df_b["Timestamp"] = (pd.to_datetime(df_b["Timestamp"]))

As you can see, we have 3 sensors. Each sensor has its own time series with measurements of temperature, humidity and pressure. However, the data is split across two CSVs and can have many gaps.

The objective is to join all the data into one ordered and regular DataFrame like this:

              Timestamp    Sensor  Temperature  Humidity  Pressure
0   2019-05-25 10:00:00  Sensor_1         25.0      60.0       NaN
1   2019-05-25 10:00:00  Sensor_2         30.0      45.0       NaN
2   2019-05-25 10:00:00  Sensor_3          NaN       NaN       NaN
3   2019-05-25 10:05:00  Sensor_1         26.0       NaN    1020.0
4   2019-05-25 10:05:00  Sensor_2         30.0      46.0     956.0
5   2019-05-25 10:05:00  Sensor_3          NaN       NaN     990.0
6   2019-05-25 10:10:00  Sensor_1         27.0      63.0    1021.0
7   2019-05-25 10:10:00  Sensor_2          NaN       NaN     957.0
8   2019-05-25 10:10:00  Sensor_3          NaN       NaN     992.0
9   2019-05-25 10:15:00  Sensor_1          NaN       NaN    1019.0
10  2019-05-25 10:15:00  Sensor_2          NaN       NaN       NaN
11  2019-05-25 10:15:00  Sensor_3          NaN       NaN       NaN
12  2019-05-25 10:20:00  Sensor_1         28.0      62.0       NaN
13  2019-05-25 10:20:00  Sensor_2          NaN       NaN       NaN
14  2019-05-25 10:20:00  Sensor_3          NaN       NaN       NaN

The logic behind this is that, taken together, the data in the CSVs starts at 10:00 and ends at 10:20, and that we have 3 possible variables for 3 different sensors. So I want the first two columns (Timestamp and Sensor) to be regular, ordered and without gaps. The remaining columns (Temperature, Humidity and Pressure) should be filled with the data from the CSVs where available, and left as NaN otherwise.

I have tried to perform this using the pandas merge function in many different ways but I can't obtain the result I want. I hope someone more experienced can help me.

306

asked Aug 01 '19 08:08

eliteA92

1 Answer

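First, stack both DataFrames with concat after moving Timestamp and Sensor into the index with DataFrame.set_index. Because the same (Timestamp, Sensor) pair can then appear more than once, aggregate with sum so that the MultiIndex built from timestamps and sensors is unique.

Then add the missing rows with DataFrame.reindex, using a MultiIndex.from_product built from a date_range between the minimal and maximal timestamps and the unique sensor names: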

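# Stack both frames on a shared (Timestamp, Sensor) index, then collapse
# duplicate index pairs; min_count=1 keeps all-NaN groups as NaN instead of 0.
# sum(level=...) was removed in pandas 2.0, so group by the index levels instead.
df = (pd.concat([df_a.set_index(['Timestamp','Sensor']),
                df_b.set_index(['Timestamp','Sensor'])], sort=True)
        .groupby(level=[0,1])
        .sum(min_count=1))

# Build the full grid: every 5-minute timestamp combined with every sensor.
d = df.index.get_level_values(0)
mux = pd.MultiIndex.from_product([pd.date_range(d.min(), d.max(), freq='5min'),
                                  df.index.get_level_values(1).unique()], names=df.index.names)
# Reindexing onto the grid adds the missing rows, filled with NaN.
df = df.reindex(mux).reset_index()
print(df)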

             Timestamp    Sensor  Humidity  Pressure  Temperature
0  2019-05-25 10:00:00  Sensor_1      60.0       NaN         25.0
1  2019-05-25 10:00:00  Sensor_2      45.0       NaN         30.0
2  2019-05-25 10:00:00  Sensor_3       NaN       NaN          NaN
3  2019-05-25 10:05:00  Sensor_1       NaN    1020.0         26.0
4  2019-05-25 10:05:00  Sensor_2      46.0     956.0         30.0
5  2019-05-25 10:05:00  Sensor_3       NaN     990.0          NaN
6  2019-05-25 10:10:00  Sensor_1      63.0    1021.0         27.0
7  2019-05-25 10:10:00  Sensor_2       NaN     957.0          NaN
8  2019-05-25 10:10:00  Sensor_3       NaN     992.0          NaN
9  2019-05-25 10:15:00  Sensor_1       NaN    1019.0          NaN
10 2019-05-25 10:15:00  Sensor_2       NaN       NaN          NaN
11 2019-05-25 10:15:00  Sensor_3       NaN       NaN          NaN
12 2019-05-25 10:20:00  Sensor_1      62.0       NaN         28.0
13 2019-05-25 10:20:00  Sensor_2       NaN       NaN          NaN
14 2019-05-25 10:20:00  Sensor_3       NaN       NaN          NaN
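
The output above has its value columns in alphabetical order because of sort=True, while the question asks for Temperature, Humidity, Pressure. As a rough sketch only, the same idea can be wrapped in a helper that accepts any number of fragments and keeps the columns in order of first appearance. The function name join_fragments and its parameters are hypothetical, not part of pandas. It also swaps sum for GroupBy.first(), which keeps the first non-null value per column; that assumes duplicate (Timestamp, Sensor) rows never carry conflicting readings.

```python
import pandas as pd


def join_fragments(frames, freq="5min", keys=("Timestamp", "Sensor")):
    """Hypothetical helper: stack fragments and reindex onto a regular grid."""
    keys = list(keys)
    # Stack all fragments; a column missing from a fragment becomes NaN.
    # sort=False keeps columns in order of first appearance.
    stacked = pd.concat([f.set_index(keys) for f in frames], sort=False)
    # Assumption: duplicates do not conflict, so the first non-null value
    # per column is taken for each (Timestamp, Sensor) pair.
    combined = stacked.groupby(level=keys).first()
    times = combined.index.get_level_values(keys[0])
    # Full grid: every timestamp at the given frequency x every sensor seen.
    grid = pd.MultiIndex.from_product(
        [pd.date_range(times.min(), times.max(), freq=freq),
         combined.index.get_level_values(keys[1]).unique()],
        names=keys,
    )
    return combined.reindex(grid).reset_index()


# Sample data copied from the question.
df_a = pd.DataFrame(
    [["Sensor_1", "2019-05-25 10:00", 25, 60],
     ["Sensor_2", "2019-05-25 10:00", 30, 45],
     ["Sensor_1", "2019-05-25 10:05", 26, None],
     ["Sensor_2", "2019-05-25 10:05", 30, 46],
     ["Sensor_1", "2019-05-25 10:10", 27, 63],
     ["Sensor_1", "2019-05-25 10:20", 28, 62]],
    columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_b = pd.DataFrame(
    [["Sensor_1", "2019-05-25 10:05", 1020],
     ["Sensor_2", "2019-05-25 10:05", 956],
     ["Sensor_3", "2019-05-25 10:05", 990],
     ["Sensor_1", "2019-05-25 10:10", 1021],
     ["Sensor_2", "2019-05-25 10:10", 957],
     ["Sensor_3", "2019-05-25 10:10", 992],
     ["Sensor_1", "2019-05-25 10:15", 1019]],
    columns=["Sensor", "Timestamp", "Pressure"])
for frame in (df_a, df_b):
    frame["Timestamp"] = pd.to_datetime(frame["Timestamp"])

result = join_fragments([df_a, df_b])
print(result)
```

If duplicate rows can hold genuinely different readings, first() silently drops all but one of them, so an explicit aggregation (sum, mean, or a check that raises) may be the safer choice there.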

138

answered Oct 25 '22 02:10

jezrael
