
dataframe: transform row-based transaction data into aggregates per date

I retrieve data from an SQLite database (and transform it into a pandas DataFrame) in the following format:

Driver | Date loading | Date unloading | Loading Address | Unloading Address
Peter  | 02.05.2020   | 03.05.2020     | 12342, Berlin   | 14221, Utrecht
Peter  | 03.05.2020   | 04.05.2020     | 14221, Utrecht  | 13222, Amsterdam
Franz  | 03.05.2020   | 03.05.2020     | 11111, Somewher | 11221, Somewhere2
Franz  | 03.05.2020   | 05.05.2020     | 11223, Upsalla  | 14231, Berlin

The date range can be specified in the query, so that it gives an overview of which driver has which transports to deliver within the specified range, ordered by date.
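
Roughly, the retrieval looks like this (a simplified sketch; the real table name, column quoting, and date storage format may differ, and BETWEEN only filters correctly like this if the dates are stored in a sortable ISO format):

import sqlite3
import pandas as pd

# Simplified sketch -- 'transports' is a placeholder table name
conn = sqlite3.connect('transports.db')
query = """
    SELECT Driver, "Date loading", "Date unloading",
           "Loading Address", "Unloading Address"
    FROM transports
    WHERE "Date loading" BETWEEN ? AND ?
    ORDER BY "Date loading"
"""
df = pd.read_sql_query(query, conn, params=('2020-05-02', '2020-05-05'))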

The goal of the transformation is a weekly plan for each driver, with the dates from the range spread across the columns. For the data above, this would look like the following:

Driver | 02.05.2020    | 03.05.2020        | 04.05.2020       | 05.05.2020
Peter  | Loading:      | Unloading:        | Unloading:       |
       | 12342, Berlin | 14221, Utrecht    | 13222, Amsterdam |
       |               | Loading:          |                  |
       |               | 14221, Utrecht    |                  |
       |               |                   |                  |
Franz  |               | Loading:          |                  | Unloading:
       |               | 11111, Somewher   |                  | 14231, Berlin
       |               | Unloading:        |                  |
       |               | 11221, Somewhere2 |                  |
       |               | Loading:          |                  |
       |               | 11223, Upsalla    |                  |

Is there any way to achieve the described output with DataFrame operations? Within a single date column I need to keep the order: loading first, unloading second, then move on to the next data row if the date is the same.

asked Jun 05 '20 by Cut7er

1 Answer

I will write my pseudo-ish solution. It is actually a working solution; it only lacks a single entity, a task_id, which I will elaborate on later. I will call your DataFrame (the first one in the question) df and create a transformed version called t_df. This t_df is an unpivoted table in which the loading and unloading dates and addresses are unified.

I will create a data frame as follows:

Driver | Date         | Task       | Address 
Peter  | 02.05.2020   | Loading    | 12342, Berlin
Peter  | 03.05.2020   | Unloading  | 14221, Utrecht

With this unpivoted data frame I can now pivot it into a schedule, just as you wanted.

import pandas as pd

m, n = df.shape  # m = number of transport rows

# Unpivot: stack the loading rows on top of the unloading rows
t_df = pd.DataFrame({
    'Driver': df['Driver'].tolist() * 2,
    'Date': df['Date loading'].tolist() + df['Date unloading'].tolist(),
    'Task': ['Loading'] * m + ['Unloading'] * m,
    'Address': df['Loading Address'].tolist() + df['Unloading Address'].tolist(),
})
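
As a side note, the same unpivot can also be built by selecting and renaming the two halves explicitly and concatenating them (a sketch, assuming the column names from the question):

# Equivalent unpivot: build the Loading and Unloading halves and stack them
loading = df[['Driver', 'Date loading', 'Loading Address']].rename(
    columns={'Date loading': 'Date', 'Loading Address': 'Address'})
loading['Task'] = 'Loading'

unloading = df[['Driver', 'Date unloading', 'Unloading Address']].rename(
    columns={'Date unloading': 'Date', 'Unloading Address': 'Address'})
unloading['Task'] = 'Unloading'

t_df = pd.concat([loading, unloading], ignore_index=True)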

Now, I add the combined Task + Address values as a new Compound column.

# Join task and address into one cell value, e.g. 'Loading: 12342, Berlin'
t_df['Compound'] = t_df[['Task', 'Address']].agg(': '.join, axis=1)

# Joins multiple tasks of the same driver and day into one cell
concat_array = lambda x: '; '.join(x)

schedule = pd.crosstab(index=t_df['Driver'], columns=t_df['Date'],
                       values=t_df['Compound'], aggfunc=concat_array)

I will get the following data frame:

Date                02.05.2020  ...                05.05.2020
Driver                          ...                          
Franz                      NaN  ...  Unloading: 14231, Berlin
Peter   Loading: 12342, Berlin  ...                       NaN
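
A small side note: the '...' is only pandas truncating wide console output; widening the display shows every date column.

# Optional: show all date columns instead of the truncated view
pd.set_option('display.max_columns', None)
print(schedule)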

Now, as I said at the very beginning of the answer, you need some kind of task identifier to match which loading belongs to which unloading when there are multiple load & unload operations on the same day. You need to assign some kind of task_id and then put it into the Compound column.
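
A sketch of that idea, assuming the original row order is what pairs each loading with its unloading:

# Tag each source row with a task_id (same id for its load and unload)
t_df['task_id'] = list(range(m)) * 2

# Put the id into the cell text so paired tasks can be matched
t_df['Compound'] = (t_df['task_id'].astype(str) + '. '
                    + t_df['Task'] + ': ' + t_df['Address'])

# Sort so that within one date the row order is kept and Loading comes
# before Unloading; the groupby behind crosstab preserves this order
t_df['rank'] = (t_df['Task'] == 'Unloading').astype(int)
t_df = t_df.sort_values(['Driver', 'Date', 'task_id', 'rank'])

schedule = pd.crosstab(index=t_df['Driver'], columns=t_df['Date'],
                       values=t_df['Compound'], aggfunc=concat_array)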

Note: I used '; ' to separate the tasks; you may want to use something else.

You can check the complete code file in the gist.

answered Sep 30 '22 by null