Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do a left join on a non unique column/index in Deedle

Tags:

pandas

f#

deedle

I am trying to do a left join between two data frames in Deedle. Examples of the two data frames are below:

let workOrders = 
    Frame.ofColumns [ 
        "workOrderCode" =?> series [ (20050,20050); (20051,20051); (20060,20060) ]
        "workOrderDescription" =?> series [ (20050,"Door Repair"); (20051,"Lift Replacement"); (20060,"Window Cleaning") ]]

// This does not compile due to the duplicate Work Order Codes
let workOrderScores = 
    Frame.ofColumns [ 
       "workOrderCode" => series [ (20050,20050); (20050,20050); (20051,20051) ]
       "runTime" => series [ (20050,20100112); (20050,20100130); (20051,20100215) ] 
       "score" => series [ (20050,100); (20050,120); (20051,80) ]] 

Frame.join JoinKind.Outer workOrders workOrderScores

The problem is that Deedle will not let me create a data frame with a non unique index and I get the following error: System.ArgumentException: Duplicate key '20050'. Duplicate keys are not allowed in the index.

Interestingly in Python/Pandas I can do the following which works perfectly. How can I reproduce this result in Deedle? I am thinking that I might have to flatten the second data frame to remove the duplicates then join and then unpivot/unstack it?

workOrders = pd.DataFrame(
    {'workOrderCode': [20050, 20051, 20060],
        'workOrderDescription': ['Door Repair', 'Lift Replacement', 'Window Cleaning']})

workOrderScores = pd.DataFrame(
    {'workOrderCode': [20050, 20050, 20051],
        'runTime': [20100112, 20100130, 20100215],
        'score' : [100, 120, 80]})

pd.merge(workOrders, workOrderScores, on = 'workOrderCode', how = 'left')

# Result:
#   workOrderCode workOrderDescription   runTime  score
#0          20050          Door Repair  20100112    100
#1          20050          Door Repair  20100130    120
#2          20051     Lift Replacement  20100215     80
#3          20060      Window Cleaning       NaN    NaN
like image 822
jeremyh Avatar asked Feb 16 '14 16:02

jeremyh


1 Answers

This is a great question - I have to admit, there is currently no elegant way to do this with Deedle. Could you please submit an issue to GitHub to make sure we keep track of this and add some solution?

As you say, Deedle does not let you have duplicate values in the keys currently - although, your Pandas solution also does not use duplicate keys - you simply use the fact that Pandas lets you specify the column to use when joining (and I think this would be great addition to Deedle).

Here is one way to do what you wanted - but not very nice. I think using pivoting would be another option (there is a nice pivot table function in the latest source code - not yet on NuGet).

I used groupByRows and nest to turn your data frames into series grouped by the workOrderCode (each item now contains a frame with all rows that have the same work order code):

let workOrders = 
    Frame.ofColumns [ 
        "workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
        "workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ]]
   |> Frame.groupRowsByInt "workOrderCode"
   |> Frame.nest

let workOrderScores = 
    Frame.ofColumns [ 
       "workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
       "runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ] 
       "score" => Series.ofValues [ 100; 120; 80 ]] 
   |> Frame.groupRowsByInt "workOrderCode"
   |> Frame.nest

Now we can join the two series (because their work order codes are the keys). However, then you get one or two data frames for each joined order code and there is quite a lot of work needed to outer join the rows of the two frames:

// Join the two series to align frames with the same work order code
Series.zip workOrders workOrderScores
|> Series.map(fun _ (orders, scores) -> 
    match orders, scores with
    | OptionalValue.Present s1, OptionalValue.Present s2 ->
        // There is a frame with some rows with the specified code in both 
        // work orders and work order scores - we return a cross product of their rows
        [ for r1 in s1.Rows.Values do
          for r2 in s2.Rows.Values do 
          // Drop workOrderCode from one series (they are the same in both)
          // and append the two rows & return that as the result
          yield Series.append r1 (Series.filter (fun k _ -> k <> "workOrderCode") r2) ]
        |> Frame.ofRowsOrdinal
    // If left or right value is missing, we just return the columns
    // that are available (others will be filled with NaN)
    | OptionalValue.Present s, _ 
    | _, OptionalValue.Present s -> s)
|> Frame.unnest
|> Frame.indexRowsOrdinally

This might be slow (especially in the NuGet version). If you work on more data, please try building latest version of Deedle from sources (and if that does not help, please submit an issue - we should look into this!)

like image 100
Tomas Petricek Avatar answered Nov 15 '22 09:11

Tomas Petricek