I am trying to do a left join between two data frames in Deedle. Examples of the two data frames are below:
let workOrders =
Frame.ofColumns [
"workOrderCode" =?> series [ (20050,20050); (20051,20051); (20060,20060) ]
"workOrderDescription" =?> series [ (20050,"Door Repair"); (20051,"Lift Replacement"); (20060,"Window Cleaning") ]]
// This does not compile due to the duplicate Work Order Codes
let workOrderScores =
Frame.ofColumns [
"workOrderCode" => series [ (20050,20050); (20050,20050); (20051,20051) ]
"runTime" => series [ (20050,20100112); (20050,20100130); (20051,20100215) ]
"score" => series [ (20050,100); (20050,120); (20051,80) ]]
Frame.join JoinKind.Outer workOrders workOrderScores
The problem is that Deedle will not let me create a data frame with a non unique index and I get the following error: System.ArgumentException: Duplicate key '20050'. Duplicate keys are not allowed in the index.
Interestingly in Python/Pandas I can do the following which works perfectly. How can I reproduce this result in Deedle? I am thinking that I might have to flatten the second data frame to remove the duplicates then join and then unpivot/unstack it?
workOrders = pd.DataFrame(
{'workOrderCode': [20050, 20051, 20060],
'workOrderDescription': ['Door Repair', 'Lift Replacement', 'Window Cleaning']})
workOrderScores = pd.DataFrame(
{'workOrderCode': [20050, 20050, 20051],
'runTime': [20100112, 20100130, 20100215],
'score' : [100, 120, 80]})
pd.merge(workOrders, workOrderScores, on = 'workOrderCode', how = 'left')
# Result:
# workOrderCode workOrderDescription runTime score
#0 20050 Door Repair 20100112 100
#1 20050 Door Repair 20100130 120
#2 20051 Lift Replacement 20100215 80
#3 20060 Window Cleaning NaN NaN
This is a great question - I have to admit, there is currently no elegant way to do this with Deedle. Could you please submit an issue to GitHub to make sure we keep track of this and add some solution?
As you say, Deedle does not let you have duplicate values in the keys currently - although, your Pandas solution also does not use duplicate keys - you simply use the fact that Pandas lets you specify the column to use when joining (and I think this would be great addition to Deedle).
Here is one way to do what you wanted - but not very nice. I think using pivoting would be another option (there is a nice pivot table function in the latest source code - not yet on NuGet).
I used groupByRows
and nest
to turn your data frames into series grouped by the workOrderCode
(each item now contains a frame with all rows that have the same work order code):
let workOrders =
Frame.ofColumns [
"workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
"workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ]]
|> Frame.groupRowsByInt "workOrderCode"
|> Frame.nest
let workOrderScores =
Frame.ofColumns [
"workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
"runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ]
"score" => Series.ofValues [ 100; 120; 80 ]]
|> Frame.groupRowsByInt "workOrderCode"
|> Frame.nest
Now we can join the two series (because their work order codes are the keys). However, then you get one or two data frames for each joined order code and there is quite a lot of work needed to outer join the rows of the two frames:
// Join the two series to align frames with the same work order code
Series.zip workOrders workOrderScores
|> Series.map(fun _ (orders, scores) ->
match orders, scores with
| OptionalValue.Present s1, OptionalValue.Present s2 ->
// There is a frame with some rows with the specified code in both
// work orders and work order scores - we return a cross product of their rows
[ for r1 in s1.Rows.Values do
for r2 in s2.Rows.Values do
// Drop workOrderCode from one series (they are the same in both)
// and append the two rows & return that as the result
yield Series.append r1 (Series.filter (fun k _ -> k <> "workOrderCode") r2) ]
|> Frame.ofRowsOrdinal
// If left or right value is missing, we just return the columns
// that are available (others will be filled with NaN)
| OptionalValue.Present s, _
| _, OptionalValue.Present s -> s)
|> Frame.unnest
|> Frame.indexRowsOrdinally
This might be slow (especially in the NuGet version). If you work on more data, please try building latest version of Deedle from sources (and if that does not help, please submit an issue - we should look into this!)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With