Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strange behavior of groupBy

Tags:

haskell

I am wondering why the following call of groupBy does not work: My predicate is x < y, so I expect [1, 6] to be a group, but instead, Haskell put [1, 6, 4, 2] into a group.

Prelude Data.List> groupBy (\x y -> x < y) [8,5,3,2,1,6,4,2]
[[8],[5],[3],[2],[1,6,4,2]]

More strangely, when I change the last number to -2, I expect the same behavior as in the above example. That is, since both 2 and -2 are less than 4, I expect that in the result [1, 6, 4, -2] would make up a group. But instead, This time, Haskell put -2 to be a group.

Prelude Data.List> groupBy (\x y -> x < y) [8,5,3,2,1,6,4,-2]
[[8],[5],[3],[2],[1,6,4],[-2]]

Do I have a wrong understanding of groupBy?

like image 420
gefei Avatar asked Jul 31 '21 10:07

gefei


People also ask

What does Groupby mean and what is it used for?

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. Used to determine the groups for the groupby.

What is possible using Groupby () method of pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.

Is pandas Groupby lazy?

It is mentioned in this tutorial that pandas groupby object is lazy. it's lazy in nature. It doesn't really do any operations to produce a useful result until you say so.

What does Groupby size do?

groupby() function is one of the most useful function in the library it splits the data into groups based on columns/conditions and then apply some operations eg. size() which counts the number of entries/rows in each group. The groupby() can also be applied on series.


2 Answers

In the implementation of the groupBy, x is always the first item of the sublist. Indeed, groupBy is implemented as:

groupBy                 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _  []           =  []
groupBy eq (x:xs)       =  (x:ys) : groupBy eq zs
                           where (ys,zs) = span (eq x) xs

especially the span (eq x) is important here, since x will be the first item of a new group.

Since x is thus not the previous value in the list. If we thus run groupBy with the list [5, 3, 2, 1, 6, 4, -2], we get:

list current list x=? check with outcome
[5,3,2,1,6,4,-2] [8] 8 / /
[5,3,2,1,6,4,-2] [8] 8 5 False
[3,2,1,6,4,-2] [5] 5 / /
[3,2,1,6,4,-2] [5] 5 3 False
[3,2,1,6,4,-2] [3] 3 / /
[2,1,6,4,-2] [3] 3 2 False
[2,1,6,4,-2] [2] 2 1 False
[1,6,4,-2] [2] 2 / /
[1,6,4,-2] [2] 2 1 False
[6,4,-2] [1] 1 / /
[4,-2] [1,6] 1 6 True
[-2] [1,6,4] 1 4 True
[] [-2] -2 / /

Especially the case where we compare x=1 and y=4 is important. If x was only the previous value, we should start generating a new list, but since x is the first item of the list, that is not the case.

Normally you should only work with an equivalence relation ~ [wiki], such relation is:

  1. reflexive: so x ~ x is true;
  2. symmetric: so x ~ y if and only if y ~ x; and
  3. transitive: so x ~ y and y ~ z implies that x ~ z.

Your equivalence relation is not reflexive, nor is it symmetric. This is thus not a valid function to work with groupBy.

like image 66
Willem Van Onsem Avatar answered Oct 22 '22 09:10

Willem Van Onsem


The conceptual definition of groupBy p l is that it yields sublists of l such that for each xs in l, you have

all (==True) [p x y | x<-xs, y<-xs]

IOW, each sublist should be part of an equivalence class of p. That notion only makes sense if p is an equivalence relation. In particular, you need p x y ≡ p y x, and the defining equation also assumes that p x x is always true.

The implementation in the standard libraries shows that idea quite clearly: each x:ys list in the result has ys defined by the span of elements that are equivalent to x by the relation. So in your case, you get 1:[6,4,2], where 6,4,2 are all greater than 1.

Evidently, groupBy doesn't actually check p x y for all pairs of elements in the result lists, so this really only makes sense if p is indeed an equivalence relation.

What you expected the idea to be – and IMO this is not unreasonable – is that only for all x,y such that x is the left neighbour of y, you want p x y to hold. This is in general a weaker condition, but if p is an equivalence relation then it actually implies the original condition, because such a relation also is transitive. So maybe the implementation should actually be

groupBy' :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy' _ [] = []
groupBy' _ (x:l) = (x:xs) : zss
 where (xs,zss) = case l of
        [] -> ([],[])
        zs@(y:_)
         -> let ys:zss' = groupBy' p zs
            in if p x y then (ys, zss')
                        else ([], ys:zss')

(This could be simplified a bit, but then it wouldn't be as lazy as the old implementation.)

like image 38
leftaroundabout Avatar answered Oct 22 '22 11:10

leftaroundabout