Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell function nub inefficient

I'm confused by the implementation of the 'nub' (select unique values) function in the Haskell standard library Data.List. The GHC implementation is

nub l                   = nub' l []
  where
    nub' [] _           = []
    nub' (x:xs) ls
        | x `elem` ls   = nub' xs ls
        | otherwise     = x : nub' xs (x:ls)

As far as I can tell, this has a worst-case time complexity of O(n^2), since for a list of unique values it has to compare them all once to see that they are in fact unique.

If one used a hash table, the complexity could be reduced to O(n) for building the table + O(1) for checking each value against previous values in the hash table. Granted, this would not produce an ordered list but that would also be possible in O(n log n) using GHC's own ordered Data.Map, if that is necessary.

Why choose such an inefficient implementation for an important library function? I understand efficiency is not a main concern in Haskell but at least the standard library could make an effort to choose the (asymptotically) best data structure for the job.

like image 305
jforberg Avatar asked Jan 18 '14 21:01

jforberg


3 Answers

Efficiency is quite a concern in Haskell, after all the language performs on par with Java, and beats it in terms of memory consumption, but of course it's not C.

The answer to your question is pretty simple: the Prelude's nub requires only an Eq constraint, while any implementation based on Map or Set would also require either an Ord or Hashable.

like image 57
Nikita Volkov Avatar answered Nov 19 '22 08:11

Nikita Volkov


You're absolutely correct - nub is an O(n^2) algorithm. However, there are still reasons why you might want to use it instead of using a hashmap:

  • for small lists it still might be faster
  • nub only requires the Eq constraint; by comparison Data.Map requires an Ord constraint on keys and Data.HashMap requires a key type with both Hashable and Ord type classes
  • it's lazy - you don't have to run through the entire input list to start getting results

Edit: Slight correction on the third point -- you don't have to process the entire list to start getting results; you'll still have to examine every element of the input list (so nub won't work on infinite lists), but you'll start returning results as soon as you find a unique element.

like image 37
ErikR Avatar answered Nov 19 '22 08:11

ErikR


https://groups.google.com/forum/m/#!msg/haskell-cafe/4UJBbwVEacg/ieMzlWHUT_IJ

In my experience, "beginner" Haskell (including Prelude and the bad packages) simply ignores performance in many cases, in favor of simplicity.

Haskell performance is a complex problem to solve, so if you aren't experienced enough to search through Platform or Hackage for alternatives to the simple nub (and especially if your input is in a List just because you haven't thought about alternative structures), then Data.List.nub is likely not your only major performance problem and also you are probably writing code for a toy project where performance doesn't really matter.

You just have to have faith that when you get to building a large (in code or data) project, you will be more experienced and know how to set up your programs more efficiently.

In other words, don't worry about it, and assume that anything in Haskell 98 that comes from Prelude or base is likely to not be the most efficient way to solve a problem.

like image 4
misterbee Avatar answered Nov 19 '22 09:11

misterbee