Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell: Best way to search a list of large size

I have a list of 100K+ elements, this is a finite list. Currently I am using the Data.List function elem. When looking at the Data.List information page there is also find and filter. Would one of these be faster than the elem function?

like image 249
tucker19 Avatar asked Mar 20 '23 02:03

tucker19


2 Answers

Just in case we haven't beat the dead horse quite enough...

There is a huge performance difference with different set representations. As an example (which might or might not match your use case) consider taking a list of 200K random elements and a computation to determine the membership of 200 random elements.

I've implemented three obvious ways to do this - using elem over the list, converting to a HashSet and checking for membership, and performing a hybrid of Bloom Filters and a Hash Set. The benchmark shows the list solution is 3 orders of magnitude slower than the hash set, which is about 2x slower than the hybrid.

benchmarking list
mean: 460.7106 ms, lb 459.2952 ms, ub 462.8491 ms, ci 0.950
std dev: 8.741096 ms, lb 6.293703 ms, ub 12.23082 ms, ci 0.950

benchmarking hashset
mean: 175.2730 us, lb 173.9140 us, ub 177.0802 us, ci 0.950
std dev: 7.966790 us, lb 6.391454 us, ub 10.25774 us, ci 0.950

benchmarking bloom+hashset
mean: 88.22402 us, lb 87.35856 us, ub 89.66884 us, ci 0.950
std dev: 5.642663 us, lb 3.793715 us, ub 8.264222 us, ci 0.950

And the code:

import qualified Data.HashSet as Set
import           Data.HashSet (Set)
import qualified Data.BloomFilter as BF
import qualified Data.BloomFilter.Easy as BF
import           Data.BloomFilter (Bloom)
import           Data.BloomFilter.Hash as H2
import           Data.Hashable as H1
import Criterion.Main
import System.Random

data MySet a = MS (Set a) (Bloom a)

fromList :: (H2.Hashable a, H1.Hashable a, Ord a) => [a] -> MySet a
fromList as =
    let hs = Set.fromList as
        bf = BF.easyList 0.2 as
    in hs `seq` bf `seq` MS hs bf

member :: (H2.Hashable a, H1.Hashable a, Ord a) => a -> MySet a -> Bool
member e (MS hs bf)
    | BF.elemB e bf = Set.member e hs
    | otherwise      = False

main = do
  list   <- take 200000 `fmap` randomsIO :: IO [Int]
  xs     <- take 200    `fmap` randomsIO
  let hs  = Set.fromList list
      bhs = fromList list
  defaultMain
        [ bench "list" $ nf (map (`elem` list)) xs
        , bench "hashset" $ nf (map (`Set.member` hs)) xs
        , bench "bloom+hashset" $ nf (map (`member` bhs)) xs
        ]

randomsIO = randoms `fmap` newStdGen
like image 80
Thomas M. DuBuisson Avatar answered Apr 02 '23 04:04

Thomas M. DuBuisson


In each case you're going to require a linear traversal of the list. If you're going to be checking for containment repeatedly you should change to a more efficient structure. If you just need to do a single lookup then O(n) worst case is the best you can get—just look for your element as you create them.

If your types are ordered (instantiate Ord) then you should use Set from the containers package (it's part of the Haskell Platform).

import qualified Data.Set as Set

mySet :: Set.Set Elems
mySet = Set.fromList bigList -- expensive, eventually requires a 1 linear traversal

-- cheaper!
checkElems :: [Elem] -> Set.Set Elems -> [Bool]
checkElems es s = map (\e -> Set.member e s) es

If Ord isn't possible, you may be able to using hashing instead via the data structures in unordered-containers. In that package we have Data.HashSet which is effectively identical to Data.Set except it requires the (sometimes more liberal, sometimes faster) Hashable instance instead of Ord.

If your Elem type is actually Int then Data.IntSet is also a great choice.

Finally, it's worth noting that while Set is an optimized structure for checking membership, it does throw away repeats. If repeats are valuable you will want to examine other data types or some kinds of preprocessing. Sets with repeats are often called Bags and can be simulated (with similar performance characteristics) by using the Data.Map, Data.HashMap, and Data.IntMap modules. In this case, you store your list as a Data.Map.Map Elem Count and check for membership by seeing if a particular key is being used in the result map.

like image 28
J. Abrahamson Avatar answered Apr 02 '23 04:04

J. Abrahamson