I have a finite list of 100K+ elements. Currently I am checking membership with the Data.List function elem. The Data.List documentation also lists find and filter. Would either of these be faster than elem?
Just in case we haven't beat the dead horse quite enough...
There is a huge performance difference between different set representations. As an example (which might or might not match your use case), consider a list of 200K random elements and the task of checking membership for 200 random query elements.
I've implemented three obvious ways to do this: using elem over the list, converting to a HashSet and checking for membership, and a hybrid of a Bloom filter and a hash set. The benchmark shows the list solution is about three orders of magnitude slower than the hash set, which in turn is about 2x slower than the hybrid.
benchmarking list
mean: 460.7106 ms, lb 459.2952 ms, ub 462.8491 ms, ci 0.950
std dev: 8.741096 ms, lb 6.293703 ms, ub 12.23082 ms, ci 0.950
benchmarking hashset
mean: 175.2730 us, lb 173.9140 us, ub 177.0802 us, ci 0.950
std dev: 7.966790 us, lb 6.391454 us, ub 10.25774 us, ci 0.950
benchmarking bloom+hashset
mean: 88.22402 us, lb 87.35856 us, ub 89.66884 us, ci 0.950
std dev: 5.642663 us, lb 3.793715 us, ub 8.264222 us, ci 0.950
And the code:
import qualified Data.HashSet as Set
import Data.HashSet (Set)
import qualified Data.BloomFilter as BF
import qualified Data.BloomFilter.Easy as BF
import Data.BloomFilter (Bloom)
import Data.BloomFilter.Hash as H2
import Data.Hashable as H1
import Criterion.Main
import System.Random
-- A HashSet paired with a Bloom filter over the same elements: the filter
-- gives a quick "definitely not present" answer, and only possible hits are
-- confirmed against the HashSet.
data MySet a = MS (Set a) (Bloom a)

fromList :: (H2.Hashable a, H1.Hashable a, Ord a) => [a] -> MySet a
fromList as =
  let hs = Set.fromList as
      bf = BF.easyList 0.2 as     -- 0.2 = requested false-positive rate
  in hs `seq` bf `seq` MS hs bf

member :: (H2.Hashable a, H1.Hashable a, Ord a) => a -> MySet a -> Bool
member e (MS hs bf)
  | BF.elemB e bf = Set.member e hs   -- possible hit: confirm in the HashSet
  | otherwise     = False             -- filter says definitely absent
main = do
  list <- take 200000 `fmap` randomsIO :: IO [Int]
  xs   <- take 200 `fmap` randomsIO
  let hs  = Set.fromList list
      bhs = fromList list
  defaultMain
    [ bench "list"          $ nf (map (`elem` list)) xs
    , bench "hashset"       $ nf (map (`Set.member` hs)) xs
    , bench "bloom+hashset" $ nf (map (`member` bhs)) xs
    ]

-- an infinite list of random values from a fresh generator
randomsIO = randoms `fmap` newStdGen
In each case (elem, find, or filter) you're going to require a linear traversal of the list. If you're going to be checking for containment repeatedly, you should switch to a more efficient structure. If you just need to do a single lookup, then O(n) worst case is the best you can get: just look for your element as you create the list.
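To make that concrete, here is a minimal sketch (the function names are just illustrative) showing that for a one-off check elem, find, and filter all boil down to the same single, lazy, short-circuiting traversal, so none of them is asymptotically faster than the others:
import Data.List (find)

hasElem, hasViaFind, hasViaFilter :: Eq a => a -> [a] -> Bool
hasElem      x xs = x `elem` xs                                -- stops at first match
hasViaFind   x xs = maybe False (const True) (find (== x) xs)
hasViaFilter x xs = not (null (filter (== x) xs))              -- laziness: also stops early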
If your element type has an Ord instance, then you should use Set from the containers package (it's part of the Haskell Platform).
import qualified Data.Set as Set

mySet :: Set.Set Elem
mySet = Set.fromList bigList   -- expensive: one full pass over the list, O(n log n)

-- cheap by comparison: each membership test is O(log n)
checkElems :: [Elem] -> Set.Set Elem -> [Bool]
checkElems es s = map (\e -> Set.member e s) es
If Ord isn't possible, you may be able to use hashing instead via the data structures in unordered-containers. That package provides Data.HashSet, which is effectively identical to Data.Set except that it requires a (sometimes more liberal, sometimes faster) Hashable instance instead of Ord.
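A sketch of the same lookup pattern with a HashSet, assuming the hypothetical Elem type from above has a Hashable instance:
import qualified Data.HashSet as HashSet

myHashSet :: HashSet.HashSet Elem
myHashSet = HashSet.fromList bigList   -- one pass, hashing each element

checkElemsHashed :: [Elem] -> HashSet.HashSet Elem -> [Bool]
checkElemsHashed es s = map (\e -> HashSet.member e s) es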
If your Elem type is actually Int, then Data.IntSet is also a great choice.
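For that case, a quick sketch (bigIntList here is an assumed [Int] stand-in for your data):
import qualified Data.IntSet as IntSet

myIntSet :: IntSet.IntSet
myIntSet = IntSet.fromList bigIntList

checkInts :: [Int] -> IntSet.IntSet -> [Bool]
checkInts es s = map (\e -> IntSet.member e s) es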
Finally, it's worth noting that while Set is an optimized structure for checking membership, it throws away repeats. If repeats matter, you will want to look at other data types or some kind of preprocessing. Sets with repeats are often called bags and can be simulated (with similar performance characteristics) using the Data.Map, Data.HashMap, and Data.IntMap modules. In that case you store your list as a Data.Map.Map Elem Count and check for membership by seeing whether a particular key is present in the resulting map.
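A minimal sketch of that bag idea, with an illustrative Bag alias and helper names:
import qualified Data.Map.Strict as Map

type Bag a = Map.Map a Int   -- element -> number of occurrences

bagFromList :: Ord a => [a] -> Bag a
bagFromList = Map.fromListWith (+) . map (\e -> (e, 1))

bagMember :: Ord a => a -> Bag a -> Bool
bagMember = Map.member                 -- membership: is the key present at all?

bagCount :: Ord a => a -> Bag a -> Int
bagCount e = Map.findWithDefault 0 e   -- how many copies the list contained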