Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is HashSet<T>.Contains() efficient with large lists, multiple threads?

I am writing a multi-threaded program to scrape a certain site and collect ID's. It is storing these ID's in a shared static List<string> object.

When any item is added to the List<string>, it is first checked against a HashSet<string> which contains a blacklist of already collected ID's.

I do this as follows:

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    lock (IDList)
    {
        if (IsIDBlacklisted(ID))
            return;
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    lock (Blacklist)
    {
        if (Blacklist.Contains(ID))
            return true;
    }
    return false;
 }

The Blacklist is saved to a file after finishing and is loaded every time the program starts, therefore, it will get pretty large over time (up to 50k records). Is there a more efficient way to not only store this blacklist, but also to check each ID against it?

Thanks!

like image 292
blizz Avatar asked Dec 05 '25 03:12

blizz


2 Answers

To improve performance try to use ConcurrentBag<T> collection. Also there is no need to lock BlackList because it's not being modified e.g.:

private static HashSet<string> Blacklist = new HashSet<string>();
private static ConcurrentBag<string> IDList = new ConcurrentBag<string>();

public static void AddIDToIDList(string ID)
{
    if (Blacklist.Contains(ID))
    {
        return;
    }

    IDList.Add(ID);
}
like image 110
Kirill Polishchuk Avatar answered Dec 06 '25 16:12

Kirill Polishchuk


Read operations are thread safe on HashSet, as long as Blacklist is not being modified you don't need to lock on it. Also you should lock inside the blacklist check so the lock is taken less often, this also will increase your performance.

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    return Blacklist.Contains(ID);
}

If Blacklist is being modified the best way to lock around it is using a ReaderWriterLock (use the slim version if you are using a newer .NET)

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
private static ReaderWriterLockSlim BlacklistLock = new ReaderWriterLockSlim();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    BlacklistLock.EnterReadLock();
    try
    {
        return Blacklist.Contains(ID);
    }
    finally
    {
        BlacklistLock.ExitReadLock();
    }
}

public static bool AddToIDBlacklist(string ID)
{
    BlacklistLock.EnterWriteLock();
    try
    {
        return Blacklist.Add(ID);
    }
    finally
    {
        BlacklistLock.ExitWriteLock();
    }
}
like image 20
Scott Chamberlain Avatar answered Dec 06 '25 18:12

Scott Chamberlain