
Fastest way to read files in a multi-processing environment? C#

I have the following challenge:

I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. Each thread needs to read metadata from 3 objects describing how to process its work. The objects/data reside in a remote RavenDb, and even though RavenDb is very fast at retrieving objects via HTTP, it is still under considerable load from 30+ workers hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDb does not change.

I've decided to implement local storage caching. First, I read a tiny record which indicates whether the metadata has changed (it changes VERY rarely), and then, if local storage has the object cached, I read from local file storage instead of RavenDb. I'm using File.ReadAllText().
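For reference, this is roughly the check-then-read pattern I'm describing; the helper names (GetMetadataVersion, ReadCachedVersion, FetchFromRavenAndCache) and the cache layout are placeholders, not my actual code:

```csharp
using System.IO;

// Rough sketch of the current approach: a tiny version check against RavenDb,
// then a local file read if the cached copy is still current.
public string LoadMetadataJson(string key, string localStoragePath)
{
    string cachePath = Path.Combine(localStoragePath, key + ".json");

    string remoteVersion = GetMetadataVersion(key);            // tiny RavenDb read
    bool cacheIsCurrent = File.Exists(cachePath)
                          && ReadCachedVersion(key) == remoteVersion;

    if (cacheIsCurrent)
        return File.ReadAllText(cachePath);                    // local disk read

    return FetchFromRavenAndCache(key, cachePath);             // remote read + refresh cache
}
```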

This approach appears to be bogging the machine down, and processing slows down considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.

Is there any way I can have the OS help me out and cache those files? Or perhaps there is an alternative to caching this data on disk?

I'm looking at about 1,000 files of varying sizes, ranging from 100 KB to 10 MB, stored on each Cloud Role instance.

asked Dec 21 '16 by Igorek


1 Answer

Not a straight answer, but three possible options:

Use the built-in RavenDB caching mechanism

My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
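For example, with the 3.5 client, enabling aggressive caching looks roughly like this (see the linked docs for the exact knobs; the store setup and the Metadata type here are placeholders):

```csharp
// Aggressive caching per the RavenDB docs: reads inside the block are served
// from the client's cache for the given duration without re-checking the server.
using (documentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = documentStore.OpenSession())
{
    var metadata = session.Load<Metadata>("metadata/" + key);  // served from cache when possible
}
```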

The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.

One possible option is to divide the workload between the instances. For example:

  • Server A => fetch files that start with A-D
  • Server B => fetch files that start with E-H
  • Server C => ...

This would ensure that you optimize the cache on each server.
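A sketch of one way to do that routing: rather than alphabetical ranges, a stable hash of the document key gives a more even split. The instance index and instance count are assumed to come from your role configuration:

```csharp
// Hypothetical partitioning helper: route each document key to exactly one
// instance so that instance's local cache stays hot for its share of the files.
public static class WorkPartitioner
{
    // FNV-1a hash keeps the mapping stable across restarts
    // (unlike string.GetHashCode(), which can vary between processes).
    private static uint Hash(string key)
    {
        uint hash = 2166136261;
        foreach (char c in key)
            hash = (hash ^ c) * 16777619;
        return hash;
    }

    // True if this instance is responsible for the given document key.
    public static bool IsMine(string documentKey, int instanceIndex, int instanceCount)
    {
        return Hash(documentKey) % (uint)instanceCount == (uint)instanceIndex;
    }
}
```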

Get a bigger machine

If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:

  • Disk access
  • Deserialization of the JSON

For these issues, the only thing I can imagine would be to get bigger resources:

  • If it's the disk, use Premium Storage with SSDs.
  • If it's deserialization, get VMs with more CPU.

Cache files in RAM

Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need THAT much RAM: 1000 files * 10 MB is about 10 GB at the absolute worst, and likely far less since most files are smaller. Doing this would eliminate disk access and deserialization.
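A minimal sketch using System.Runtime.Caching.MemoryCache; the Metadata type, the loader delegate, and the 10-minute expiry are assumptions you would tune to how often your metadata actually changes:

```csharp
using System;
using System.Runtime.Caching;

// Hypothetical in-memory cache: keep deserialized metadata objects in RAM
// so repeated reads skip both disk I/O and JSON deserialization.
public static class MetadataCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static Metadata Get(string key, Func<Metadata> loadFromRaven)
    {
        var cached = Cache.Get(key) as Metadata;
        if (cached != null)
            return cached;                                  // cache hit: no disk, no deserialization

        var metadata = loadFromRaven();                     // single remote read on a miss
        Cache.Set(key, metadata, new CacheItemPolicy
        {
            // Expire periodically so rare metadata changes are eventually picked up.
            AbsoluteExpiration = DateTimeOffset.UtcNow.AddMinutes(10)
        });
        return metadata;
    }
}
```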

But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.

answered Nov 01 '22 by Kenneth