
Lots of small files or a couple huge ones?

In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as many as a few million) or a couple (ten or so) huge (several gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).

I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.

Thanks in advance, and let me know if I need to be more specific.


EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
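To make the large-files-plus-index idea concrete, here is a minimal Python sketch; the file names (data.bin, index.json), the pack helper, and the keys are all hypothetical, not part of the question. Each small record is appended to one big data file and its (offset, length) is recorded in a separate index, so only a few KB ever need to be read back at a time.

import json

def pack(records, data_path="data.bin", index_path="index.json"):
    index = {}
    with open(data_path, "wb") as data:
        for key, payload in records.items():
            offset = data.tell()                # byte position where this record starts
            data.write(payload)
            index[key] = (offset, len(payload))
    with open(index_path, "w") as idx:
        json.dump(index, idx)                   # the index stays small and is loaded once

pack({"foo": b"first record", "bar": b"second record"})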

asked by Sasha Chedygov


2 Answers

There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.

Let's say you are looking for a string of text contained in a text file. Searching a single 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.

Each file-open operation takes time. A large file only has to be opened once.

And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.

...Again, these are generalizations without knowing more about your specific application.
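As a rough illustration of the per-open cost (a hedged sketch, not a claim about any particular filesystem), here is a small Python timing comparison. It assumes a hypothetical directory small_files/ full of small files and a hypothetical big_file.dat holding the same data concatenated; both names are made up.

import os
import time

SMALL_DIR = "small_files"   # hypothetical directory of many small files
BIG_FILE = "big_file.dat"   # hypothetical file holding the same data concatenated
NEEDLE = b"some string"

# Many small files: one open() per file.
start = time.time()
hits = 0
for name in os.listdir(SMALL_DIR):
    with open(os.path.join(SMALL_DIR, name), "rb") as f:
        hits += f.read().count(NEEDLE)
print("many small files:", round(time.time() - start, 3), "s,", hits, "hits")

# One large file: a single open(), streamed in 1 MB chunks.
# (Ignores matches that straddle a chunk boundary; fine for a rough timing sketch.)
start = time.time()
hits = 0
with open(BIG_FILE, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        hits += chunk.count(NEEDLE)
print("one large file:  ", round(time.time() - start, 3), "s,", hits, "hits")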

answered by Robert Cartaino


It depends, really. Different filesystems are optimized in different ways, but in general small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of them; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations to move around inside it.
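A minimal Python sketch of that open-once-then-seek pattern, assuming the hypothetical data.bin / index.json layout sketched under the question (the file names and keys are made up):

import json

with open("index.json") as idx:
    index = json.load(idx)            # key -> [offset, length]

with open("data.bin", "rb") as data:  # a single open() serves many reads
    for key in ("foo", "bar"):
        offset, length = index[key]
        data.seek(offset)             # jump straight to the record
        print(key, data.read(length))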

If you go for the lots-of-files solution, I suggest a structure like

b/a/bar
b/a/baz
f/o/foo

because there are limits on the number of files you can put in a single directory.
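A minimal Python sketch of that fan-out layout, assuming names are at least two characters long (the root directory name and the payloads are hypothetical):

import os

def fanout_path(root, name):
    # e.g. fanout_path("store", "foo") -> "store/f/o/foo"
    return os.path.join(root, name[0], name[1], name)

def store(root, name, payload):
    path = fanout_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)   # creates store/b/a/, store/f/o/, ...
    with open(path, "wb") as f:
        f.write(payload)

store("store", "bar", b"...")   # -> store/b/a/bar
store("store", "baz", b"...")   # -> store/b/a/baz
store("store", "foo", b"...")   # -> store/f/o/foo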

answered by Stefano Borini