
SQL Database VS. Multiple Flat Files (Thousands of small CSV's)

We are designing an update to a current system (C++/CLI and C#). In the near future, the system will gather small (~1Mb) amounts of data from ~10K devices. Currently, each device's data is saved in a CSV file (a table), and all of these files are stored in a wide folder structure.

Data is only inserted (create a folder, create a file, or append to a file) and never updated or removed. Data processing is done by reading many CSVs into an external program (like Matlab). The data is mainly used for statistical analysis.

There is an option to start saving this data to an MS-SQL database instead. Processing time (reading the CSVs into the external program) can be up to a few minutes.

  • How should we choose which method to use?
  • Does one of the methods take significantly more storage than the other?
  • Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)

I'd appreciate your answers, Pros and Cons are welcome.

Thank you for your time.

Oren asked Oct 07 '22 11:10


2 Answers

Well, if you are using data in one CSV to look up data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.

Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
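To make the "folder structure as a DBMS" point concrete, here is a minimal sketch of how such a tree could map onto a single table. Everything specific in it is an assumption, not something from the question: the layout `<root>/<device_id>/<file>.csv`, the `timestamp` and `value` columns, the `readings` table name, and SQLite standing in for SQL Server so the example runs without a server.

```python
import csv
import sqlite3
from pathlib import Path


def load_csv_tree(root: Path, conn: sqlite3.Connection) -> int:
    """Append every CSV under root/<device_id>/ to one table; return rows inserted."""
    # Hypothetical schema: the folder name becomes a column instead of a path.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " device_id TEXT NOT NULL,"
        " ts TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    inserted = 0
    for csv_path in sorted(root.glob("*/*.csv")):
        device_id = csv_path.parent.name  # assumed: one folder per device
        with csv_path.open(newline="") as handle:
            reader = csv.DictReader(handle)  # assumed header: timestamp,value
            rows = [
                (device_id, row["timestamp"], float(row["value"]))
                for row in reader
            ]
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        inserted += len(rows)
    conn.commit()
    return inserted
```

Since the data is insert-only, the same append-style load should carry over to a real server through whichever database driver you already use; only the connection and the parameter placeholder style would change.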

Possible Pros:

  • Faster access
  • Easier to manage
  • Easier to expand should you need to
  • Easier to enforce data integrity
  • Easier to design more complex relationships

Possible Cons:

  • You would have to rewrite your existing code to use SQL Server instead of your current system
  • You may have to pay for SQL Server; you would have to check whether you can use Express

Good luck!

Abe Miessler answered Oct 19 '22 21:10


I'd like to try hitting those questions a bit out of order.

Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)

Immediately. The database is optimized (assuming you've done your homework on schema and indexing) to read data out at very high rates.
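The crossover point depends on disk, caching, and network latency to the server, so it is worth measuring on your own data rather than taking anyone's word for it. Below is a rough timing sketch, not a rigorous benchmark. The `readings` table, its columns, and the use of SQLite as a local stand-in for SQL Server are all assumptions made for illustration.

```python
import csv
import sqlite3
import time
from pathlib import Path


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def read_csv_files(paths):
    """Read every data row of every CSV file (header skipped) into one list."""
    rows = []
    for path in paths:
        with Path(path).open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header line, if the file has one
            rows.extend(reader)
    return rows


def read_table(conn):
    """Fetch every row of the hypothetical readings table."""
    return conn.execute("SELECT device_id, ts, value FROM readings").fetchall()
```

Calling `timed(read_csv_files, paths)` and `timed(read_table, conn)` for 10, 100, and 1,000 files' worth of data, several times each, should show where the two approaches diverge for your setup.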

Does one of the methods take significantly more storage than the other?

Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
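Storage is also easy to check empirically on a sample of your data. The sketch below compares the total size of a CSV tree with the size of a database holding the same rows. It assumes SQLite as a stand-in (its `page_count` and `page_size` pragmas give the database size); on SQL Server, the `sp_spaceused` stored procedure reports comparable figures.

```python
import sqlite3
from pathlib import Path


def csv_tree_bytes(root: Path) -> int:
    """Total size in bytes of all CSV files anywhere under root."""
    return sum(path.stat().st_size for path in root.rglob("*.csv"))


def sqlite_bytes(conn: sqlite3.Connection) -> int:
    """Size in bytes of the database, computed as page count times page size."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return page_count * page_size
```

Keep in mind that indexes add to the database side of the comparison, and that numeric columns stored in binary form are usually more compact than the same numbers written out as text.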

How should we choose which method to use?

Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.

It looks like you could pretty quickly end up scaling to levels where you'll definitely want the DB engine behind your data. When in doubt, creating a database is the safe bet, since with suitable indexes you'll still be able to query that 100 GB of data quickly.
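As an illustration of what the engine buys you for statistical work, here is a small sketch that pushes a per-device aggregate into the database instead of loading every file into the analysis tool. The `readings` table, its columns, the index name, and SQLite as a stand-in for SQL Server are assumptions for the example, not details from the question.

```python
import sqlite3


def per_device_stats(conn: sqlite3.Connection):
    """Return (device_id, row_count, mean_value) tuples, ordered by device_id."""
    # An index on (device_id, ts) lets per-device and time-range queries
    # avoid scanning the whole table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_readings_device"
        " ON readings (device_id, ts)"
    )
    return conn.execute(
        "SELECT device_id, COUNT(*), AVG(value)"
        " FROM readings GROUP BY device_id ORDER BY device_id"
    ).fetchall()
```

Only the summary rows leave the database, so the external program receives one line per device rather than the full raw data.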

Nick Vaccaro answered Oct 19 '22 21:10