 

What data structure to use for big data

I have an Excel sheet with a million rows. Each row has 100 columns and represents an instance of a class with 100 attributes; the column values are the values of those attributes.

What data structure is the most suitable here for storing the million instances?

Thanks

asked Dec 19 '25 by London guy

2 Answers

It really depends on how you need to access the data and what you want to optimize for, e.g. space vs. speed.

  • If you want to optimize for space, you could serialize and compress the data, but that is of little use if you also need to read and manipulate it.
  • If you access rows by index, the simplest structure is an array of arrays.
  • If you instead use an array of objects, where each object holds the 100 attributes, you get a better way to structure your code (encapsulation). A minimal sketch of both layouts follows this list.
  • If you need to query or search the data, it depends on the kind of queries; you may want to look at tree-based structures such as binary search trees (BSTs).
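Here is a minimal Java sketch of the two in-memory layouts mentioned above, assuming the values are numeric doubles; the Record class and its field names are illustrative placeholders, not anything prescribed by the question.

```java
public class LayoutSketch {
    // One row as a plain object: named fields give encapsulation,
    // at the cost of one object header per row.
    static class Record {
        double attr0;
        double attr1;
        // ... the remaining 98 attributes would go here
    }

    public static void main(String[] args) {
        // Layout 1: array of arrays, accessed purely by index.
        // 1,000,000 x 100 doubles is roughly 800 MB, so run with -Xmx1g or more.
        double[][] table = new double[1_000_000][100];
        table[42][7] = 3.14;                      // row 42, column 7

        // Layout 2: array of objects, accessed through named fields.
        Record[] records = new Record[1_000_000];
        records[42] = new Record();
        records[42].attr0 = 3.14;

        System.out.println(table[42][7] + " " + records[42].attr0);
    }
}
```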
answered Dec 22 '25 by alienhard


One million rows with 100 values, where each value uses 8 bytes, is only 800 MB (1,000,000 × 100 × 8 bytes), which will easily fit into the memory of most PCs, especially 64-bit ones. Try to make the type of each column as compact as possible.

A more efficient way of storing the data is by column, i.e. one array per column using a primitive data type, as sketched below. I suspect you don't even need to do this.
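A hypothetical column-oriented layout in Java; the column names and types are made up purely to show how choosing the most compact primitive type per column reduces the footprint.

```java
// Column-oriented sketch: one primitive array per column, using the most
// compact type that still fits that column's values. Column names and
// types are illustrative assumptions, not taken from the question.
public class ColumnStore {
    static final int ROWS = 1_000_000;

    final double[] price = new double[ROWS];  // needs full 8-byte precision
    final int[]    count = new int[ROWS];     // fits in 4 bytes
    final short[]  year  = new short[ROWS];   // fits in 2 bytes
    final byte[]   flag  = new byte[ROWS];    // fits in 1 byte

    // Writing a "row" means writing one slot in each column array.
    void set(int row, double p, int c, short y, byte f) {
        price[row] = p;
        count[row] = c;
        year[row]  = y;
        flag[row]  = f;
    }
}
```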

If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. hundreds of GB off-heap with 1 GB on heap); a memory-mapped sketch follows.
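A minimal sketch of the memory-mapped approach using java.nio; the file name column0.dat is an illustrative placeholder, and note that a single MappedByteBuffer is limited to 2 GB, so very large data sets would need several mapped regions.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class OffHeapColumn {
    public static void main(String[] args) throws Exception {
        final int rows = 1_000_000;
        final long bytes = (long) rows * Double.BYTES;   // 8 MB for one column

        try (RandomAccessFile file = new RandomAccessFile("column0.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // The mapped region lives outside the Java heap; the OS pages it
            // in and out as needed, so it can exceed available RAM.
            MappedByteBuffer column =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

            column.putDouble(42 * Double.BYTES, 3.14);    // write row 42
            System.out.println(column.getDouble(42 * Double.BYTES));
        }
    }
}
```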

answered Dec 22 '25 by Peter Lawrey


