
How can I approximate the size of a data structure in Scala?

I have a query that returns around 6 million rows, which is too many to process in memory all at once.

Each row is returned as a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF-8.

How can I work out the maximum size of one of these tuples, and more generally, how can I approximate the size of a Scala data structure like this?

I've got 6 GB on the machine I'm using. However, the data is being read from the database using scala-query into Scala's Lists.

Squidly asked Jun 26 '12 14:06




2 Answers

Scala objects follow approximately the same rules as Java objects, so any information on Java object sizes applies. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 extra bytes per pointer, but it may be less if the JVM is using compressed pointers, which I think it now does by default.)

I'll assume a 64-bit machine without compressed pointers (the worst case). A Tuple3 then has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to a multiple of 8, for 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.) A String is 32 bytes, IIRC, plus the array for its data, which is 16 bytes plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so that's 32 bytes. All told, it's on the order of 120 bytes plus 2 per character, which at ~20 characters is ~160 bytes.
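The arithmetic above can be written out as a small sketch. The constants are the approximate figures quoted in this answer (64-bit JVM, no compressed pointers), not values read from a running JVM, and the names `TupleSizeEstimate`, `align8` and `rowBytes` are made up for illustration.

```scala
// Back-of-the-envelope estimate for one Tuple3[String, Int, java.sql.Timestamp].
// All constants are the approximations quoted in the answer, not measurements.
object TupleSizeEstimate {
  val PointerBytes = 8          // 64-bit JVM, no compressed pointers
  val ObjectOverheadBytes = 12  // approximate per-object overhead

  // Round up to the next multiple of 8 (assumed object alignment).
  def align8(n: Int): Int = (n + 7) / 8 * 8

  // Tuple3 shell: two pointers + one Int + object overhead, aligned.
  val tupleShellBytes: Int = align8(2 * PointerBytes + 4 + ObjectOverheadBytes)

  val intStubBytes = 8          // extra object for the non-specialized Int
  val stringBytes = 32          // String object itself
  val charArrayHeaderBytes = 16 // array header; data is 2 bytes per char
  val timestampBytes = 32       // java.sql.Timestamp

  // Estimated bytes for one row whose String holds `chars` characters.
  def rowBytes(chars: Int): Int =
    tupleShellBytes + intStubBytes + stringBytes +
      charArrayHeaderBytes + 2 * chars + timestampBytes

  def main(args: Array[String]): Unit = {
    println(s"fixed part: ${rowBytes(0)} bytes")
    println(s"20-char row: ${rowBytes(20)} bytes")
  }
}
```

With these figures the fixed part comes to 120 bytes and a 20-character row to 160 bytes, matching the totals in the paragraph above.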

Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (my estimate above has been corrected using this measurement so that it matches; I had several small errors before).

Rex Kerr answered Sep 19 '22 13:09


How much memory have you got at your disposal? 6 million instances of a triple is really not very much!

Each reference has an overhead of either 4 or 8 bytes, depending on whether you are running a 32-bit JVM or a 64-bit JVM without compressed "oops" (compressed oops are the default in JDK7 for heaps under 32 GB).

So your triple has 3 references (there may be an extra one due to specialisation, so you might get 4). Your Timestamp is a wrapper (reference) around a long (8 bytes). Your Int will be specialized (i.e. an underlying int), which adds another 4 bytes. The String is 20 x 2 bytes. So you have a worst case of well under 100 bytes per row: 10 rows per kB, 10,000 rows per MB. You can therefore comfortably process your 6 million rows in under 1 GB of heap.
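As a quick check of that budget, here is a minimal sketch that multiplies a per-row estimate by the row count. The object and method names are hypothetical, megabytes are taken as 10^6 bytes, and the overhead of the List holding the rows is not counted.

```scala
// Rough heap budget: rows * bytes-per-row, ignoring collection overhead.
object HeapBudget {
  // Total bytes needed for `rows` rows at `bytesPerRow` bytes each.
  def totalBytes(rows: Long, bytesPerRow: Long): Long = rows * bytesPerRow

  // Convert bytes to megabytes (1 MB = 1,000,000 bytes here).
  def toMB(bytes: Long): Double = bytes / 1e6

  // How many rows fit in one megabyte.
  def rowsPerMB(bytesPerRow: Long): Long = 1000000L / bytesPerRow

  def main(args: Array[String]): Unit = {
    val rows = 6000000L
    // 100 bytes/row is the upper bound used in this answer;
    // 160 bytes/row is the figure from the other answer.
    for (perRow <- Seq(100L, 160L)) {
      val mb = toMB(totalBytes(rows, perRow))
      println(s"$perRow bytes/row: ${rowsPerMB(perRow)} rows per MB, $mb MB total")
    }
  }
}
```

At 100 bytes per row this gives 600 MB for 6 million rows, and at the 160 bytes per row from the other answer it gives 960 MB, so both estimates fit in roughly 1 GB of heap.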

Frankly, I think I've made a mistake somewhere, because we comfortably process several million rows of about twenty fields each (including decimals, Strings, etc.) every day in this space.

oxbow_lakes answered Sep 20 '22 13:09