I have a query that returns around 6 million rows, which is too many to process in memory all at once. Each row is a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters of UTF-8. How can I work out the maximum size of one of these tuples and, more generally, how can I approximate the size of a Scala data structure like this? The machine I'm using has 6 GB of memory. However, the data is being read from the database using scala-query into Scala Lists.

How can I approximate the size of a data structure in scala?

2 Answers

Scala objects follow approximately the same rules as Java objects, so any information on Java object sizes applies. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 extra bytes per pointer, but it may be less if the JVM is using compressed pointers, which I think it does by default now.)

I'll assume a 64-bit machine without compressed pointers (the worst case). A Tuple3 then has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded up to a multiple of 8, for 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.) String is 32 bytes, IIRC, plus the array for the data, which is 16 bytes plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so that's 32 bytes. All told, it's on the order of 120 bytes plus 2 per character, which at ~20 characters is ~160 bytes.
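The arithmetic above can be written out as a small sketch. The component sizes are copied from the estimate in this answer (64-bit JVM, no compressed pointers); the object and method names are made up for illustration, and the numbers are rough figures rather than guarantees for any particular JVM.

```scala
object TupleSizeEstimate {
  // Round a raw byte count up to the JVM's 8-byte object alignment.
  def align8(bytes: Int): Int = (bytes + 7) / 8 * 8

  // Figures taken from the estimate above (64-bit JVM, no compressed pointers).
  val tupleBody: Int     = align8(2 * 8 + 4 + 12) // two pointers + Int + ~12 bytes overhead
  val intStub: Int       = 8                      // stub for the non-specialized Int
  val stringObject: Int  = 32                     // the String object itself
  val charArrayBase: Int = 16                     // array overhead for the character data
  val timestamp: Int     = 32                     // java.sql.Timestamp

  // Estimated bytes for one (String, Int, Timestamp) tuple holding `chars` characters.
  def estimate(chars: Int): Int =
    tupleBody + intStub + stringObject + charArrayBase + 2 * chars + timestamp

  def main(args: Array[String]): Unit =
    println(s"~${estimate(20)} bytes per tuple at 20 characters")
}
```

With zero characters this sums to the 120-byte base, and 20 characters add 40 bytes on top of that.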

Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
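If setting up that kind of measurement is inconvenient, a cruder check is to compare heap usage before and after allocating many sample tuples. This is not the method from the linked answer, only a rough sketch: `System.gc()` is just a hint, the result includes the array slot that holds each tuple, and the names `MeasureTupleSize` and `bytesPerTuple` are invented here.

```scala
import java.sql.Timestamp

object MeasureTupleSize {
  // Heap currently in use, after asking the JVM to collect garbage.
  // System.gc() is only a request, so the figure is approximate.
  private def usedHeap(): Long = {
    val rt = Runtime.getRuntime
    System.gc()
    Thread.sleep(100)
    rt.totalMemory() - rt.freeMemory()
  }

  // Allocate `n` sample tuples, keep them reachable, and return the average bytes per tuple.
  def bytesPerTuple(n: Int): Double = {
    val before = usedHeap()
    val rows = new Array[(String, Int, Timestamp)](n)
    var i = 0
    while (i < n) {
      // A 20-character string built at run time, so every row owns its own character data.
      rows(i) = ("%020d".format(i), i, new Timestamp(i.toLong))
      i += 1
    }
    val after = usedHeap()
    // Use the array after measuring so it cannot be collected before `after` is taken.
    require(rows(n - 1) != null)
    (after - before).toDouble / n
  }

  def main(args: Array[String]): Unit =
    println(f"~${bytesPerTuple(200000)}%.0f bytes per tuple (includes the array slot)")
}
```

Expect the number to vary between runs and between JVMs. With compressed pointers enabled, or on JDK 9 and later where Latin-1 strings are stored at one byte per character, it should come out below the worst-case estimate.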

182

answered Sep 19 '22 13:09

Rex Kerr

How much memory have you got at your disposal? 6 million instances of a triple is really not very much!

Each reference costs either 4 or 8 bytes, depending on whether you are running a 32- or 64-bit JVM. The 8-byte figure applies to 64-bit JVMs without compressed "oops"; compressed oops are the default in JDK 7 for heaps under 32 GB.

So your triple has 3 references (there may be extra ones due to specialisation, so you might get 4). Your Timestamp is a wrapper (reference) around a long (8 bytes). Your Int will be specialized (i.e. an underlying int), so that is another 4 bytes. The String is 20 x 2 bytes. So you have a worst case of well under 100 bytes per row: at least 10 rows per KB, or 10,000 rows per MB, which puts 6 million rows at no more than about 600 MB. You can comfortably process your 6 million rows in under 1 GB of heap.
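As a quick sanity check on that arithmetic, here is a minimal sketch that multiplies a per-row size by the row count. The names `HeapBudget`, `totalBytes` and `toMB` are hypothetical, and the calculation ignores the List cells that hold the rows, so treat the result as a lower bound on what the whole collection needs.

```scala
object HeapBudget {
  // Total bytes needed to hold `rows` rows at `bytesPerRow` bytes each.
  def totalBytes(rows: Long, bytesPerRow: Long): Long = rows * bytesPerRow

  // Convert bytes to decimal megabytes for easier reading.
  def toMB(bytes: Long): Double = bytes / 1e6

  def main(args: Array[String]): Unit = {
    val rows = 6000000L
    // 100 bytes per row is the upper bound used in this answer.
    println(s"${toMB(totalBytes(rows, 100L))} MB at 100 bytes per row")
    // 160 bytes per row is the measured worst case quoted in the other answer.
    println(s"${toMB(totalBytes(rows, 160L))} MB at 160 bytes per row")
  }
}
```

Either figure stays under 1 GB for 6 million rows, which is consistent with the conclusion above.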

Frankly, I think I've made a mistake here, because we comfortably process several million rows a day, each with about twenty fields (including decimals, Strings etc.), in this space.

answered Sep 20 '22 13:09

oxbow_lakes


Tags: performance, jvm, scala

Squidly
