
Are methods instantiated with classes, consuming lots of memory (in Scala)?

Situation

I am going to build a program (in Scala or Python - not yet decided) that does intensive data manipulation. I see two major approaches:

  1. Approach: Define a collection of the data, write my function, and send the entire dataset through the function.
  2. Approach: Define a data class that represents a single data entity and put the method (class member) on the data class. Parts of the method that should be flexible are passed to the method via a Scala function or a Python lambda.
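As a rough illustration (all names here are made up for the sketch), the two approaches might look like this in Python:

```python
# Approach 1: a free function applied to the whole collection.
def normalize_all(dataset, scale):
    return [value * scale for value in dataset]

# Approach 2: a data class per entity, with the flexible part
# of the behavior passed in as a lambda.
class DataObject:
    def __init__(self, values):
        self.values = values

    def transform(self, op):
        return [op(v) for v in self.values]

print(normalize_all([1, 2, 3], 2))         # [2, 4, 6]
obj = DataObject([1, 2, 3])
print(obj.transform(lambda v: v * 2))      # [2, 4, 6]
```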

Side question

I am not sure, but the first approach seems more like functional programming and the second more like OOP - is that right? By the way, I love both functional programming and OOP (some say they are opposites of each other, but Odersky tried his best to disprove that with Scala).

Main question

I prefer the second approach, because

  1. It seems more concise to me.
  2. It makes the program easier to distribute in a shared-nothing Big Data architecture, since it brings functionality to the data rather than data to the functionality, following the principle of data locality.

However, I worry that if I have a lot of data (and I do), I will have a lot of memory consumption because the method might have to be instantiated so many times.

  1. Question: Is that true for Scala/JVM - if not, how is it solved?
  2. Question: Is that true for Python - if not, how is it solved?

Follow-up question

This leads me to: which approach should I choose?

More Context

  • I have a lot of data (millions, potentially billions of data objects).
  • I do not have that many functions to implement - a ballpark figure would be about 10.
  • I expect a lot of calls to the methods though.
  • Let's say I have 100 calls per data entity, then I would have 100 * 1 million calls for the entire program.
  • My data class represents a single entity, not the entire dataset.
  • My worry is that with each instantiation of my DataObject class, the code of the method is duplicated, which would cost a lot of memory and processing power. I have no idea how the internals of the JVM and Python work in this regard, or whether this is actually true - that is what I am asking.

Here is a crude DataObject class (cleaned up into compilable Scala; the element type was not specified in the original snippet, so List[Any] is a placeholder):

class DataObject {

    val datavalues: List[Any] = Nil

    def mymethod(): Unit = {
        // ...
    }
}
Asked Oct 29 '22 by Make42

1 Answer

Which approach is best depends entirely on your problem. If you have only a few operations, functions are simpler. If you have many operations that depend on the type or features of the data, classes are more efficient.

Personally, I prefer having classes for the same type of data, to improve abstraction and modularity. Using classes requires you to think about what your data looks like, what is allowed on it, and what is appropriate. It enforces that you separate, compartmentalize, and understand what you are doing. Once you've done that, you can treat the classes like black boxes that just work.

I've seen many data-analysis programs fail because they just had functions working on arbitrary data. At first, it was simple computations. Then state needed to be preserved or cached, so data got appended to or modified directly. Then someone realized that if you did x before, you shouldn't do y later, so all sorts of flags, fields, and other things got tacked on, which only functions a, b, and d understood. Then someone added function f which extended that, while someone else added function k which extended it differently. That creates a cluster-foo that's impossible to understand, maintain, or trust to produce correct results.

So if you are unsure, do classes. You'll be happier in the end.


Concerning your second question, I can only answer for Python. However, many languages handle this similarly.

Regular methods in Python are defined on the class and created along with it. That means the actual function behind a method is shared by all instances, with no per-instance memory overhead. A bare instance is essentially a wrapped reference to its class, from which methods are looked up. Only things exclusive to an instance, such as its data, add notably to memory.
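This sharing is easy to verify in Python 3 (a minimal sketch): the bound methods you see on instances are created on access, but they all wrap the single function object stored on the class.

```python
class Foo:
    def p(self):
        pass

a, b = Foo(), Foo()

# Each access creates a fresh bound-method wrapper, but the
# underlying function object is the one stored on the class.
assert a.p.__func__ is b.p.__func__
assert a.p.__func__ is Foo.p
```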

Calling a method does add some overhead, because the method gets bound to the instance: the function is fetched from the class and the first parameter, self, gets bound to it.

# Method Call
$ python -m timeit -s 'class Foo():' -s ' def p(self):' -s '  pass' -s 'foo = Foo()' 'foo.p()'
10000000 loops, best of 3: 0.158 usec per loop
# Method Call of cached method
$ python -m timeit -s 'class Foo():' -s ' def p(self):' -s '  pass' -s 'foo = Foo()' -s 'p=foo.p' 'p()'
10000000 loops, best of 3: 0.0984 usec per loop
# Function Call
$ python -m timeit -s 'def p():' -s ' pass' 'p()'
10000000 loops, best of 3: 0.0846 usec per loop

However, practically any operation does this; you'll only notice the added overhead if your application does nothing but call your method, and the method itself also does nothing.

I've also seen people write data-analysis applications with so many levels of abstraction that they in fact mostly just called methods/functions. That is a code smell in general, regardless of whether you use methods or functions.
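To confirm that instances do not carry per-method copies of code, one can compare instance sizes directly (a quick CPython sketch; the exact byte count varies by Python version, but the equality holds):

```python
import sys

class Empty:
    pass

class Busy:
    def a(self): pass
    def b(self): pass
    def c(self): pass

# Methods live on the class object, not on instances, so adding
# methods does not grow the per-instance footprint.
print(sys.getsizeof(Empty()) == sys.getsizeof(Busy()))  # True
```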

So if you are unsure, do classes. You'll be happier in the end.

Answered Nov 11 '22 by MisterMiyagi