Julia: Passing a DataFrame to a function creates a pointer to the DataFrame?

I have a function in which I normalize the first N columns of a DataFrame. I want to return the normalized DataFrame but leave the original alone. Yet it seems the function mutates the passed DataFrame as well!

using DataFrames

function normalize(input_df::DataFrame, cols::Array{Int})
    norm_df = input_df
    for i in cols
        norm_df[i] = (input_df[i] - minimum(input_df[i])) / 
            (maximum(input_df[i]) - minimum(input_df[i]))
    end
    norm_df
end

using RDatasets
iris = dataset("datasets", "iris")
println("original df:\n", head(iris))

norm_df = normalize(iris, [1:4]);
println("should be the same:\n", head(iris))

Output:

original df:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species  |
|-----|-------------|------------|-------------|------------|----------|
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | "setosa" |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | "setosa" |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | "setosa" |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | "setosa" |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | "setosa" |
| 6   | 5.4         | 3.9        | 1.7         | 0.4        | "setosa" |

should be the same:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species  |
|-----|-------------|------------|-------------|------------|----------|
| 1   | 0.222222    | 0.625      | 0.0677966   | 0.0416667  | "setosa" |
| 2   | 0.166667    | 0.416667   | 0.0677966   | 0.0416667  | "setosa" |
| 3   | 0.111111    | 0.5        | 0.0508475   | 0.0416667  | "setosa" |
| 4   | 0.0833333   | 0.458333   | 0.0847458   | 0.0416667  | "setosa" |
| 5   | 0.194444    | 0.666667   | 0.0677966   | 0.0416667  | "setosa" |
| 6   | 0.305556    | 0.791667   | 0.118644    | 0.125      | "setosa" |
asked Jan 20 '15 03:01 by Anarcho-Chossid


1 Answer

Julia uses a behaviour known as "pass-by-sharing". From the docs (emphasis mine):

Julia function arguments follow a convention sometimes called “pass-by-sharing”, which means that values are not copied when they are passed to functions. Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values. Modifications to mutable values (such as Arrays) made within a function will be visible to the caller. This is the same behavior found in Scheme, most Lisps, Python, Ruby and Perl, among other dynamic languages.
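As a minimal sketch of what the quoted passage means, the snippet below uses only Base Julia arrays. The function names `rebind` and `mutate!` are hypothetical and chosen purely for illustration. It contrasts rebinding an argument name (invisible to the caller) with mutating the value it refers to (visible to the caller), and shows that a plain assignment such as `norm_df = input_df` creates a second name for the same object rather than a copy.

```julia
# Rebinding the argument name only changes what the local name refers to.
function rebind(v)
    v = [0, 0, 0]   # local name now points at a new array
    return v
end

# Mutating the value the argument refers to changes the caller's object.
function mutate!(v)
    v[1] = 99
    return v
end

a = [1, 2, 3]

rebind(a)    # a is unchanged: the caller's array was never touched
mutate!(a)   # a's first element is now 99

b = a        # plain assignment: b and a are the same array
c = copy(a)  # copy: c is a distinct array with equal contents

println(b === a)  # identity check: same object
println(c === a)  # identity check: different object
println(c == a)   # value check: equal contents
```

The line `norm_df = input_df` in the question behaves like `b = a` here, which is why writing to `norm_df` also changes `input_df`.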

In your particular case, what you appear to want is an entirely new and independent DataFrame for your normalize operation. There are two functions for doing this: copy and deepcopy. If the element type of every column in your DataFrame is immutable (e.g. Int, Float64, String, etc.), then copy is sufficient. However, if any column contains a mutable type, then you will need to use deepcopy. The function calls look like this:

norm_df = copy(input_df)     # Column types are immutable
norm_df = deepcopy(input_df) # At least one column type is mutable

Julia typically requires you to do these sorts of things explicitly, since creating an independent copy of a large DataFrame can be computationally expensive and Julia is a performance-oriented language.
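Putting this together, here is a hedged sketch of the question's function with the copy applied. It assumes a recent DataFrames.jl release, where columns are accessed as `df[!, col]` rather than the older `df[col]` form used in the question. The name `normalize_cols` and the small example frame are hypothetical, introduced only to keep the snippet self-contained and to avoid clashing with `LinearAlgebra.normalize`.

```julia
using DataFrames

# Return a min-max normalized copy of input_df; input_df itself is not modified.
function normalize_cols(input_df::DataFrame, cols)
    norm_df = copy(input_df)          # independent frame instead of a second name
    for c in cols
        col = input_df[!, c]          # read from the untouched original
        lo, hi = extrema(col)         # minimum and maximum in one pass
        norm_df[!, c] = (col .- lo) ./ (hi - lo)  # replace the column in the copy only
    end
    return norm_df
end

# Small illustrative frame (not the iris data from the question).
df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 40.0], label = ["x", "y", "z"])
scaled = normalize_cols(df, 1:2)

println(df)      # original values are preserved
println(scaled)  # columns a and b rescaled to the range [0, 1]
```

Because every column here holds immutable elements, copy is enough; deepcopy would only be needed if a column stored mutable objects that the function modified in place.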

For those who want more detail on the difference between copy and deepcopy, the docs say the following:

copy(x): Create a shallow copy of x: the outer structure is copied, but not all internal values. For example, copying an array produces a new array with identically-same elements as the original.

deepcopy(x): Create a deep copy of x: everything is copied recursively, resulting in a fully independent object. For example, deep-copying an array produces a new array whose elements are deep-copies of the original elements.
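The difference between the two definitions can be seen with a nested array in Base Julia. This is an illustrative sketch with hypothetical variable names, not DataFrame-specific code: the outer array plays the role of the container and the inner arrays play the role of mutable elements.

```julia
original = [[1, 2], [3, 4]]

shallow = copy(original)      # new outer array, same inner arrays
deep = deepcopy(original)     # new outer array and new inner arrays

# Mutating an inner array through the shallow copy also changes `original`,
# because both outer arrays hold the same inner array.
shallow[1][1] = 100

# Replacing a slot of the outer array affects only the shallow copy itself.
shallow[2] = [0, 0]

println(original)  # first inner array changed, second untouched
println(shallow)
println(deep)      # fully independent of both
```

After these two operations `original` is `[[100, 2], [3, 4]]` while `deep` is still `[[1, 2], [3, 4]]`. The second operation is the one the normalize function performs (replacing a whole column), which is why a shallow copy is enough there.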

The DataFrame type is array-like, so deepcopy is necessary if the elements are mutable and you intend to modify them in place. If you're unsure, use deepcopy (although it will be slower).

A related SO question is here.

answered Sep 30 '22 18:09 by Colin T Bowers