I have a function in which I normalize the first N columns of a DataFrame. I want to return the normalized DataFrame but leave the original alone. Yet it seems that the function mutates the passed DataFrame as well!
using DataFrames
function normalize(input_df::DataFrame, cols::Array{Int})
    norm_df = input_df
    for i in cols
        norm_df[i] = (input_df[i] - minimum(input_df[i])) /
                     (maximum(input_df[i]) - minimum(input_df[i]))
    end
    norm_df
end
using RDatasets
iris = dataset("datasets", "iris")
println("original df:\n", head(iris))
norm_df = normalize(iris, [1:4]);
println("should be the same:\n", head(iris))
Output:
original df:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|----------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
should be the same:
6x5 DataFrame
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|----------|
| 1 | 0.222222 | 0.625 | 0.0677966 | 0.0416667 | "setosa" |
| 2 | 0.166667 | 0.416667 | 0.0677966 | 0.0416667 | "setosa" |
| 3 | 0.111111 | 0.5 | 0.0508475 | 0.0416667 | "setosa" |
| 4 | 0.0833333 | 0.458333 | 0.0847458 | 0.0416667 | "setosa" |
| 5 | 0.194444 | 0.666667 | 0.0677966 | 0.0416667 | "setosa" |
| 6 | 0.305556 | 0.791667 | 0.118644 | 0.125 | "setosa" |
Julia uses a behaviour known as "pass-by-sharing". From the docs (emphasis mine):
Julia function arguments follow a convention sometimes called “pass-by-sharing”, which means that values are not copied when they are passed to functions. Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values. Modifications to mutable values (such as Arrays) made within a function will be visible to the caller. This is the same behavior found in Scheme, most Lisps, Python, Ruby and Perl, among other dynamic languages.
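To make the quoted passage concrete, here is a minimal sketch using only Base arrays. The function names rebind and mutate! are made up for illustration. It contrasts rebinding an argument name (invisible to the caller) with mutating the object it refers to (visible to the caller), and shows that plain assignment, as in norm_df = input_df, only creates a second name for the same object.

```julia
# Rebinding the argument name inside a function does not affect the caller.
function rebind(v)
    v = [0, 0, 0]   # `v` now refers to a brand-new array; the caller's array is untouched
    return v
end

# Mutating the object the argument refers to IS visible to the caller.
function mutate!(v)
    v[1] = 99       # writes into the very array the caller passed in
    return v
end

a = [1, 2, 3]

rebind(a)
a_after_rebind = copy(a)   # snapshot of `a` taken after the rebind call

mutate!(a)                 # the first element of `a` is changed in place

# Plain assignment creates an alias, not a copy.
b = a
b[2] = -1                  # also changes `a`, because `b` and `a` are the same array
```

After this runs, a_after_rebind still holds the original values, while a reflects both the change made inside mutate! and the change made through the alias b.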
In your particular case, what you appear to want is an entirely new and independent DataFrame for your normalize operation. There are two functions for doing this: copy and deepcopy. If the element types of all your DataFrame columns are immutable (e.g. Int, Float64, String, etc.), then copy is sufficient. However, if any of the columns contains a mutable type, you will need to use deepcopy. The function calls look like this:
norm_df = copy(input_df) # Column types are immutable
norm_df = deepcopy(input_df) # At least one column type is mutable
Julia typically will require you to do these sorts of things explicitly since creating an independent copy of a large data-frame can be computationally expensive and Julia is a performance-oriented language.
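Putting this together, a non-mutating version of the function from the question might look like the sketch below. It assumes a recent DataFrames.jl release (1.x), where columns are indexed as df[!, i] rather than the older df[i] used in the question, and where arithmetic on whole columns needs explicit broadcasting. The name normalize_copy and the small example table are made up for illustration.

```julia
using DataFrames

# Hypothetical non-mutating variant of the question's `normalize`.
# Assumes DataFrames.jl 1.x indexing syntax (`df[!, i]`).
function normalize_copy(input_df::DataFrame, cols::AbstractVector{<:Integer})
    norm_df = copy(input_df)              # independent DataFrame; the caller's data is left alone
    for i in cols
        col = input_df[!, i]              # read the original column (no copy is made here)
        lo, hi = extrema(col)             # minimum and maximum in a single pass
        norm_df[!, i] = (col .- lo) ./ (hi - lo)   # replace column i in the copy only
    end
    return norm_df
end

# Small made-up table: two numeric columns and one string column.
df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 40.0], label = ["x", "y", "z"])
scaled = normalize_copy(df, 1:2)
```

Here df keeps its original values, while scaled holds the rescaled numeric columns and an unchanged label column.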
For those who want more detail on the difference between copy and deepcopy, note the following, again from the docs:
copy(x): Create a shallow copy of x: the outer structure is copied, but not all internal values. For example, copying an array produces a new array with identically-same elements as the original.
deepcopy(x): Create a deep copy of x: everything is copied recursively, resulting in a fully independent object. For example, deep-copying an array produces a new array whose elements are deep-copies of the original elements.
Type DataFrame is array-like, and so deepcopy is necessary if the elements are mutable. If you're unsure, use deepcopy (although it will be slower).
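The difference only shows up when the elements themselves are mutable. As a rough illustration, the sketch below uses a plain Vector of Vectors from Base as a stand-in for a container with mutable elements. The variable names are arbitrary.

```julia
nested = [[1, 2], [3, 4]]    # a Vector whose elements are themselves mutable Vectors

shallow = copy(nested)       # new outer array, but the inner arrays are shared
deep    = deepcopy(nested)   # new outer array AND new inner arrays

shallow[1][1] = 100          # mutates an inner array that `nested` also refers to
deep[2][1]    = 200          # touches only the deep copy

push!(shallow, [5, 6])       # the outer structure is independent, so `nested` keeps its length
```

After this runs, nested[1][1] has changed through the shallow copy, whereas nothing done to deep is visible in nested.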
A related SO question is here.