I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]
isin = falses(size(giant_list,1))
for i=1:size(giant_list,1)
    isin[i] = giant_list[i] in good_letters
end
Is there any vectorized (doubly-vectorized?) way to do this in Julia? In analogy with the basic operators I want to do something like
isin = giant_list .in good_letters
I realize this may not be possible, but I just wanted to make sure I wasn't missing something. I know I could probably use DefaultDict from DataStructures to do something similar, but I don't know of anything in Base.
Vectorized functions usually refer to those that take a vector and operate on the entire vector in an efficient way. Ultimately this will involve some form of loop, but as that loop is being performed in a low-level language such as C it can be highly efficient and tailored to the particular task.
A return type can be specified in the function declaration using the :: operator. This converts the return value to the specified type. This function will always return an Int8 regardless of the types of x and y.
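The function that this passage refers to is not shown here. As a minimal sketch of a declared return type, assuming a two-argument function named g (the name and body are illustrative):

```julia
# The ::Int8 annotation converts whatever the body returns to Int8.
function g(x, y)::Int8
    return x * y
end

r = g(1, 2)          # 1 * 2 is computed as Int64, then converted to Int8
println(typeof(r))   # the type is Int8 even though the inputs were Int64
```

If the result does not fit in an Int8 (for example g(100, 2)), the conversion throws an InexactError rather than silently truncating.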
The return keyword in Julia returns a value to the caller. It causes the enclosing function to exit immediately, so any expressions after the return statement are not executed. Without an explicit return, a function returns the value of the last expression it evaluated.
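To make the early-exit behavior concrete, here is a small sketch. The function name first_good is hypothetical and is not part of Base:

```julia
# Returns the first element of `items` that is a member of `good`,
# or `nothing` if there is no such element.
function first_good(items, good)
    for x in items
        if x in good
            return x      # exits the function immediately; the loop does not continue
        end
    end
    return nothing        # reached only if no element matched
end

found = first_good(["c", "a", "j"], ["a", "b"])
missing_case = first_good(["c", "j"], ["a", "b"])
```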
Every function in Julia is a generic function. A generic function is conceptually a single function, but consists of many definitions, or methods. The methods of a generic function are stored in a method table. Method tables (type MethodTable) are associated with TypeNames.
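As a brief illustration of one generic function with several methods, consider the sketch below. The function describe is hypothetical and defined only for this example:

```julia
# Two methods of the same generic function, selected by argument type.
describe(x::Integer) = "integer"
describe(x::AbstractString) = "string"

a = describe(1)      # dispatches to the Integer method
b = describe("a")    # dispatches to the AbstractString method

# `methods` lists the entries in the function's method table.
n = length(methods(describe))
```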
The indexin function does something similar to what you want:
indexin(a, b)
Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.
Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:
julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
true
false
false
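Note that the documentation quoted above describes older Julia versions. In Julia 1.0 and later, indexin marks non-members with nothing rather than 0, so the comparison with 0 no longer applies. A minimal sketch of the same idea for current versions, reusing the data from the question:

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# On Julia 1.x, entries for values not found in good_letters are `nothing`.
idx = indexin(giant_list, good_letters)

# Turn the indices into a Boolean mask: true wherever a match was found.
isin = idx .!== nothing
```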
The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:
function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end
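For reference, here is how vectorin might be exercised on the data from the question. This is only a usage sketch; the definition is repeated so the snippet runs on its own:

```julia
# Same definition as above: build a Set once, then test each element against it.
function vectorin(a, b)
    bset = Set(b)
    return [i in bset for i in a]
end

giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Expect true for "a" and false for "c" and "j".
isin = vectorin(giant_list, good_letters)
```

Building the Set up front means each membership test is a hash lookup instead of a scan of b, which should matter mainly when b is large.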
Only a limited set of names may be used as infix operators, so it's not possible to use vectorin as an infix operator.
There are a handful of modern (i.e. Julia v1.0) solutions to this problem:
First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:
julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
true
false
false
This same result can be achieved by broadcasting the infix ∈ (\in TAB) operator:
julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
true
false
false
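The wrapper matters because broadcasting otherwise tries to pair the two arrays element by element. The sketch below, using the question's data, compares the Ref form with the 1-element tuple form mentioned above and shows what happens with no wrapper at all:

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Ref makes good_letters behave like a scalar under broadcasting.
with_ref = giant_list .∈ Ref(good_letters)

# A 1-element tuple has the same effect.
with_tuple = in.(giant_list, (good_letters,))

# With no wrapper, broadcast tries to match a 3-element array against a
# 2-element array, which fails with a DimensionMismatch.
unwrapped_fails = try
    in.(giant_list, good_letters)
    false
catch err
    err isa DimensionMismatch
end
```

If the two arrays happened to have the same length, the unwrapped call would not fail but would compare elements pairwise, which is not the membership test the question asks for.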
Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.
julia> is_good1 = in(good_letters);
julia> is_good2(x) = x in good_letters;
julia> is_good1.(giant_list)
3-element BitArray{1}:
true
false
false
julia> is_good2.(giant_list)
3-element BitArray{1}:
true
false
false
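One place where the one-argument form can be convenient is as a predicate passed to higher-order functions such as filter or findall. A small sketch, again using the question's data:

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# in(good_letters) is a callable equivalent to x -> x in good_letters.
kept = filter(in(good_letters), giant_list)        # elements that are members
positions = findall(in(good_letters), giant_list)  # their indices in giant_list
```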
All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.