Vectorized "in" function in julia?

I often want to loop over a long array or a column of a DataFrame and, for each item, check whether it is a member of another array. Currently I write an explicit loop:

giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]
isin = falses(size(giant_list,1))
for i=1:size(giant_list,1)
    isin[i] = giant_list[i] in good_letters
end

Is there a vectorized (doubly-vectorized?) way to do this in Julia? By analogy with the dotted basic operators, I would like to write something like

isin = giant_list .in good_letters

I realize this may not be possible, but I wanted to make sure I wasn't missing something. I know I could probably use DefaultDict from DataStructures to do something similar, but I don't know of anything in Base.

ARM asked Apr 15 '15 21:04


2 Answers

The indexin function does something similar to what you want:

indexin(a, b)

Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.

Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:

julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
  true
 false
 false
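
A note for newer Julia versions: the docstring quoted above describes the older behaviour with a 0 sentinel. In Julia 1.0 and later, indexin returns nothing for elements of a that are not in b, so the comparison against 0 no longer applies. A minimal sketch of the same idea under that assumption, reusing the names from the question:

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# On Julia >= 1.0, non-members are marked with `nothing` instead of 0
idx = indexin(giant_list, good_letters)

# Turn the indices into a Boolean mask by testing against `nothing`
isin = idx .!== nothing
```

Here isin has one Boolean entry per element of giant_list, just as the `.> 0` version did on older releases.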

The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:

function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end

Only a limited set of names may be used as infix operators, so you cannot use a function like vectorin in infix position.

mbauman answered Oct 21 '22 14:10


There are a handful of modern (i.e. Julia v1.0) solutions to this problem:

First, broadcast in over giant_list while treating good_letters as a single scalar argument. Rather than wrapping it in a 1-element tuple or array, you can wrap it in a Ref object:

julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
  true
 false
 false

The same result can be achieved by broadcasting the infix ∈ operator (typed as \in followed by Tab):

julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
  true
 false
 false
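
To see why the Ref wrapper matters in both forms above, here is a small sketch, again with the question's example data. Without a wrapper, broadcasting tries to pair the elements of the two arrays and fails because their lengths differ. The 1-element tuple is shown as the alternative wrapper mentioned earlier.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Without a wrapper, broadcast pairs elements of both arrays; lengths 3 and 2 do not match
mismatch = try
    in.(giant_list, good_letters)
    false
catch err
    err isa DimensionMismatch
end

# Ref (or a 1-element tuple) makes good_letters behave like a scalar in the broadcast
with_ref = in.(giant_list, Ref(good_letters))
with_tuple = in.(giant_list, (good_letters,))
```

Both wrapped calls test every element of giant_list against the whole of good_letters.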

Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.

julia> is_good1 = in(good_letters);
       is_good2(x) = x in good_letters;

julia> is_good1.(giant_list)
3-element BitArray{1}:
  true
 false
 false

julia> is_good2.(giant_list)
3-element BitArray{1}:
  true
 false
 false

All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.
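
If good_letters is large, the Set idea from the other answer can be combined with the broadcast syntax. This is only a sketch: the name good_set is introduced here for illustration, and whether the conversion pays off depends on the sizes involved.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Build the hash set once so each membership test avoids a linear scan of the array
good_set = Set(good_letters)

# Ref keeps the set from being iterated by broadcast; each element is tested against it
isin = giant_list .∈ Ref(good_set)
```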

Harrison Grodin answered Oct 21 '22 12:10