How to use PyCall in Julia to convert Python output to Julia DataFrame

I would like to retrieve some data from Quandl and analyse it in Julia. Unfortunately, there is no official Julia API for this (yet). I am aware of this solution, but it is quite limited in functionality and doesn't follow the same syntax as the official Python API.

I thought it would be a smart thing to use PyCall to retrieve the data using the official Python API from within Julia. This does yield an output, but I'm not sure how I can convert it to a format that I would be able to use within Julia (ideally a DataFrame).

I have tried the following.

using PyCall, DataFrames
@pyimport quandl

data = quandl.get("WIKI/AAPL", returns = "pandas");

Julia converts this output to a Dict{Any,Any}. When using returns = "numpy" instead of returns = "pandas", I end up with a PyObject rec.array.
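
For what it's worth, here is a quick way to inspect the converted object (a small sketch; the keys appear to be the pandas column labels):

typeof(data)           # Dict{Any,Any}, as described above
collect(keys(data))    # the pandas column labels, e.g. "Open", "High", ...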

How can I turn data into a Julia DataFrame like the one Quandl.jl would return? Note that Quandl.jl is not an option for me because it doesn't support automatic retrieval of multiple assets and lacks several other features, so it's essential that I can use the Python API.

Thank you for any suggestions!

asked Mar 06 '17 by Constantin


2 Answers

Here's one option:

First, extract the column names from your data object:

julia> colnames = map(Symbol, data[:columns])
12-element Array{Symbol,1}:
 :Open                
 :High                
 :Low                 
 :Close               
 :Volume              
 Symbol("Ex-Dividend")
 Symbol("Split Ratio")
 Symbol("Adj. Open")  
 Symbol("Adj. High")  
 Symbol("Adj. Low")   
 Symbol("Adj. Close") 
 Symbol("Adj. Volume")

Then pour all your columns into a DataFrame:

julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)

6×12 DataFrames.DataFrame
│ Row │ Open  │ High  │ Low   │ Close │ Volume   │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1   │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0         │ 1.0         │
│ 2   │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0         │ 1.0         │
│ 3   │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0         │ 1.0         │
│ 4   │ 25.87 │ 26.0  │ 25.87 │ 25.87 │ 385900.0 │ 0.0         │ 1.0         │
│ 5   │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0         │ 1.0         │
│ 6   │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0         │ 1.0         │

│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1   │ 0.428364  │ 0.430152  │ 0.428364 │ 0.428364   │ 1.17258e8   │
│ 2   │ 0.407952  │ 0.407952  │ 0.406015 │ 0.406015   │ 4.39712e7   │
│ 3   │ 0.378004  │ 0.378004  │ 0.376216 │ 0.376216   │ 2.6432e7    │
│ 4   │ 0.385453  │ 0.38739   │ 0.385453 │ 0.385453   │ 2.16104e7   │
│ 5   │ 0.396777  │ 0.398565  │ 0.396777 │ 0.396777   │ 1.83624e7   │
│ 6   │ 0.420914  │ 0.422851  │ 0.420914 │ 0.420914   │ 1.21576e7   │

Thanks to @Matt B. for the suggestions to simplify the code.

The problem with the above is that the column types inside the DataFrame are Any. To make it a bit more efficient, here are a few functions that get the job done:

# first, guess the Julia equivalent of type of the object
function guess_type(x::PyCall.PyObject)
  string_dtype = x[:dtype][:name]
  julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])

  return eval(parse("$julia_string"))
end

# convert an individual column, falling back to Any array if the guess was wrong
function convert_column(x)
  y = try Array{guess_type(x)}(x) catch; Array(x) end
  return y
end

# put everything together into a single function
function convert_pandas(df)
  colnames = map(Symbol, df[:columns])
  y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)

  return y
end

The above, when applied to your data, gives the same column names as before but with the correct Float64 column types:

y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype  │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1     │ Open        │ Float64 │ 0       │
│ 2     │ High        │ Float64 │ 0       │
│ 3     │ Low         │ Float64 │ 0       │
│ 4     │ Close       │ Float64 │ 0       │
│ 5     │ Volume      │ Float64 │ 0       │
│ 6     │ Ex-Dividend │ Float64 │ 0       │
│ 7     │ Split Ratio │ Float64 │ 0       │
│ 8     │ Adj. Open   │ Float64 │ 0       │
│ 9     │ Adj. High   │ Float64 │ 0       │
│ 10    │ Adj. Low    │ Float64 │ 0       │
│ 11    │ Adj. Close  │ Float64 │ 0       │
│ 12    │ Adj. Volume │ Float64 │ 0       │
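
As a side note (using the column names from the output above), columns whose names contain spaces or dots can't be written with the usual :Name shorthand; they need the Symbol constructor when indexing:

y[:Close]                   # plain names work as symbol literals
y[Symbol("Adj. Close")]     # names with spaces/dots need Symbol("...")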
answered by niczky12


You're running into a difference in Python/Pandas versions. I happen to have two configurations easily available to me: Pandas 0.18.0 on Python 2 and Pandas 0.19.1 on Python 3. The answer @niczky12 provided works well in the first configuration, but I see your Dict{Any,Any} behavior in the second. Something changes between those two configurations such that PyCall detects a mapping-like interface on the Pandas objects and exposes that interface as a dictionary through an automatic conversion. There are two options here (a quick way to check which versions you have is sketched after them):

  1. Work with the dictionary interface:

    data = quandl.get("WIKI/AAPL", returns = "pandas")
    cols = keys(data)
    df = DataFrame(Any[collect(values(data[c])) for c in cols], map(Symbol, cols))
    
  2. Explicitly disable the auto-conversion and use the PyCall interface to extract the columns, as @niczky12 demonstrated in the other answer. Note that data[:Open] will auto-convert to a mapped dictionary, whereas data["Open"] will just return a PyObject.

    data = pycall(quandl.get, PyObject, "WIKI/AAPL", returns = "pandas")
    cols = data[:columns]
    df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
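
If you're not sure which configuration you're on, you can check from the Julia side. A minimal sketch (the only assumptions are PyCall's pyversion constant and the standard pandas __version__ attribute):

using PyCall
PyCall.pyversion                    # the Python version PyCall is linked against
pyimport("pandas")[:__version__]    # the pandas version, e.g. "0.18.0" or "0.19.1"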
    

In both cases, though, note that the all-important date index isn't included in the resulting data frame. You almost certainly want to add that as a column:

df[:Date] = collect(data[:index])
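
Putting the pieces together, here's a minimal sketch of a reusable helper built from option 2 plus the date column above (the function name quandl_df and the second ticker are just illustrative):

using PyCall, DataFrames
@pyimport quandl

function quandl_df(code)
    # force quandl.get to hand back a raw PyObject so PyCall doesn't auto-convert
    data = pycall(quandl.get, PyObject, code, returns = "pandas")
    cols = data[:columns]
    df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
    df[:Date] = collect(data[:index])   # keep the date index as a column
    return df
end

# e.g. fetch several assets in one go
dfs = map(quandl_df, ["WIKI/AAPL", "WIKI/MSFT"])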
answered by mbauman