I would like to retrieve some data from quandl
and analyse them in Julia. There is, unfortunately, no official API available for this (yet). I am aware of this solution, but it is still quite limited in functionality and doesn't follow the same syntax as the original Python API.
I thought it would be a smart thing to use PyCall
to retrieve the data using the official Python API from within Julia. This does yield an output, but I'm not sure how I can convert it to a format that I would be able to use within Julia (ideally a DataFrame
).
I have tried the following.
using PyCall, DataFrames
@pyimport quandl
data = quandl.get("WIKI/AAPL", returns = "pandas");
Julia converts this output to a Dict{Any,Any}
. When using returns = "numpy"
instead of returns = "pandas"
, I end up with a PyObject rec.array
.
How can I get data
to be a Julia DataFrame
as quandl.jl
would return it? Note that quandl.jl
is not an option for me because it doesn't support automatic retrieval of multiple assets and lacks several other features, so it's essential that I can use the Python API.
Thank you for any suggestions!
Here's one option:
First, extract the column names from you data
object:
julia> colnames = map(Symbol, data[:columns]);
12-element Array{Symbol,1}:
:Open
:High
:Low
:Close
:Volume
Symbol("Ex-Dividend")
Symbol("Split Ratio")
Symbol("Adj. Open")
Symbol("Adj. High")
Symbol("Adj. Low")
Symbol("Adj. Close")
Symbol("Adj. Volume")
Then pour all your columns into an DataFrame:
julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)
6×12 DataFrames.DataFrame
│ Row │ Open │ High │ Low │ Close │ Volume │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1 │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0 │ 1.0 │
│ 2 │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0 │ 1.0 │
│ 3 │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0 │ 1.0 │
│ 4 │ 25.87 │ 26.0 │ 25.87 │ 25.87 │ 385900.0 │ 0.0 │ 1.0 │
│ 5 │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0 │ 1.0 │
│ 6 │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0 │ 1.0 │
│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1 │ 0.428364 │ 0.430152 │ 0.428364 │ 0.428364 │ 1.17258e8 │
│ 2 │ 0.407952 │ 0.407952 │ 0.406015 │ 0.406015 │ 4.39712e7 │
│ 3 │ 0.378004 │ 0.378004 │ 0.376216 │ 0.376216 │ 2.6432e7 │
│ 4 │ 0.385453 │ 0.38739 │ 0.385453 │ 0.385453 │ 2.16104e7 │
│ 5 │ 0.396777 │ 0.398565 │ 0.396777 │ 0.396777 │ 1.83624e7 │
│ 6 │ 0.420914 │ 0.422851 │ 0.420914 │ 0.420914 │ 1.21576e7 │
Thanks to @Matt B. for the suggestions to simplify the code.
The problem with the above is that the column types are Any
inside the dataframe. To make it a bit more efficient here are a few functions which get the job done:
# first, guess the Julia equivalent of type of the object
function guess_type(x::PyCall.PyObject)
string_dtype = x[:dtype][:name]
julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])
return eval(parse("$julia_string"))
end
# convert an individual column, falling back to Any array if the guess was wrong
function convert_column(x)
y = try Array{guess_type(x)}(x) catch Array(x) end
return y
end
# put everything together into a single function
function convert_pandas(df)
colnames = map(Symbol, data[:columns])
y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)
return y
end
The above, when applied to your data
gives the same column names as before, but with correct Float64
column types:
y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1 │ Open │ Float64 │ 0 │
│ 2 │ High │ Float64 │ 0 │
│ 3 │ Low │ Float64 │ 0 │
│ 4 │ Close │ Float64 │ 0 │
│ 5 │ Volume │ Float64 │ 0 │
│ 6 │ Ex-Dividend │ Float64 │ 0 │
│ 7 │ Split Ratio │ Float64 │ 0 │
│ 8 │ Adj. Open │ Float64 │ 0 │
│ 9 │ Adj. High │ Float64 │ 0 │
│ 10 │ Adj. Low │ Float64 │ 0 │
│ 11 │ Adj. Close │ Float64 │ 0 │
│ 12 │ Adj. Volume │ Float64 │ 0 │
You're running into a difference in Python/Pandas versions. I happen to have two configurations easily available to me; Pandas 0.18.0 in Python 2 and Pandas 0.19.1 in Python 3. The answer @niczky12 provided works well in the first configuration, but I'm seeing your Dict{Any,Any}
behavior in the second configuration. Basically, something changes between those two configurations such that PyCall detects a mapping-like interface for Pandas objects and then exposes that interface as a dictionary through an automatic conversion. There are two options here:
Work with the dictionary interface:
data = quandl.get("WIKI/AAPL", returns = "pandas")
cols = keys(data)
df = DataFrame(Any[collect(values(data[c])) for c in cols], map(Symbol, cols))
Explicitly disable the auto-conversion and use the PyCall interface to extract the columns as niczky12 demonstrated in the other answer. Note that data[:Open]
will do auto-conversion to a mapped dictionary and data["Open"]
will just return a PyObject
.
data = pycall(quandl.get, PyObject, "WIKI/AAPL", returns = "pandas")
cols = data[:columns]
df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
In both cases, though, note that the all-important date index isn't included in the resulting data frame. You almost certainly want to add that as a column:
df[:Date] = collect(data[:index])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With