I have a list of URLs, and I'd love to read them all into a dask dataframe at once, but it looks like read_csv can't use an asterisk (glob pattern) with http. Is there any way to achieve that?
Here is an example:
link = 'http://web.mta.info/developers/'
data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]
and what I want is
df = dd.read_csv('XXXX*X')
Try using dask.delayed to turn each of your URLs into a lazy pandas dataframe, and then use dask.dataframe.from_delayed to turn those lazy values into a full dask dataframe:
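import pandas as pd
import dask
import dask.dataframe as dd
# Build the full URLs from the base link and relative paths in the question
urls = [link + path for path in data]
# Wrap each read in dask.delayed so nothing is downloaded yet
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]
# Combine the lazy pandas dataframes into one dask dataframe
df = dd.from_delayed(dfs)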
This will read one of your links immediately in order to figure out the metadata (columns, dtypes). If you know the columns and dtypes ahead of time, you can avoid this by passing an empty sample dataframe with that schema: dd.from_delayed(..., meta=sample_df)
See also: http://dask.pydata.org/en/latest/delayed-collections.html