
How to create Dask DataFrame from a list of urls?

I have a list of URLs, and I'd like to read them all into a single Dask DataFrame at once, but it looks like read_csv can't expand an asterisk (glob pattern) over HTTP. Is there any way to achieve that?

Here is an example:

link = 'http://web.mta.info/developers/'

data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]

What I want is something like:

df = dd.read_csv('XXXX*X')

Philipp_Kats asked Feb 04 '23 18:02


1 Answer

Try using dask.delayed to turn each of your URLs into a lazy pandas DataFrame, then use dask.dataframe.from_delayed to combine those lazy values into a single Dask DataFrame.

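import pandas as pd
import dask
import dask.dataframe as dd

# Build the full URLs from the base link and relative paths in the question
urls = [link + path for path in data]

# One lazy pandas DataFrame per URL; nothing is downloaded at this point
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]

# Combine the lazy pieces into a single Dask DataFrame
df = dd.from_delayed(dfs)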

Without meta, this will read one of your links immediately in order to figure out the metadata (column names and dtypes). If you know the columns and dtypes ahead of time, you can avoid that read by passing an empty sample DataFrame with the same columns and dtypes: dd.from_delayed(..., meta=sample_df)
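As a rough sketch of the meta approach, the snippet below builds an empty sample DataFrame and passes it to from_delayed. The column names and dtypes in assumed_dtypes are placeholders chosen for illustration, not the confirmed layout of the turnstile files, and build_lazy_frame is a hypothetical helper name; substitute the real schema before relying on it.

```python
import pandas as pd

link = 'http://web.mta.info/developers/'
data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
]
urls = [link + path for path in data]

# ASSUMPTION: placeholder schema for illustration only.
# Replace with the actual column names and dtypes of your files.
assumed_dtypes = {'STATION': 'object', 'ENTRIES': 'int64', 'EXITS': 'int64'}

# An empty frame that carries only column names and dtypes (no rows).
sample_df = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in assumed_dtypes.items()}
)


def build_lazy_frame(urls, meta):
    """Hypothetical helper: wrap each URL lazily and combine with known meta."""
    # Imported here so the schema part above runs even without dask installed.
    import dask
    import dask.dataframe as dd

    parts = [dask.delayed(pd.read_csv)(url) for url in urls]
    # With meta supplied, dask does not need to read a file to infer the schema.
    return dd.from_delayed(parts, meta=meta)
```

Calling df = build_lazy_frame(urls, sample_df) should then return a lazy Dask DataFrame; the files are only fetched once you compute something on it. If meta does not match what pd.read_csv actually returns, expect errors or wrong dtypes at compute time.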

See also: http://dask.pydata.org/en/latest/delayed-collections.html

MRocklin answered Feb 07 '23 08:02