 

How to load a directory of GRIB files into a Dask array

Tags: dask, gdal, grib

Suppose I have a directory with thousands of GRIB files. I want to load those files into a dask array so I can query them. How can I go about doing this? The attempt below seems to work, but it requires each GRIB file to be opened up front, and it takes a long time to run and uses all of my memory. There must be a better way.

My attempt:

import dask.array as da
from dask import delayed
import gdal
import glob
import os


def load(filedir):
    # Eagerly open every GRIB file, read it fully into memory, and
    # wrap the resulting NumPy array in a dask array
    files = sorted(glob.glob(os.path.join(filedir, '*.grb')))
    data = [da.from_array(gdal.Open(f).ReadAsArray(),
                          chunks=[500, 500, 500], name=f)
            for f in files]
    # Stack the per-file arrays along a new leading axis
    return da.stack(data, axis=0)


file_dir = ...
array = load(file_dir)
Asked May 08 '17 by Philip Blankenau

1 Answer

The best way to do this would be to use dask.delayed. In this case, you'd create a delayed function to read the array, and then compose a dask array from those delayed objects using the da.from_delayed function. Something along the lines of:

import dask
import dask.array as da
import gdal


# This function isn't run until compute time
@dask.delayed(pure=True)
def load(path):
    return gdal.Open(path).ReadAsArray()


# Create several delayed objects, then turn each into a dask
# array. Note that you need to know the shape and dtype of each
# file
data = [da.from_delayed(load(f), shape=shape_of_f, dtype=dtype_of_f)
        for f in files]

x = da.stack(data, axis=0)
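If all of your files share the same dimensions and data type, one way to fill in shape_of_f and dtype_of_f is to inspect a single representative file. This is only a minimal sketch under that assumption; gdal_array.GDALTypeCodeToNumericTypeCode maps a GDAL band type code to a NumPy dtype.

# A minimal sketch, assuming every file has the same shape and dtype
# and more than one band (for a single-band file, ReadAsArray returns
# a 2D array and the shape would just be (RasterYSize, RasterXSize))
from osgeo import gdal, gdal_array

ds = gdal.Open(files[0])
shape_of_f = (ds.RasterCount, ds.RasterYSize, ds.RasterXSize)
dtype_of_f = gdal_array.GDALTypeCodeToNumericTypeCode(
    ds.GetRasterBand(1).DataType)
del ds  # close the dataset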

Note that this makes a single task for loading each file. If the individual files are large, you may want to chunk them yourself in the load function. I'm not familiar with gdal, but from a brief look at the ReadAsArray method this may be doable with the xoff/yoff/xsize/ysize parameters (not sure). You'd have to write this code yourself, but it may be more performant for large files.
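For illustration, here is a rough, untested sketch of what per-file chunking via those window arguments might look like. It assumes each file is a (bands, y, x) raster, that Dataset.ReadAsArray(xoff, yoff, xsize, ysize) returns only that spatial window, and it uses a hypothetical load_chunked helper and a 500-pixel chunk size that are not part of the original question.

# A rough sketch of windowed reads; names and chunk size are
# illustrative assumptions, not a tested recipe
import dask
import dask.array as da
import gdal  # or: from osgeo import gdal


@dask.delayed(pure=True)
def load_window(path, xoff, yoff, xsize, ysize):
    # Only this window of the file is read at compute time
    return gdal.Open(path).ReadAsArray(xoff, yoff, xsize, ysize)


def load_chunked(path, nbands, ny, nx, dtype, chunk=500):
    # Build a 2D grid of delayed window reads ...
    rows = []
    for yoff in range(0, ny, chunk):
        ysize = min(chunk, ny - yoff)
        row = []
        for xoff in range(0, nx, chunk):
            xsize = min(chunk, nx - xoff)
            win = load_window(path, xoff, yoff, xsize, ysize)
            row.append(da.from_delayed(win, shape=(nbands, ysize, xsize),
                                       dtype=dtype))
        rows.append(row)
    # ... and stitch them back together; da.block concatenates the
    # nested lists along the last two (y, x) axes, leaving bands intact
    return da.block(rows)

Each window then becomes its own task, so downstream operations can work chunk by chunk without ever materializing a whole file in memory.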

Alternatively you could use the from_delayed code above (one task per file), and then call rechunk to rechunk into smaller chunks. This would still read each file in a single task, but subsequent steps could work with smaller chunks. Whether this is worth it depends on the size of your individual files.

x = x.rechunk((500, 500, 500))  # or whatever chunks you want
Answered Oct 15 '22 by jiminy_crist