I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables, copying one slice at a time and operating on that slice. Unfortunately, it is taking a very long time just to read each slice, which is killing performance.
For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:, :, 0], so I do the following:
import netCDF4 as nc
f = nc.Dataset('myfile.ncdf', 'r+')
myvar = f.variables['myvar']
myslice = myvar[:, :, 0]  # this read is the slow step
But the last step takes a really long time (~5 minutes on my system). If, for example, I save a variable of shape (500, 500, 300) in the netCDF file, then a read of the same size takes only a few seconds.
Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first, but the array is too large to transpose in memory, and attempting it seems even slower given that a single read already takes so long. What I would like is a quick way to read a slice of a netCDF file, in the fashion of the Fortran interface's get_vara function, or some way of efficiently transposing the array.
You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:
http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html
The idea is to "rechunk" the file by specifying the shapes of the chunks (multidimensional tiles) you want for its variables. You can specify how much memory to use as a copy buffer and how much to use for chunk caches, but it is not clear how to divide memory optimally between these uses, so you may just have to try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it by specifying chunks that hold a lot of data along the big dimensions of your slice and only a few values along the other dimensions.
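As a rough sketch (not from your file, so treat everything here as an assumption): suppose the four dimensions of your (500, 500, 450, 300) variable are named x, y, z, and t (check the real names with ncdump -h), and the slice [:, :, 0] fixes z. Then a partial transpose could use big chunks along x, y, and t and chunks of size 1 along z. Invoked from Python, this might look like the following; the chunk sizes, memory settings, and output file name are illustrative guesses to tune, not recommended values:

import subprocess

# Rechunk with nccopy (assumed to be on the PATH). Dimension names and
# sizes below are hypothetical -- substitute the ones from ncdump -h.
subprocess.run(
    [
        "nccopy",
        "-c", "x/100,y/100,z/1,t/300",   # large tiles along x, y, t; thin along z
        "-m", "4G",                      # size of the copy buffer
        "-h", "4G",                      # size of the chunk cache
        "myfile.ncdf",                   # input file
        "myfile_rechunked.ncdf",         # output file (hypothetical name)
    ],
    check=True,
)

With chunks of size 1 along z, reading myvar[:, :, 0] from the rechunked file only touches the chunks at z index 0 instead of scanning the whole variable; the tradeoff is the one-time cost of rewriting the ~400 GB file.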