
How to identify each file of origin when concatenating many netCDF files with ncrcat?

I am concatenating thousands of nc-files (outputs from simulations) to allow me to handle them more easily in Matlab. To do this I use ncrcat. The files have different sizes, and the time variable is not unique between files. The concatenation works well and allows me to read the data into Matlab much more quickly than reading the files individually. However, I want to be able to identify the original nc-file from which each data point originates. Is it possible to, say, add the source filename as an extra variable so I can trace back the data?

Chris Coffey asked Mar 04 '23 21:03

2 Answers

Easiest way: Online indexing

Before we start, I would use an integer index rather than the filename to identify each run, as it is much easier to handle, both when writing it and later when handling it in the Matlab programme. Rather than a simple monotonically increasing index, the identifier can carry meaning for your run, or you can even write several separate indices if necessary (e.g. one number for the resolution, one for the date, one for the model version, etc.).

So, the obvious way to do this is for each simulation to write an index to its output file to identify itself, i.e. the first model run would write a variable

myrun=1

the second

myrun=2

and so on... then when you cat the files the data can be uniquely identified very easily using this index.
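One way to arrange this without touching the model code itself is to hand each run its index at launch time (e.g. via an environment variable the model then writes into its output as myrun). The sketch below is hypothetical: `model_run.sh`, the variable name `MYRUN`, and the config names are assumptions, and it is a dry run that only prints what would be launched.

```shell
# Hypothetical launcher sketch: assign each simulation a unique integer id at
# submission time so the model can record it in its output as myrun.
# This is a dry run; it only prints the commands it would execute.
id=0
for config in lowres_2020 hires_2020; do
  id=$((id+1))
  echo "MYRUN=${id} ./model_run.sh ${config}"
done
```

The same counter can of course be replaced by any meaningful identifier (date, resolution, model version) as discussed above.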

Note that if your spatial dimensions are not unique and, from what you write, the number of time steps also changes from run to run, your index will need to be a function of all the non-unique dimensions, e.g. myrun(x,y,t). If any of your dimensions is unique across all files, then that dimension is redundant in the index and can be omitted.

Of course, the only issue with this solution is that it means running the simulations again :-D and you might be talking about an expensive model, or someone else's runs that you can't repeat. If rerunning is out of the question you will need to try to add an index offline...

Offline indexing (easy if grids are same, more complex otherwise)

IF your spatial dimensions are the same across all files, then this is still an easy task, as you can add an index offline across all the time steps in each file using nco:

ncap2 -s 'myrun[$time]=array(X,0,$time)' infile.nc  outfile.nc

or if you are happy to overwrite the original file (be careful!)

ncap2 -O -s 'myrun[$time]=array(X,0,$time)' infile.nc infile.nc

where X is the run number. This adds a new variable myrun, which is a function of time, and sets it to X at every time step. When you merge the files you can then see which data slice came from which specific run.

By the way, the second argument of array(), the zero, is the increment. As it is set to zero, the number X is written for all time steps in a given file; if it were 1 instead, the index would increase by one each time step, which could be useful in some cases. For example, you might use two indices: one with an increment of zero to identify the run, and a second with an increment of one to tell you easily which step of the Xth run a data slice belongs to.
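To apply this to thousands of existing files, the per-file run number can be generated in a simple shell loop. The sketch below is a dry run: it only prints one ncap2 command per file (combining the run-id and step-counter ideas above), so you can inspect the commands before executing them. The run_*.nc filenames are hypothetical.

```shell
# Dry run: print one ncap2 command per file, tagging it with a constant run id
# (myrun) and a per-step counter (mystep). Remove the leading "echo" to
# actually execute the commands (requires nco). Filenames are hypothetical.
X=0
for f in run_001.nc run_002.nc run_003.nc; do
  X=$((X+1))
  echo "ncap2 -s 'myrun[\$time]=${X}; mystep[\$time]=array(0,1,\$time)' ${f} indexed_${f}"
done
```

With a real file set you would replace the hard-coded list by a glob such as run_*.nc.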

If your files are for different domains too, then you might want to put them on a common grid before you do that... I think for that

cdo enlarge

might be of help; see this post: https://code.mpimet.mpg.de/boards/2/topics/1459

Adrian Tompkins answered May 18 '23 21:05


I agree that an index will be simpler than a filename. I would just add to the above answer that the command to add a unique index X with a time dimension to each input file can be simplified to

ncap2 -s 'myrun[$time]=X' in.nc out.nc
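If you also want to recover the original filename, as the question asks, you can record the filename-to-index mapping in a small lookup table while tagging the files. A dry-run sketch under the same assumptions (hypothetical filenames; the ncap2 commands are only printed, not executed):

```shell
# Build a lookup table mapping each run index to its source filename, and
# print the ncap2 tagging command for each file. This is a dry run; drop
# the "echo" to execute with nco installed. Filenames are hypothetical.
: > run_table.txt            # start a fresh "index filename" table
i=0
for f in sim_alpha.nc sim_beta.nc; do
  i=$((i+1))
  echo "${i} ${f}" >> run_table.txt
  echo "ncap2 -s 'myrun[\$time]=${i}' ${f} tagged_${f}"
done
# run_table.txt now maps any myrun value back to its source file
```

In Matlab you can then read run_table.txt alongside the concatenated file to translate each myrun value into the original filename.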

Charlie Zender answered May 18 '23 20:05