What is the best strategy to cache/generate the dataset in a TPU acceptable way?
So far I have managed to train TensorFlow models on a dataset I create myself. Each data point is heavily engineered from a large time series using bespoke logic built on NumPy, pandas, SciPy and other Python packages. The final dataset creation step looks like this:
train_ds = tf.data.Dataset.from_generator(
    generator=data_gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, len(cols_of_interest)], ()),
    args=([True, datafile, REGRESSION, num_features])
)
The model is fine and converges when using CPUs.
When I move to the TPU in Google Colab, I get an error indicating that my data_gen function cannot run on the TPU:
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
NotFoundError: No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}}
. Registered: <no registered kernels>
[[PyFunc]]
[[IteratorGetNext]]
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Additional GRPC error information:
{"created":"@1576029311.304590975","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":" No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}}\n\t. Registered: <no registered kernels>\n\n\t [[PyFunc]]\n\t [[IteratorGetNext]]\n\tEncountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.","grpc_status":5} [Op:__inference_distributed_function_5899]
Function call stack:
distributed_function -> distributed_function
In my setup, I generate multiple data points from one single original data file. Everything varies with the content of the file, so there is no way to tell upfront how many data points there are, and there is no correlation between the number of files and the final number of data points. I also looked at TimeseriesGenerator, and it doesn't work for my case.
I'm working with medical sensor data, so the original files are in the range of tens of GB. Saving everything in one big file doesn't really work.
I've never worked with TPUs, but your error seems related to mixed CPU/TPU execution. The TensorFlow dataset API uses graph-mode execution and thus requires that your operations be available on your accelerator, which is probably not the case for a custom Python generator (it shows up as the PyFunc operation in your trace). This is my understanding, and it may be wrong.
That said, one way to work around this is to separate your dataset generation from your TPU training.
First, run a separate job to generate your dataset and save it to disk. The TFRecord format lets you split your dataset into many small files instead of one big one. You can also look at size optimizations: compress the records, use sparse tensors where possible, etc.
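As a minimal sketch of that first step: the snippet below writes (features, label) pairs into several GZIP-compressed TFRecord shards. The dummy data_gen, the feature matrix shape and the shard count are placeholders standing in for your bespoke pipeline; the variable-length float32 matrix is stored as a serialized tensor so its shape survives the round trip.

```python
import numpy as np
import tensorflow as tf

# Placeholder for the question's bespoke generator: yields
# (features, label) pairs with a variable-length feature matrix.
def data_gen():
    for _ in range(10):
        features = np.random.rand(100, 8).astype(np.float32)
        label = np.float32(np.random.rand())
        yield features, label

def serialize_example(features, label):
    # Store the matrix as a serialized tensor (shape is preserved)
    # and the label as a plain float feature.
    feature = {
        "features": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(features).numpy()])),
        "label": tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(label)])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Round-robin the examples over a few compressed shard files.
num_shards = 4
writers = [
    tf.io.TFRecordWriter(f"train-{i:05d}-of-{num_shards:05d}.tfrecord",
                         options="GZIP")
    for i in range(num_shards)
]
for i, (features, label) in enumerate(data_gen()):
    writers[i % num_shards].write(serialize_example(features, label))
for w in writers:
    w.close()
```

Because the number of data points per source file is unknown upfront, round-robin writing like this keeps the shards roughly balanced without needing a count in advance.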
Then you can load it with the more conventional tf.data.TFRecordDataset, which I believe is better suited to TPU execution than a generator-based dataset.
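A sketch of the reading side, assuming records were written with a serialized float32 matrix under "features" and a scalar under "label" (the file name and the in-block sample writer are only there so the snippet runs standalone):

```python
import numpy as np
import tensorflow as tf

# Write one small sample record so this sketch runs standalone;
# in practice the files come from the offline generation job.
sample = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[tf.io.serialize_tensor(
            np.random.rand(50, 8).astype(np.float32)).numpy()])),
    "label": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
}))
with tf.io.TFRecordWriter("sample-00000.tfrecord", options="GZIP") as w:
    w.write(sample.SerializeToString())

def parse_example(record):
    # Recover the variable-length float32 matrix and the scalar label.
    parsed = tf.io.parse_single_example(record, {
        "features": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.float32),
    })
    features = tf.io.parse_tensor(parsed["features"], out_type=tf.float32)
    return features, parsed["label"]

train_ds = (
    tf.data.TFRecordDataset(["sample-00000.tfrecord"],
                            compression_type="GZIP")
    .map(parse_example)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```

Everything in this pipeline is built from regular TensorFlow ops, so there is no PyFunc node left for the TPU runtime to reject.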