I have a pretty big (~200 GB, ~20M lines) raw JSONL dataset. I need to extract important properties from it and store the intermediate dataset as CSV for further conversion into something like HDF5, Parquet, etc. Obviously, I can't use JSONDataSet to load the raw dataset, because it uses pandas.read_json
under the hood, and using pandas on a dataset of that size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing it, and appending the processed data line by line to the intermediate dataset.
What I can't understand is how to make this compatible with AbstractDataSet
with its _load
and _save
methods.
P.S. I understand I could move this outside Kedro's context and introduce the preprocessed dataset as a raw one, but that kind of breaks the whole idea of complete pipelines.
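One way to make line-by-line processing fit the `_load`/`_save` contract is to have `_load` return a lazy generator and `_save` consume any iterable. Below is a minimal stand-alone sketch of that idea in plain Python; in real Kedro code these classes would subclass `kedro.io.AbstractDataSet`, and the class names `JsonlLinesDataSet`/`CsvAppendDataSet` here are hypothetical, not part of Kedro's API:

```python
import csv
import json


class JsonlLinesDataSet:
    """Sketch of a dataset that streams a JSONL file line by line.

    In Kedro this would subclass kedro.io.AbstractDataSet; only the
    _load/_save shape matters here.
    """

    def __init__(self, filepath):
        self._filepath = filepath

    def _load(self):
        # Return a lazy generator so the ~200 GB file is never
        # fully materialized in memory.
        def generator():
            with open(self._filepath, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield json.loads(line)

        return generator()


class CsvAppendDataSet:
    """Sketch of a sink that writes processed rows to CSV one at a time."""

    def __init__(self, filepath, fieldnames):
        self._filepath = filepath
        self._fieldnames = fieldnames

    def _save(self, rows):
        # `rows` is any iterable of dicts, e.g. the generator above
        # after a processing node; rows are written out one by one.
        with open(self._filepath, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=self._fieldnames)
            writer.writeheader()
            for row in rows:
                writer.writerow({k: row.get(k) for k in self._fieldnames})
```

A pipeline node then just transforms one iterable into another, so the whole load–process–save chain stays lazy and memory use stays bounded by one line (or one batch) at a time.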
When working with a new dataset, I usually create a first notebook to load all relevant files, convert datatypes, and save the DataFrame as a pickle file; the main feature-engineering notebook then only loads that pickle. This saves time and memory when I actually start working with the data.
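That caching step can be sketched with the standard library alone (for DataFrames, pandas `to_pickle`/`read_pickle` play the same role; the cache path and function names below are made up for illustration):

```python
import pickle
from pathlib import Path


def load_and_convert():
    # Stand-in for the expensive load-files-and-convert-dtypes step
    # that would normally live in the first notebook.
    return [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.50}]


def load_cached(cache_path):
    # Build the pickle cache on first use; reuse it on every later run.
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    data = load_and_convert()
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

The second and all later calls skip the expensive step entirely, which is where the time savings come from.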
Kedro lets you focus on solving problems rather than setting up projects: it provides the scaffolding to build more complex data and machine-learning pipelines. The focus is on spending less time on the tedious “plumbing” required to maintain analytics code, which leaves more time to solve new problems.
Kedro uses configuration files to make a project's code reproducible across different environments, where it may need to reference datasets in different locations. The Kedro tutorial makes use of three datasets for spaceflight companies shuttling customers to the moon and back, and uses two data formats: .csv and .xlsx.
Try using pyspark to leverage lazy evaluation and batch execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.
Sample catalog config for JSONL:

your_dataset_name:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/file_path"
  file_format: json
  load_args:
    multiLine: False  # JSONL has one record per line, which is Spark's default
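If Spark is not an option, the same batch-execution idea can be approximated in plain Python by consuming the JSONL file in fixed-size chunks; this is a sketch, and the function name and batch size below are illustrative, not from any library:

```python
import json
from itertools import islice


def iter_batches(path, batch_size=10_000):
    # Yield lists of parsed records, batch_size at a time, so only one
    # small batch is ever held in memory regardless of file size.
    with open(path, encoding="utf-8") as f:
        records = (json.loads(line) for line in f if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch
```

Each batch can then be processed and appended to the intermediate CSV before the next one is read, keeping peak memory proportional to `batch_size`.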