How to process huge datasets in kedro

Tags:

python

kedro

I have a pretty big (~200 GB, ~20M lines) raw jsonl dataset. I need to extract important properties from it and store the intermediate dataset as csv for further conversion into something like HDF5, parquet, etc. Obviously, I can't use JSONDataSet for loading the raw dataset, because it uses pandas.read_json under the hood, and using pandas on a dataset of this size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing each line, and appending the processed data line by line to the intermediate dataset.

What I can't understand is how to make this compatible with AbstractDataSet with its _load and _save methods.
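
For concreteness, here is roughly what I have in mind. This is a minimal sketch, assuming a custom dataset is allowed to hand back a generator; the class name and the streaming approach are my own guesses, not an established kedro recipe:

import json

from kedro.io import AbstractDataSet


class JsonlLinesDataSet(AbstractDataSet):
    # hypothetical dataset that streams a jsonl file line by line

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        # return a generator so the whole ~200 GB file is never in memory
        def lines():
            with open(self._filepath, "r", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        return lines()

    def _save(self, data) -> None:
        # data is expected to be an iterable of dicts, written back as jsonl
        with open(self._filepath, "w", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")

    def _describe(self):
        return dict(filepath=self._filepath)

But I'm not sure whether returning a generator from _load plays well with the rest of kedro, or whether appending line by line belongs in _save at all.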

P.S. I understand I can move this out of kedro's context and introduce the preprocessed dataset as the raw one, but that kinda breaks the whole idea of complete pipelines.

asked Feb 20 '20 by eawer

People also ask

How does Kaggle deal with large datasets?

When working with a new dataset, I usually create a first notebook to load all relevant files, convert datatypes, and save the DataFrame as a pickle file, then only load this in the main feature-engineering notebook. This saves time and memory when I actually start working with the data.

Is Kedro useful?

You can focus on solving problems rather than setting up projects: Kedro provides the scaffolding to build more complex data and machine-learning pipelines. The focus is on spending less time on the tedious “plumbing” required to maintain analytics code, which means you have more time to solve new problems.

How does Kedro work?

Kedro uses configuration files to make a project's code reproducible across different environments where it may need to reference datasets in different locations. This tutorial makes use of three datasets for spaceflight companies shuttling customers to the moon and back, and uses two data formats: .csv and .xlsx.


1 Answer

Try using pyspark to leverage lazy evaluation and batched execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.spark_data_set.

Sample catalog config for jsonl:

your_dataset_name:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/file_path"
  file_format: json
  load_args:
    # jsonl (one JSON object per line) is Spark's default json layout,
    # so multiline stays false; set it to true only when a single JSON
    # record spans several lines
    multiline: False
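
A node downstream of this entry receives a pyspark DataFrame and can build the extraction lazily; nothing is read until a save or another action forces execution. A minimal sketch, where the column names are placeholders rather than anything from the question:

from pyspark.sql import DataFrame


def extract_properties(raw: DataFrame) -> DataFrame:
    # select() only builds a lazy query plan; Spark reads and processes
    # the file in batches once the node's output is saved
    return raw.select("property_a", "property_b")

Saving the node's output through a second SparkDataSet catalog entry then triggers the batched execution, and you can write parquet directly instead of going through a csv intermediate, for example:

intermediate_dataset:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/intermediate_path"
  file_format: parquet
  save_args:
    mode: overwrite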
answered Oct 11 '22 by gilgorio