How to process huge datasets in kedro

Tags:

python

kedro

I have a pretty big (~200 GB, ~20M lines) raw jsonl dataset. I need to extract important properties from it and store the intermediate dataset as csv for further conversion into something like HDF5, parquet, etc. Obviously, I can't use JSONDataSet for loading the raw dataset, because it uses pandas.read_json under the hood, and using pandas on a dataset of this size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing each line, and appending the processed data line by line to the intermediate dataset.

What I can't understand is how to make this compatible with AbstractDataSet with its _load and _save methods.
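
For concreteness, here is roughly what I have in mind. This is a minimal sketch, assuming a custom dataset is allowed to hand back a generator; the class name and the streaming approach are my own guesses, not an established kedro recipe:

import json

from kedro.io import AbstractDataSet


class JsonlLinesDataSet(AbstractDataSet):
    # hypothetical dataset that streams a jsonl file line by line

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        # return a generator so the whole ~200 GB file is never in memory
        def lines():
            with open(self._filepath, "r", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        return lines()

    def _save(self, data) -> None:
        # data is expected to be an iterable of dicts, written back as jsonl
        with open(self._filepath, "w", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")

    def _describe(self):
        return dict(filepath=self._filepath)

But I'm not sure whether returning a generator from _load plays well with the rest of kedro, or whether appending line by line belongs in _save at all.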

P.S. I understand I can move this out of kedro's context and introduce the preprocessed dataset as the raw one, but that kinda breaks the whole idea of complete pipelines.

asked Feb 20 '20 by eawer

People also ask

How does Kaggle deal with large datasets?

When working with a new dataset, I usually create a first notebook to load all relevant files, convert datatypes, and save the DataFrame as a pickle file, then only load this in the main feature-engineering notebook. This saves time and memory when I actually start working with the data.

Is Kedro useful?

You can focus on solving problems rather than setting up projects: Kedro provides the scaffolding to build more complex data and machine-learning pipelines. The focus is on spending less time on the tedious “plumbing” required to maintain analytics code, which means you have more time to solve new problems.

How does Kedro work?

Kedro uses configuration files to make a project's code reproducible across different environments where it may need to reference datasets in different locations. This tutorial makes use of three datasets for spaceflight companies shuttling customers to the moon and back, and uses two data formats: .csv and .xlsx.


1 Answer

Try using pyspark to leverage lazy evaluation and batched execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.spark_data_set.

Sample catalog config for jsonl:

your_dataset_name:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/file_path"
  file_format: json
  load_args:
    # jsonl (one JSON object per line) is Spark's default json layout,
    # so multiline stays false; set it to true only when a single JSON
    # record spans several lines
    multiline: False
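
A node downstream of this entry receives a pyspark DataFrame and can build the extraction lazily; nothing is read until a save or another action forces execution. A minimal sketch, where the column names are placeholders rather than anything from the question:

from pyspark.sql import DataFrame


def extract_properties(raw: DataFrame) -> DataFrame:
    # select() only builds a lazy query plan; Spark reads and processes
    # the file in batches once the node's output is saved
    return raw.select("property_a", "property_b")

Saving the node's output through a second SparkDataSet catalog entry then triggers the batched execution, and you can write parquet directly instead of going through a csv intermediate, for example:

intermediate_dataset:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/intermediate_path"
  file_format: parquet
  save_args:
    mode: overwrite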
answered Oct 11 '22 by gilgorio