
How to iterate through a dataset from Hugging Face?

I have a dataset saved on a local drive. I can load it, convert it to JSON or CSV, and save the result back to my local drive. Is there a way to stream this data and iterate over it item by item? I could save it as a CSV and process the data row by row, but can I iterate over ds directly instead?

from datasets import load_from_disk
import json


ds = load_from_disk(f"local_path"...)
ds.to_json("mydata.json")  # the output path must be passed as a string



ozil asked Oct 15 '25 19:10



1 Answer

You can iterate over a Hugging Face dataset with its iter() method, which yields the data in batches:

from datasets import load_from_disk
ds = load_from_disk(f"local_path"...)
# ds.iter() yields batches; each batch is a dict mapping a column name to a list of values
for batch in ds.iter(batch_size=1):
    print(f'i is {batch}')

Here is some example output:

i is {'question_id': [34705205], 'parent_answer_post_id': [34705233], 'prob': [0.8690001442846342], 'snippet': ['sorted(l, key=lambda x: (-int(x[1]), x[0]))'], 'intent': ['Sort a nested list by two elements'], 'id': ['34705205_34705233_0']}
i is {'question_id': [13905936], 'parent_answer_post_id': [13905946], 'prob': [0.8526701436370034], 'snippet': ['[int(x) for x in str(num)]'], 'intent': ['converting integer to list in python'], 'id': ['13905936_13905946_0']}
i is {'question_id': [13837848], 'parent_answer_post_id': [13838041], 'prob': [0.8521431843789492], 'snippet': ["c.decode('unicode_escape')"], 'intent': ['Converting byte string in unicode string'], 'id': ['13837848_13838041_0']}
i is {'question_id': [23490152], 'parent_answer_post_id': [23490179], 'prob': [0.850829261966626], 'snippet': ["parser.add_argument('-t', dest='table', help='', nargs='+')"], 'intent': ['List of arguments with argparse'], 'id': ['23490152_23490179_0']}
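
Note that each batch in the output above is column-oriented: every key maps to a list, even with batch_size=1. If you want plain row dicts instead, a small helper along these lines might work. This is only a sketch: batch_to_rows is a hypothetical name, not part of the datasets library, and the sample batch is copied from the first line of the output above so the snippet runs without a dataset on disk.

```python
from typing import Any, Dict, Iterator, List


def batch_to_rows(batch: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Turn one column-oriented batch into one dict per row.

    Hypothetical helper, not part of the datasets library.
    """
    columns = list(batch)
    # zip walks all column lists in lockstep, producing one tuple per row
    for values in zip(*(batch[name] for name in columns)):
        yield dict(zip(columns, values))


# First batch from the example output above (batch_size=1)
batch = {
    'question_id': [34705205],
    'parent_answer_post_id': [34705233],
    'prob': [0.8690001442846342],
    'snippet': ['sorted(l, key=lambda x: (-int(x[1]), x[0]))'],
    'intent': ['Sort a nested list by two elements'],
    'id': ['34705205_34705233_0'],
}

rows = list(batch_to_rows(batch))
for row in rows:
    print(row['id'], row['intent'])
```

With a loaded ds, the same helper could be applied to every batch from ds.iter(batch_size=...). Iterating over the dataset directly (for row in ds) also yields one dict per row, which may be simpler when batching is not needed.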
Robin answered Oct 18 '25 08:10
