How to use very large dataset in RNN TensorFlow?

I have a very large dataset: 7.9 GB of CSV files, 80% of which will serve as training data and the remaining 20% as test data. While loading the training data (6.2 GB), I get a MemoryError at the 80th iteration (80th file). Here is the script I am using to load the data:

import pandas as pd
import os

col_names = ['duration', 'service', 'src_bytes', 'dest_bytes', 'count', 'same_srv_rate',
        'serror_rate', 'srv_serror_rate', 'dst_host_count', 'dst_host_srv_count',
        'dst_host_same_src_port_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
        'flag', 'ids_detection', 'malware_detection', 'ashula_detection', 'label', 'src_ip_add',
        'src_port_num', 'dst_ip_add', 'dst_port_num', 'start_time', 'protocol']

# create a list to store the filenames
files = []

# create a dataframe to store the contents of CSV files
df = pd.DataFrame()

# get the filenames in the specified PATH
for (dirpath, dirnames, filenames) in os.walk(path):
    ''' Append to the list the filenames under the subdirectories of the <path> '''
    files.extend(os.path.join(dirpath, filename) for filename in filenames)

for file in files:
    # append the contents of each CSV file to the dataframe
    df = df.append(pd.read_csv(filepath_or_buffer=file, names=col_names, engine='python'))
    print('Appending file : {file}'.format(file=file))

pd.set_option('display.max_colwidth', -1)
print(df)

There are 130 files in the 6.2 GB worth of CSV files.

asked Jul 25 '17 by afagarap

People also ask

How would you deal with a dataset that was too large to load into memory?

Money-costing solution: buy a machine with a more powerful CPU and enough RAM to hold the entire dataset, or rent cloud instances and set up a cluster to spread the workload.

How can I use keras with datasets that don't fit in memory?

Note: as the dataset is too large to fit in memory, it has to be loaded from the hard disk into memory in batches. To do so, we create a custom generator that reads the dataset from disk batch by batch.
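
A minimal sketch of such a generator, using keras.utils.Sequence, is given below. It assumes a single headerless training CSV (train.csv is a placeholder name), the col_names list from the question above, a row count obtained beforehand, and that the non-label columns are already numeric:

import keras
import numpy as np
import pandas as pd

class CSVBatchGenerator(keras.utils.Sequence):
    ''' Loads one batch of rows from disk at a time instead of the whole file. '''
    def __init__(self, csv_path, col_names, n_rows, batch_size=128):
        self.csv_path = csv_path
        self.col_names = col_names
        self.n_rows = n_rows              # total number of rows, counted beforehand
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self.n_rows / float(self.batch_size)))

    def __getitem__(self, idx):
        # read only the rows that belong to batch <idx> from disk
        batch = pd.read_csv(self.csv_path, names=self.col_names,
                            skiprows=idx * self.batch_size, nrows=self.batch_size)
        x = batch.drop('label', axis=1).values   # assumes numeric feature columns
        y = batch['label'].values
        return x, y

# model.fit_generator(CSVBatchGenerator('train.csv', col_names, n_rows=1000000))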

Which approach is used to train systems on huge datasets that cannot fit in one machine's main memory?

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning).


2 Answers

For large datasets, and 6.2 GB already counts as large, reading all the data in at once might not be the best idea. Since you are going to train your network batch by batch anyway, it is sufficient to load only the data needed for the batch that will be used next.

The TensorFlow documentation provides a good overview of how to implement a data reading pipeline. According to the linked documentation, the stages are (a minimal sketch of these stages follows the list):

  1. The list of filenames
  2. Optional filename shuffling
  3. Optional epoch limit
  4. Filename queue
  5. A Reader for the file format
  6. A decoder for a record read by the reader
  7. Optional preprocessing
  8. Example queue
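
Not the exact code from the documentation, but a condensed sketch of those eight stages using the TF 1.x queue API is shown below. It reuses the files list from the question; the all-float record_defaults is an assumption and would have to match the real column types (several columns in the question are strings), and the label is assumed to be the last decoded column:

import tensorflow as tf

NUM_COLS = 24        # number of columns in the question's CSV files
BATCH_SIZE = 128

# stages 1-4: list of filenames, shuffling, epoch limit, filename queue
filename_queue = tf.train.string_input_producer(files, shuffle=True, num_epochs=10)

# stage 5: a reader that returns one CSV line per call
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)

# stage 6: decode the line into column tensors
record_defaults = [[0.0]] * NUM_COLS             # assumption: all-numeric columns
columns = tf.decode_csv(line, record_defaults=record_defaults)

# stage 7: (optional) preprocessing, here just splitting features from the label
features = tf.stack(columns[:-1])
label = columns[-1]

# stage 8: example queue that hands out shuffled batches
feature_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=BATCH_SIZE, capacity=10000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    x, y = sess.run([feature_batch, label_batch])  # fetch one training batch
    coord.request_stop()
    coord.join(threads)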
answered Sep 28 '22 by Nyps


I second Nyps's answer; I just don't have enough reputation to add a comment yet. Additionally, it might be interesting to open Task Manager (or an equivalent tool) and watch your system's memory usage while the script runs. My guess is that the error appears exactly when your RAM fills up.

TensorFlow supports queues, which allow you to read only portions of the data at a time so that you do not exhaust your memory. Examples are in the documentation Nyps linked. TensorFlow has also recently added a new way to handle input pipelines: the Dataset API (see the TensorFlow Dataset docs).
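
For comparison, a minimal sketch of the same idea with the Dataset API mentioned above, under the same assumptions as before (the files list from the question and all-numeric columns with the label last):

import tensorflow as tf

record_defaults = [[0.0]] * 24                   # assumption: all-numeric columns

def parse_line(line):
    # turn one CSV line into a (features, label) pair
    columns = tf.decode_csv(line, record_defaults=record_defaults)
    return tf.stack(columns[:-1]), columns[-1]

dataset = (tf.data.TextLineDataset(files)        # stream lines from the CSV files
           .map(parse_line)
           .shuffle(buffer_size=10000)
           .batch(128)
           .repeat(10))                          # 10 epochs

iterator = dataset.make_one_shot_iterator()
feature_batch, label_batch = iterator.get_next() # feed these tensors to the model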

I would also suggest converting all your data to TensorFlow's TFRecord format: it saves space and can speed up data access more than a hundredfold compared to converting CSV files to tensors at training time.
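
A minimal sketch of such a conversion with the TF 1.x API is shown below, assuming float features and an integer label column; the function name and file paths are placeholders, and string columns would need to be encoded numerically first:

import pandas as pd
import tensorflow as tf

def write_tfrecords(csv_file, out_file, col_names):
    # convert one CSV file into a TFRecord file, one tf.train.Example per row
    df = pd.read_csv(csv_file, names=col_names)
    with tf.python_io.TFRecordWriter(out_file) as writer:
        for _, row in df.iterrows():
            example = tf.train.Example(features=tf.train.Features(feature={
                'features': tf.train.Feature(float_list=tf.train.FloatList(
                    value=row.drop('label').astype(float).values)),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(row['label'])])),
            }))
            writer.write(example.SerializeToString())

# write_tfrecords('train_part_01.csv', 'train_part_01.tfrecords', col_names)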

answered Sep 28 '22 by John Scolaro