How to read data into a Python DataFrame without concatenating?

Tags: python, pandas, dataframe, chunks, csv

I want to read the file f (file size: 85 GB) in chunks into a DataFrame. The following code was suggested:

chunksize = 5
TextFileReader = pd.read_csv(f, chunksize=chunksize)

However, this code gives me a TextFileReader, not a DataFrame. Also, I don't want to concatenate these chunks to convert the TextFileReader into a DataFrame, because of the memory limit. Please advise.

247

asked Sep 08 '16 08:09

Geet

2 Answers

Because you are processing an 85 GB CSV file, reading all of it in chunks and then combining those chunks into one DataFrame will almost certainly hit the memory limit. Try a different approach: reduce the data as you read it. For example, if the dataset has 600 columns and you only need 50 of them, read just those 50 columns; this saves a lot of memory. Process the rows as you read them. If you need to filter the data first, use a generator function. yield makes a function a generator function, which means it does no work until you start looping over it.
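As a rough sketch of the two ideas above (read only the columns you need, and process each chunk as it arrives), something like the following might work. usecols and chunksize are existing read_csv parameters; the column names, the sample data, and the tiny chunk size are made up for illustration, and the in-memory StringIO stands in for the real 85 GB file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large file; in practice pass the real path.
csv_data = io.StringIO(
    "id,price,qty,notes\n"
    "1,10.0,2,a\n"
    "2,20.0,1,b\n"
    "3,5.0,4,c\n"
    "4,8.0,3,d\n"
    "5,12.5,2,e\n"
)

total_revenue = 0.0
rows_seen = 0

# usecols restricts parsing to the named columns; chunksize makes read_csv
# return an iterator of DataFrames instead of one large DataFrame.
reader = pd.read_csv(csv_data, usecols=["price", "qty"], chunksize=2)
for chunk in reader:
    # Each chunk is an ordinary DataFrame holding at most 2 rows here.
    total_revenue += float((chunk["price"] * chunk["qty"]).sum())
    rows_seen += len(chunk)

print(rows_seen, total_revenue)
```

Only the running totals stay in memory between iterations, so no concatenation is needed.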

For more information on generator functions, see the question "Reading a huge .csv file".

For efficient filtering, refer to: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
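A generator-based filter could look roughly like this. The function name filtered_chunks, its parameters, and the sample data are hypothetical and chosen only for illustration; the point is that nothing is read until the for loop starts pulling chunks.

```python
import io
import pandas as pd

def filtered_chunks(source, column, threshold, chunksize):
    """Yield one filtered DataFrame per chunk; no work happens until iteration."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept = chunk[chunk[column] > threshold]
        if not kept.empty:
            yield kept

# Hypothetical sample data standing in for the real file.
csv_data = io.StringIO("name,score\na,150\nb,200\nc,190\nd,100\ne,181\n")

matches = 0
for part in filtered_chunks(csv_data, "score", 180, chunksize=2):
    # Handle each filtered piece here (write it out, aggregate it, ...).
    matches += len(part)

print(matches)
```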

For processing a smaller dataset:

Approach 1: Convert the reader object to a DataFrame directly:

full_data = pd.concat(TextFileReader, ignore_index=True)

Passing ignore_index=True makes concat build a fresh 0..n-1 index for the result, so it cannot end up with duplicate index labels.

Approach 2: Iterate over the reader, or call get_chunk, to get one DataFrame at a time.

When you pass chunksize to read_csv, the return value is an iterable object of type TextFileReader.

df = TextFileReader.get_chunk(3)

for chunk in TextFileReader:
    print(chunk)

Source : http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

df = TextFileReader.get_chunk(1)

This reads one chunk as a DataFrame. get_chunk already returns a DataFrame, so wrapping the result in pd.DataFrame is not needed.
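To illustrate how get_chunk and plain iteration interact on the same reader, here is a small sketch with made-up data (a single column x with ten rows). Each call continues from where the previous one stopped.

```python
import io
import pandas as pd

# Hypothetical ten-row file with one column named x.
csv_data = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)) + "\n")
reader = pd.read_csv(csv_data, chunksize=4)

first = reader.get_chunk(3)   # an explicit size overrides chunksize for this call
second = reader.get_chunk()   # no size given, so chunksize=4 is used
rest = list(reader)           # iteration picks up the remaining rows

print(type(first).__name__, len(first), len(second), sum(len(c) for c in rest))
```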

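Checking the total number of chunks in a TextFileReader (note that this loop consumes the reader, so create a new one with read_csv before reading the data again):

chunk_count = 0
for chunk in TextFileReader:
    # some code here, if needed
    chunk_count += 1

print("Total number of chunks is", chunk_count)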

If the file is large, I would not recommend the second approach with a small chunk size. For example, if the CSV file consists of 100,000 records, then chunksize=5 will create 20,000 chunks.
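The arithmetic behind that figure is a ceiling division of the row count by the chunk size. A tiny sketch (the helper name expected_chunks is made up for illustration):

```python
import math

def expected_chunks(n_rows, chunksize):
    """Number of chunks needed to cover n_rows; the last chunk may be partial."""
    return math.ceil(n_rows / chunksize)

print(expected_chunks(100_000, 5))        # 20000
print(expected_chunks(100_000, 100_000))  # 1
print(expected_chunks(100_001, 100_000))  # 2
```

A larger chunksize means fewer loop iterations, at the cost of more memory per chunk.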

answered Sep 18 '22 06:09

Sayali Sonawane

If you want a single DataFrame as the result of working with chunks, you can do it this way. Initialize an empty DataFrame before you start iterating over the chunks. After filtering each chunk, concatenate the result onto that DataFrame. When the loop finishes, you have a DataFrame containing only the rows that satisfy the condition inside the for loop.

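import pandas as pd

file = 'results.csv'
df_empty = pd.DataFrame()
with open(file) as fl:
    chunk_iter = pd.read_csv(fl, chunksize=100000)
    for chunk in chunk_iter:
        # keep only the rows that satisfy the condition
        chunk = chunk[chunk['column1'] > 180]
        # append the filtered rows to the accumulated result
        df_empty = pd.concat([df_empty, chunk])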
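A possible variation on the same idea, not part of the original answer: collect the filtered chunks in a list and call pd.concat once after the loop, which avoids re-copying the accumulated rows on every iteration. The sample data below is made up; only the column1 > 180 condition is taken from the code above. The filtered result still has to fit in memory.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for results.csv.
csv_data = io.StringIO("column1,column2\n100,a\n200,b\n181,c\n50,d\n300,e\n")

parts = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Keep only the rows that satisfy the condition.
    parts.append(chunk[chunk["column1"] > 180])

# pd.concat raises ValueError on an empty list, so guard the no-chunks case.
result = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
print(result)
```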

answered Sep 22 '22 06:09

julliet
