Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I read a large csv file with pandas?

People also ask

How do I import a large CSV file into Pandas?

To read large CSV files in chunks in Pandas, use the read_csv(~) method and specify the chunksize parameter. This is particularly useful if you are facing a MemoryError when trying to read in the whole DataFrame at once.

How do I read a large csv file in Python?

read_csv(chunksize) One way to process large files is to read the entries in chunks of reasonable size, which are read into the memory and are processed before reading the next chunk. We can use the chunk size parameter to specify the size of the chunk, which is the number of lines.

How do I open a CSV file that is too large?

So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.

How many GB can Pandas read?

The upper limit for pandas Dataframe was 100 GB of free disk space on the machine. When your Mac needs memory, it will push something that isn't currently being used into a swapfile for temporary storage. When it needs access again, it will read the data from the swap file and back into memory.


The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)


pandas >= 1.2

read_csv with chunksize returns a context manager, to be used like so:

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

See GH38225


Chunking shouldn't always be the first port of call for this problem.

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via pd.read_csv usecols parameter.

  2. Does your workflow require slicing, manipulating, exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas or via csv library as a last resort.


For large data l recommend you use the library "dask"
e.g:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

You can read more from the documentation here.

Another great alternative would be to use modin because all the functionality is identical to pandas yet it leverages on distributed dataframe libraries such as dask.

From my projects another superior library is datatables.

# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")

I proceeded like this:

chunks=pd.read_table('aphro.csv',chunksize=1000000,sep=';',\
       names=['lat','long','rf','date','slno'],index_col='slno',\
       header=None,parse_dates=['date'])

df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)

You can read in the data as chunks and save each chunk as pickle.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read in the pickles and append each pickle to your desired dataframe.

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files=[]
for name in glob.glob(pickle_path + "/data_*.pkl"):
   data_p_files.append(name)


df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]),ignore_index=True)

I want to make a more comprehensive answer based off of the most of the potential solutions that are already provided. I also want to point out one more potential aid that may help reading process.

Option 1: dtypes

"dtypes" is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. Pandas, on default, try to infer dtypes of the data.

Referring to data structures, every data stored, a memory allocation takes place. At a basic level refer to the values below (The table below illustrates values for C programming language):

The maximum value of UNSIGNED CHAR = 255                                    
The minimum value of SHORT INT = -32768                                     
The maximum value of SHORT INT = 32767                                      
The minimum value of INT = -2147483648                                      
The maximum value of INT = 2147483647                                       
The minimum value of CHAR = -128                                            
The maximum value of CHAR = 127                                             
The minimum value of LONG = -9223372036854775808                            
The maximum value of LONG = 9223372036854775807

Refer to this page to see the matching between NumPy and C types.

Let's say you have an array of integers of digits. You can both theoretically and practically assign, say array of 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set dtype option on read_csv. You do not want to store the array items as long integer where actually you can fit them with 8-bit integer (np.int8 or np.uint8).

Observe the following dtype map.

Source: https://pbpython.com/pandas_dtypes.html

You can pass dtype parameter as a parameter on pandas methods as dict on read like {column: type}.

import numpy as np
import pandas as pd

df_dtype = {
        "column_1": int,
        "column_2": str,
        "column_3": np.int16,
        "column_4": np.uint8,
        ...
        "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)

Option 2: Read by Chunks

Reading the data in chunks allows you to access a part of the data in-memory, and you can apply preprocessing on your data and preserve the processed data rather than raw data. It'd be much better if you combine this option with the first one, dtypes.

I want to point out the pandas cookbook sections for that process, where you can find it here. Note those two sections there;

  • Reading a csv chunk-by-chunk
  • Reading only certain rows of a csv chunk-by-chunk

Option 3: Dask

Dask is a framework that is defined in Dask's website as:

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

It was born to cover the necessary parts where pandas cannot reach. Dask is a powerful framework that allows you much more data access by processing it in a distributed way.

You can use dask to preprocess your data as a whole, Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations before it is explicitly pushed by compute and/or persist (see the answer here for the difference).

Other Aids (Ideas)

  • ETL flow designed for the data. Keeping only what is needed from the raw data.
    • First, apply ETL to whole data with frameworks like Dask or PySpark, and export the processed data.
    • Then see if the processed data can be fit in the memory as a whole.
  • Consider increasing your RAM.
  • Consider working with that data on a cloud platform.