
Pandas read_csv on 6.5 GB file consumes more than 170GB RAM

I wanted to bring this up, just because it's crazy weird. Maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns, data are tab-separated, consisting solely of the integers 0, 1, and 2. Clearly this is not expected.

If I prepopulate a dataframe as below, it consumes ~26GB of RAM.

h = open("ms.txt")
header = h.readline().split("\t")
h.close()
rows=1100
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
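For scale, here is a rough way to measure the footprint on a much smaller stand-in header (the column names below are made up; the real file has ~3M columns), using pandas' own memory_usage:

import pandas as pd

# Stand-in header: the real file has ~3M columns, this just scales the test down.
header = ["col%d" % i for i in range(1000)]
rows = 1100

df = pd.DataFrame(columns=header, index=range(rows), dtype=int)

# deep=True also counts any boxed Python objects inside the columns
# (needs a pandas version where that keyword exists).
print(df.memory_usage(deep=True).sum() / 2.0**20)   # approximate size in MiB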

System info:

  • python 2.7.9
  • ipython 2.3.1
  • numpy 1.9.1
  • pandas 0.15.2

Any ideas welcome.

Asked Jan 29 '15 by Chris F.




1 Answer

The problem with your example

Trying your code at a small scale, I noticed that even though you set dtype=int, you actually end up with dtype=object in the resulting dataframe:

import pandas as pd

header = ['a', 'b', 'c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)

df.dtypes
a    object
b    object
c    object
dtype: object

This is because even though you tell the pd.DataFrame constructor (or pd.read_csv) that the columns should be dtype=int, it cannot override the dtype that is ultimately determined by the data in each column.

Pandas is tightly coupled to numpy and numpy dtypes.

The problem is that there is no data in the dataframe you created, so pandas fills every cell with np.NaN, and NaN cannot be stored in an integer array.

numpy therefore falls back to the object dtype, as the quick check below shows.
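As a quick sanity check (plain numpy, nothing pandas-specific): an integer array simply refuses to hold a NaN, and mixing NaN with integers pushes numpy to a wider dtype:

import numpy as np

arr = np.array([0, 1, 2], dtype=np.int64)
try:
    arr[0] = np.nan                  # NaN is a float with no integer representation
except ValueError as err:
    print(err)                       # "cannot convert float NaN to integer"

print(np.array([1, np.nan]).dtype)   # float64: numpy widens instead of keeping int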

The problem with the object dtype

Having the dtype set to object means a large overhead in memory consumption and allocation time compared to an integer or float dtype, because each cell holds a pointer to a separate boxed Python object rather than a raw numeric value (a rough comparison follows below).
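A rough comparison (the one-million-row size is arbitrary); with deep=True pandas also counts the boxed Python objects behind each pointer:

import numpy as np
import pandas as pd

n = 1000000
as_float = pd.Series(np.zeros(n))                 # float64: 8 bytes per value
as_object = pd.Series(np.zeros(n), dtype=object)  # pointer + boxed Python float per value

print(as_float.memory_usage(deep=True))    # roughly 8 MB
print(as_object.memory_usage(deep=True))   # several times larger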

A workaround for your example

df = pd.DataFrame(columns=header, index=range(rows), dtype=float)

This works just fine, since np.NaN can live in a float. This produces

a    float64
b    float64
c    float64
dtype: object

And it should take far less memory.
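For the original read_csv call itself, if the file really only contains the integers 0, 1 and 2, telling the parser the dtype up front should keep the frame close to the raw data size (roughly 1100 × 3M × 1 byte ≈ 3.3 GB for int8). This is just a sketch, assuming the first line of ms.txt is the header:

import numpy as np
import pandas as pd

# Parse every column as a 1-byte integer instead of letting pandas infer the dtype.
df = pd.read_csv("ms.txt", sep="\t", dtype=np.int8)

print(df.dtypes.iloc[0])                  # int8
print(df.memory_usage().sum() / 2.0**30)  # rough size in GiB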

More on dtypes

See this related post for details on dtype: Pandas read_csv low_memory and dtype options

Answered by firelynx