I wanted to bring this up just because it's crazy weird; maybe Wes has some idea. The file is pretty regular: 1100 rows × ~3M columns, tab-separated, consisting solely of the integers 0, 1, and 2.
If I prepopulate a dataframe as below, it consumes ~26 GB of RAM, which is clearly not what I expected.
import pandas as pd

# read only the header line to get the ~3M column names
h = open("ms.txt")
header = h.readline().split("\t")
h.close()
rows = 1100
# pre-allocate an empty frame, intending integer columns
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
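For reference, one way to check what actually got allocated (a diagnostic sketch added for illustration, not part of the original code; it reuses the df built above):

# What dtype did the columns really get, and how much memory do they hold?
print(df.dtypes.value_counts())
# deep=True counts the Python objects stored in object columns; slicing keeps the check quick
print(df.iloc[:, :1000].memory_usage(deep=True).sum())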
System info:
Any ideas welcome.
With pandas, one way to process a large file is to read it in chunks of reasonable size: each chunk is read into memory and processed before the next chunk is read. The chunksize parameter of pd.read_csv specifies the number of lines per chunk.
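A minimal sketch of that pattern, assuming the tab-separated "ms.txt" from the question; the per-chunk work here is just a running sum as a placeholder:

import pandas as pd

# read the file 100 rows at a time; each chunk is an ordinary DataFrame
total = 0
for chunk in pd.read_csv("ms.txt", sep="\t", chunksize=100):
    total += chunk.sum().sum()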
Another way to cut memory is to downcast the dtypes: convert int64 values to int8 and float64 to float32 (there is no float8 dtype). For data that only contains 0, 1, and 2, this alone reduces memory usage by a factor of eight.
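A short sketch of that downcast, assuming df already holds the parsed values as int64:

import numpy as np

# 0, 1 and 2 fit comfortably in a single signed byte
df_small = df.astype(np.int8)   # 1 byte per value instead of 8
# float64 columns, if present, can be shrunk the same way:
# df_small = df.astype(np.float32)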
Trying your code on a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in your resulting dataframe.
import pandas as pd

header = ['a', 'b', 'c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
df.dtypes
a object
b object
c object
dtype: object
This is because even though you tell the pd.DataFrame constructor (and likewise pd.read_csv) that the columns should be dtype=int, it cannot override the dtype that is ultimately determined by the data in the column. Pandas is tightly coupled to numpy, so the column dtypes end up being numpy dtypes.
The problem is that there is no data in your created dataframe, so the values default to np.nan, which does not fit in an integer. Numpy therefore falls back to dtype=object.
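A minimal illustration of that point:

import numpy as np

np.array([0, 1, 2]).dtype        # an integer dtype, e.g. dtype('int64')
np.array([0, 1, np.nan]).dtype   # dtype('float64') -- NaN cannot be stored in an integer array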
Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
df = pd.DataFrame(columns=header, index=range(rows), dtype=float)
This works just fine, since np.nan can live in a float. This produces:
a float64
b float64
c float64
dtype: object
This should take less memory.
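A small, self-contained comparison of the overhead (illustrative sizes, not from the original post):

import numpy as np
import pandas as pd

n = 1_000_000
as_object = pd.Series(range(n), dtype=object)      # an 8-byte pointer per cell, each pointing at a boxed Python int
as_float  = pd.Series(np.zeros(n))                 # 8 bytes per cell, no boxing
as_int8   = pd.Series(np.zeros(n, dtype=np.int8))  # 1 byte per cell -- plenty for values 0/1/2

print(as_object.memory_usage(deep=True))   # tens of MB once the boxed ints are counted
print(as_float.memory_usage())             # ~8 MB
print(as_int8.memory_usage())              # ~1 MB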
See this related post for details on dtype: Pandas read_csv low_memory and dtype options
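Building on that post, one possible approach for the original file (a sketch, assuming "ms.txt" really contains nothing but 0, 1 and 2): reading it with an explicit one-byte dtype keeps the parsed frame at roughly 1100 × 3M ≈ 3.3 GB instead of 8 bytes per cell.

import numpy as np
import pandas as pd

# a single dtype passed to read_csv is applied to every column
df = pd.read_csv("ms.txt", sep="\t", dtype=np.int8)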