I have some very large txt files (about 1.5 GB) which I want to load into Python as an array. The problem is that this data uses a comma as the decimal separator. For smaller files I came up with this solution:
import numpy as np

# read everything as strings first, then clean up and convert
data = np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')   # use '.' as the decimal separator
data = np.char.replace(data, '\'', '')   # strip stray quote characters
data = np.char.replace(data, 'b', '').astype(np.float64)  # strip bytes prefix, convert to float
But for the large files Python runs into a MemoryError. Is there a more memory-efficient way to load this data?
The problem with np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1) is that it stores Python string objects instead of float64 values, which is very memory-inefficient. You can use pandas read_table (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table) to read your file and set decimal=',' to change the default behaviour. This lets pandas parse the comma-decimal strings into floats while reading, with no post-processing. After loading the DataFrame, use df.values to get a NumPy array, as sketched below. If it is still too large for your memory, read it in chunks (http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking).
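A minimal sketch of that approach, assuming a tab-separated file with a single header row (the file name data.txt is a placeholder):

import pandas as pd

# let pandas treat ',' as the decimal separator while parsing
df = pd.read_table('data.txt', sep='\t', decimal=',', header=0)
data = df.values   # plain float64 NumPy array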
If you still run out of memory, try np.float32, which halves the memory footprint compared to float64.
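A sketch combining both ideas, chunked reading plus a float32 result array; the chunk size and file name are assumptions:

import numpy as np
import pandas as pd

chunks = []
for chunk in pd.read_table('data.txt', sep='\t', decimal=',', header=0,
                           chunksize=100000):
    # convert each chunk to float32 before keeping it, so peak memory stays low
    chunks.append(chunk.values.astype(np.float32))

data = np.vstack(chunks)   # one float32 array, half the size of float64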