
Python: load data with comma as decimal separator

Tags:

python

I have some very large txt files (about 1.5 GB) which I want to load into Python as an array. The problem is that this data uses a comma as the decimal separator. For smaller files I came up with this solution:

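import numpy as np

# Load every field as a string, then swap the decimal comma for a point
data = np.loadtxt(file, dtype=str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')
# Strip the quote and b left over from byte-string representations
data = np.char.replace(data, '\'', '')
data = np.char.replace(data, 'b', '').astype(np.float64)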

But for the large files Python runs into a MemoryError. Is there a more memory-efficient way to load this data?

Greg.P asked Oct 30 '22 14:10



1 Answer

The problem with calling np.loadtxt with a string dtype (as in your snippet) is that it holds every value as a string instead of a float64, which is very memory inefficient. You can use pandas read_table

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table

to read your file, setting decimal=',' to change the default behaviour. This reads the values and converts them straight to floats. After loading the pandas DataFrame, use df.values to get a NumPy array. If it's still too large for your memory, read it in chunks

http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking
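A minimal sketch of both suggestions, assuming a tab-separated file with one header row as in the question. The in-memory sample and the helper name load_decimal_comma are made up for illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the large file: one header row,
# tab-separated columns, comma as the decimal separator.
sample = "x\ty\n1,5\t2,25\n3,0\t4,75\n10,125\t0,5\n"


def load_decimal_comma(source, chunksize=None):
    """Read tab-separated text with decimal commas into a float64 array."""
    if chunksize is None:
        # Whole file at once; read_table defaults to tab as the separator
        df = pd.read_table(source, decimal=",")
        return df.to_numpy(dtype=np.float64)
    # With chunksize set, read_table yields DataFrames of at most that many rows
    parts = [
        chunk.to_numpy(dtype=np.float64)
        for chunk in pd.read_table(source, decimal=",", chunksize=chunksize)
    ]
    return np.concatenate(parts)


whole = load_decimal_comma(io.StringIO(sample))
chunked = load_decimal_comma(io.StringIO(sample), chunksize=2)
```

Collecting the chunks and concatenating them still needs room for the full result, so chunking probably pays off most when each chunk is reduced, filtered, or written out before the next one is read.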

If that still doesn't help, try the np.float32 format, which halves the memory footprint compared to float64.
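One way this could look, as a sketch rather than the only option: convert each chunk to float32 as soon as it is read, so only one chunk at a time is held as float64. The sample data and chunk size are assumptions for illustration. Keep in mind that float32 carries only about 7 significant decimal digits.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the large file
sample = "x\ty\n1,5\t2,25\n3,0\t4,75\n10,125\t0,5\n"

parts = []
for chunk in pd.read_table(io.StringIO(sample), decimal=",", chunksize=2):
    # Downcast right away so the full-precision chunk can be freed
    parts.append(chunk.to_numpy(dtype=np.float32))

data32 = np.concatenate(parts)

# For comparison only: the same values stored as float64 take twice the bytes
data64 = data32.astype(np.float64)
```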

Dennis Sakva answered Nov 15 '22 07:11
