
Python: Fast and efficient way of writing large text file

I have a speed/efficiency-related question about Python:

I need to write a large number of very large R dataframe-ish files, about 0.5-2 GB in size each. Each is basically a large tab-separated table, where each line can contain floats, integers and strings.

Normally, I would just put all my data in a numpy array and use np.savetxt to save it, but since there are different data types it can't really all go into one array.

Therefore I have resorted to simply assembling the lines as strings manually, but this is a tad slow. So far I'm doing:

1) Assemble each line as a string
2) Concatenate all lines into a single huge string
3) Write the string to file
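
A minimal sketch of this approach (the data and filename are just for illustration):

    rows = [(1, 0.5, "a"), (2, 1.5, "b")]  # illustrative mixed-type rows

    # 1) assemble each line as a string
    lines = ["\t".join(str(field) for field in row) for row in rows]

    # 2) concatenate all lines into one huge string
    big_string = "\n".join(lines)

    # 3) write the string to file in one go
    with open("output.tsv", "w") as f:
        f.write(big_string)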

I have several problems with this:

1) The large number of string concatenations ends up taking a lot of time
2) I run out of RAM trying to keep all the strings in memory
3) ...which in turn leads to more separate file.write calls, which are very slow as well

So my question is: What is a good routine for this kind of problem? One that balances speed against memory consumption for the most efficient string concatenation and writing to disk.

... or maybe this strategy is simply bad and I should do something completely different?

Thanks in advance!

asked Feb 25 '14 by Misconstruction

2 Answers

Seems like pandas might be a good tool for this problem. It's pretty easy to get started with, and it deals well with most ways you might need to get data into Python. Pandas handles mixed data (floats, ints, strings) and can usually detect the types on its own.

Once you have an (R-like) data frame in pandas, it's pretty straightforward to output the frame to csv.

DataFrame.to_csv(path_or_buf, sep='\t')

There are a bunch of other configuration options you can set to make your tab-separated file just right.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
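
For example, a minimal sketch (the column names and data here are made up):

    import pandas as pd

    # Illustrative frame mixing ints, floats and strings in one table
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "value": [0.5, 1.25, 2.0],
        "label": ["a", "b", "c"],
    })

    # Write a tab-separated file; pandas does the row formatting and
    # writing for you, so you never have to assemble the whole table
    # as one giant string yourself
    df.to_csv("output.tsv", sep="\t", index=False)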

answered Sep 22 '22 by Joan Smith


Unless you are running into a performance issue, you can probably write to the file line by line. Python internally uses buffering and will likely give you a nice compromise between performance and memory efficiency.

Python buffering is different from OS buffering, and you can specify how you want things buffered by setting the buffering argument to open.
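
A minimal sketch of that approach (the buffer size and data are arbitrary example choices):

    rows = [(1, 0.5, "a"), (2, 1.5, "b")]  # illustrative mixed-type rows

    # buffering is in bytes; 1 MiB here is an arbitrary example value.
    # Python batches the many small writes and flushes the buffer to
    # disk in large chunks, so the per-line f.write calls stay cheap.
    with open("output.tsv", "w", buffering=1024 * 1024) as f:
        for row in rows:
            f.write("\t".join(str(field) for field in row) + "\n")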

answered Sep 24 '22 by Clarus