I wanted to bring this up just because it's crazy weird; maybe Wes has some idea. The file is pretty regular: 1100 rows × ~3M columns, tab-separated, consisting solely of the integers 0, 1, and 2.
If I prepopulate a dataframe as below, it consumes ~26 GB of RAM, which is clearly not what I expected.
import pandas as pd

# read only the header line to get the ~3M column names
h = open("ms.txt")
header = h.readline().split("\t")
h.close()
rows = 1100
# pre-allocate an empty frame, intending integer columns
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
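For reference, one way to check what actually got allocated (a diagnostic sketch added for illustration, not part of the original code; it reuses the df built above):

# What dtype did the columns really get, and how much memory do they hold?
print(df.dtypes.value_counts())
# deep=True counts the Python objects stored in object columns; slicing keeps the check quick
print(df.iloc[:, :1000].memory_usage(deep=True).sum())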
System info:
Any ideas welcome.
With pandas, one way to process a large file is to read it in chunks of reasonable size: each chunk is read into memory and processed before the next chunk is read. The chunksize parameter of pd.read_csv specifies the number of lines per chunk.
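A minimal sketch of that pattern, assuming the tab-separated "ms.txt" from the question; the per-chunk work here is just a running sum as a placeholder:

import pandas as pd

# read the file 100 rows at a time; each chunk is an ordinary DataFrame
total = 0
for chunk in pd.read_csv("ms.txt", sep="\t", chunksize=100):
    total += chunk.sum().sum()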
Another way to cut memory is to downcast the dtypes: convert int64 values to int8 and float64 to float32 (there is no float8 dtype). For data that only contains 0, 1, and 2, this alone reduces memory usage by a factor of eight.
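A short sketch of that downcast, assuming df already holds the parsed values as int64:

import numpy as np

# 0, 1 and 2 fit comfortably in a single signed byte
df_small = df.astype(np.int8)   # 1 byte per value instead of 8
# float64 columns, if present, can be shrunk the same way:
# df_small = df.astype(np.float32)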
Trying your code on a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in your resulting dataframe.
import pandas as pd

header = ['a', 'b', 'c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
df.dtypes
a object
b object
c object
dtype: object
This is because even though you tell the pd.DataFrame constructor (and likewise pd.read_csv) that the columns should be dtype=int, it cannot override the dtype that is ultimately determined by the data in the column. Pandas is tightly coupled to numpy, so the column dtypes end up being numpy dtypes.
The problem is that there is no data in your created dataframe, so the values default to np.nan, which does not fit in an integer. Numpy therefore falls back to dtype=object.
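A minimal illustration of that point:

import numpy as np

np.array([0, 1, 2]).dtype        # an integer dtype, e.g. dtype('int64')
np.array([0, 1, np.nan]).dtype   # dtype('float64') -- NaN cannot be stored in an integer array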
Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
df = pd.DataFrame(columns=header, index=range(rows), dtype=float)
This works just fine, since np.nan can live in a float. This produces:
a float64
b float64
c float64
dtype: object
This should take less memory.
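A small, self-contained comparison of the overhead (illustrative sizes, not from the original post):

import numpy as np
import pandas as pd

n = 1_000_000
as_object = pd.Series(range(n), dtype=object)      # an 8-byte pointer per cell, each pointing at a boxed Python int
as_float  = pd.Series(np.zeros(n))                 # 8 bytes per cell, no boxing
as_int8   = pd.Series(np.zeros(n, dtype=np.int8))  # 1 byte per cell -- plenty for values 0/1/2

print(as_object.memory_usage(deep=True))   # tens of MB once the boxed ints are counted
print(as_float.memory_usage())             # ~8 MB
print(as_int8.memory_usage())              # ~1 MB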
See this related post for details on dtype: Pandas read_csv low_memory and dtype options
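Building on that post, one possible approach for the original file (a sketch, assuming "ms.txt" really contains nothing but 0, 1 and 2): reading it with an explicit one-byte dtype keeps the parsed frame at roughly 1100 × 3M ≈ 3.3 GB instead of 8 bytes per cell.

import numpy as np
import pandas as pd

# a single dtype passed to read_csv is applied to every column
df = pd.read_csv("ms.txt", sep="\t", dtype=np.int8)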