 

Memory leak from pyarrow?

While parsing a large file, I need to write to a large number of parquet files successively in a loop. However, the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (since nothing should be kept in memory). This makes it tricky to scale.

I've added a minimal reproducible example which creates 10,000 parquet files and appends to them in a loop.

import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
                        pa.field('test', pa.string()),
                    ])

# Raise the open-file limit: all 10,000 ParquetWriters stay open at once
resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

# Open one writer per target file up front
writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema) for i in range(number_files)]

# Append a fresh chunk of rows to every open file on each iteration
for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
                            pd.DataFrame({'test': [id_generator() for _ in range(number_rows_increment)]}),
                            preserve_index=False,
                            schema=schema,
                            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
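
To quantify the growth, the print(i) at the end of each outer iteration can be replaced with something along these lines (a minimal sketch; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS, and pa.total_allocated_bytes() only covers Arrow's own memory pool):

# Log process peak RSS and Arrow's pool usage once per outer iteration
usage = resource.getrusage(resource.RUSAGE_SELF)
print('iteration %d: max RSS %d kB, arrow pool %d bytes'
      % (i, usage.ru_maxrss, pa.total_allocated_bytes()))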

Would anyone have any idea what causes this leak and how to prevent it?

Abel Riboulot asked Oct 16 '22

1 Answer

We aren't sure what is wrong, but some other users have reported as-yet-undiagnosed memory leaks. I added your example to one of the tracking JIRA issues: https://issues.apache.org/jira/browse/ARROW-3324
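
One way to narrow a report like this down is to check whether the growth shows up in Arrow's own memory pool or only in the process RSS; if the pool stays flat while RSS keeps growing, the memory is more likely being held by the allocator than by live Arrow objects. A minimal sketch of that check (the jemalloc call is optional and only works if your pyarrow build includes jemalloc; it is an experiment, not a confirmed fix for this issue):

pool = pa.default_memory_pool()
print('currently allocated by Arrow:', pool.bytes_allocated(), 'bytes')
print('Arrow pool high-water mark:', pool.max_memory(), 'bytes')

# If built with jemalloc, asking it to release dirty pages to the OS
# immediately can reduce apparent RSS growth.
try:
    pa.jemalloc_set_decay_ms(0)
except Exception:
    pass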

Wes McKinney answered Oct 21 '22