 

Memory leak from pyarrow?

While parsing a large file, I need to write to a large number of parquet files successively in a loop. However, the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (since nothing should be kept in memory). This makes it tricky to scale.

I've added a minimal reproducible example which creates 10,000 parquet files and appends to them in a loop.

import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
                        pa.field('test', pa.string()),
                    ])

# Raise the open-file limit: all 10,000 ParquetWriters stay open at once
resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

# Open one writer per target file up front
writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema) for i in range(number_files)]

# Append a fresh chunk of rows to every open file on each iteration
for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
                            pd.DataFrame({'test': [id_generator() for _ in range(number_rows_increment)]}),
                            preserve_index=False,
                            schema=schema,
                            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
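
To quantify the growth, the print(i) at the end of each outer iteration can be replaced with something along these lines (a minimal sketch; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS, and pa.total_allocated_bytes() only covers Arrow's own memory pool):

# Log process peak RSS and Arrow's pool usage once per outer iteration
usage = resource.getrusage(resource.RUSAGE_SELF)
print('iteration %d: max RSS %d kB, arrow pool %d bytes'
      % (i, usage.ru_maxrss, pa.total_allocated_bytes()))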

Would anyone have any idea what causes this leak and how to prevent it?

Abel Riboulot asked Oct 16 '22

1 Answer

We aren't sure what is wrong, but some other users have reported as-yet-undiagnosed memory leaks. I added your example to one of the tracking JIRA issues: https://issues.apache.org/jira/browse/ARROW-3324
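
One way to narrow a report like this down is to check whether the growth shows up in Arrow's own memory pool or only in the process RSS; if the pool stays flat while RSS keeps growing, the memory is more likely being held by the allocator than by live Arrow objects. A minimal sketch of that check (the jemalloc call is optional and only works if your pyarrow build includes jemalloc; it is an experiment, not a confirmed fix for this issue):

pool = pa.default_memory_pool()
print('currently allocated by Arrow:', pool.bytes_allocated(), 'bytes')
print('Arrow pool high-water mark:', pool.max_memory(), 'bytes')

# If built with jemalloc, asking it to release dirty pages to the OS
# immediately can reduce apparent RSS growth.
try:
    pa.jemalloc_set_decay_ms(0)
except Exception:
    pass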

Wes McKinney answered Oct 21 '22