Trouble with PostgreSQL loading a large csv file into a table

Tags: csv, postgresql

On my setup, PostgreSQL 9.2.2 seems to error out when trying to load a large csv file into a table.

The size of the csv file is ~9GB

Here's the SQL statement I'm using to do the bulk load:

copy chunksBase (chunkId, Id, chunk, chunkType) from 'path-to-csv.csv' delimiters ',' csv

Here's the error I get after a few minutes:

pg.ProgrammingError: ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073723635 bytes by 65536 more bytes.
CONTEXT:  COPY chunksbase, line 47680536

It looks like the buffer can't be enlarged beyond exactly 1GB, which makes me think this could be a postgresql.conf issue.

Here are the uncommented lines in postgresql.conf:

bash-3.2# cat postgresql.conf | perl -pe 's/^[ \t]*//' | grep -v '^#' | sed '/^$/d'
log_timezone = 'US/Central'
datestyle = 'iso, mdy'
timezone = 'US/Central'
lc_messages = 'en_US.UTF-8'         # locale for system error message
lc_monetary = 'en_US.UTF-8'         # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'          # locale for number formatting
lc_time = 'en_US.UTF-8'             # locale for time formatting
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 50 # pgtune wizard 2012-12-02
maintenance_work_mem = 768MB # pgtune wizard 2012-12-02
constraint_exclusion = on # pgtune wizard 2012-12-02
checkpoint_completion_target = 0.9 # pgtune wizard 2012-12-02
effective_cache_size = 9GB # pgtune wizard 2012-12-02
work_mem = 72MB # pgtune wizard 2012-12-02
wal_buffers = 8MB # pgtune wizard 2012-12-02
checkpoint_segments = 16 # pgtune wizard 2012-12-02
shared_buffers = 3GB # pgtune wizard 2012-12-02
max_connections = 80 # pgtune wizard 2012-12-02
bash-3.2# 

Nothing in there explicitly sets a buffer to 1GB.

What's going on here? Even if the solution is to increase a buffer in postgresql.conf, why does postgres appear to try to load the entire csv file into RAM on a single copy call? Loading large csv files must be a common task, and I can't be the first person to run into this problem, so I would expect postgres to chunk the bulk load internally so that the buffer limit is never reached in the first place.

As a workaround, I'm splitting the csv into smaller files and calling copy on each one. This seems to work fine, but it's not a particularly satisfying solution, because now I have to maintain split versions of every large csv that I want to load into postgres. There has to be a more proper way to bulk load a large csv file into postgres.
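For what it's worth, here's a minimal sketch of that kind of batched load, assuming the psycopg2 driver (the errors above come from the pg module, so the connection code would need adapting), with the table and column names taken from the copy statement above and a placeholder connection string and file path. Rather than writing split files to disk, it streams fixed-size batches of lines through COPY ... FROM STDIN:

import io
import psycopg2  # assumption: psycopg2 rather than the pg module used above

BATCH_ROWS = 1000000  # rows per COPY call

COPY_SQL = ("COPY chunksBase (chunkId, Id, chunk, chunkType) "
            "FROM STDIN WITH (FORMAT csv, DELIMITER ',')")

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

# Caveat: splitting on raw newlines assumes no quoted csv field spans multiple lines.
with open("path-to-csv.csv", "r", encoding="utf-8") as f, conn, conn.cursor() as cur:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) >= BATCH_ROWS:
            cur.copy_expert(COPY_SQL, io.StringIO("".join(batch)))
            del batch[:]
    if batch:
        # flush the final partial batch
        cur.copy_expert(COPY_SQL, io.StringIO("".join(batch)))

Each copy call then fails independently, so a bad batch surfaces a much more specific error than a single copy over the whole 9GB file.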

EDIT1: I'm in the process of verifying that the csv file isn't malformed, by loading each of the split csv files into postgres. If they all load, then the issue is unlikely to be a malformed csv. I've already found a few issues; I'm not sure yet whether they're what's causing the string buffer error on the large file.

Asked Dec 16 '12 by Clayton Stanley

1 Answer

It turned out to be a malformed csv file.

I split the large csv into smaller chunks (each with 1 million rows) and started loading each one into postgres.

I started getting more informative errors:

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY chunksbase, line 15320779

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0xe9 0xae 0x22
CONTEXT:  COPY chunksbase, line 369513

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0x80
CONTEXT:  COPY chunksbase, line 16602

There were a total of 5 rows with invalid utf8 byte sequences, out of a few hundred million. After removing those rows, the large 9GB csv loaded just fine.
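(Aside, not from the original answer: one way to find those rows up front is to scan the raw bytes line by line and flag anything that contains a NUL byte, which postgres rejects even though it is technically valid UTF-8, or that fails to decode as UTF-8. The file path below is a placeholder.)

# Sketch: report line numbers of rows that postgres's UTF8 encoding check would reject.
with open("path-to-csv.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if b"\x00" in raw:
            print("line %d: contains a NUL (0x00) byte" % lineno)
            continue
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("line %d: %s" % (lineno, exc))

Rows flagged this way can be dropped or re-encoded before running copy on the full file.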

It would have been nice to get the invalid byte sequence errors when loading the large file initially. But at least they appeared once I started isolating the problem.

Note that the line number in the error from the initial load of the large file had no relation to the encoding errors found when loading the smaller csv subset files. That initial line number was simply the point in the file where exactly 1GB of data had been read, so it was tied to the 1GB buffer allocation error, which in turn had nothing to do with the real problem.

Answered by Clayton Stanley