Trouble with PostgreSQL loading a large csv file into a table

Tags: csv, postgresql

On my setup, PostgreSQL 9.2.2 seems to error out when trying to load a large csv file into a table.

The size of the csv file is ~9GB

Here's the SQL statement I'm using to do the bulk load:

copy chunksBase (chunkId, Id, chunk, chunkType) from 'path-to-csv.csv' delimiters ',' csv

Here's the error I get after a few minutes:

pg.ProgrammingError: ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073723635 bytes by 65536 more bytes.
CONTEXT:  COPY chunksbase, line 47680536

It looks like the buffer can't be enlarged beyond exactly 1GB, which makes me think this could be a postgresql.conf issue.

Here are the uncommented lines in postgresql.conf:

bash-3.2# cat postgresql.conf | perl -pe 's/^[ \t]*//' | grep -v '^#' | sed '/^$/d'
log_timezone = 'US/Central'
datestyle = 'iso, mdy'
timezone = 'US/Central'
lc_messages = 'en_US.UTF-8'         # locale for system error message
lc_monetary = 'en_US.UTF-8'         # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'          # locale for number formatting
lc_time = 'en_US.UTF-8'             # locale for time formatting
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 50 # pgtune wizard 2012-12-02
maintenance_work_mem = 768MB # pgtune wizard 2012-12-02
constraint_exclusion = on # pgtune wizard 2012-12-02
checkpoint_completion_target = 0.9 # pgtune wizard 2012-12-02
effective_cache_size = 9GB # pgtune wizard 2012-12-02
work_mem = 72MB # pgtune wizard 2012-12-02
wal_buffers = 8MB # pgtune wizard 2012-12-02
checkpoint_segments = 16 # pgtune wizard 2012-12-02
shared_buffers = 3GB # pgtune wizard 2012-12-02
max_connections = 80 # pgtune wizard 2012-12-02
bash-3.2# 

Nothing in there explicitly sets a buffer to 1GB.

What's going on here? Even if the solution is to increase a buffer in postgresql.conf, why does postgres appear to try to load the entire csv file into RAM on a single copy call? Loading large csv files must be a common task, and I can't be the first person to run into this problem, so I would expect postgres to chunk the bulk load internally so that the buffer limit is never reached in the first place.

As a workaround, I'm splitting the csv into smaller files and calling copy on each one. This seems to work fine, but it's not a particularly satisfying solution, because now I have to maintain split versions of every large csv that I want to load into postgres. There has to be a more proper way to bulk load a large csv file into postgres.
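For what it's worth, here's a minimal sketch of that kind of batched load, assuming the psycopg2 driver (the errors above come from the pg module, so the connection code would need adapting), with the table and column names taken from the copy statement above and a placeholder connection string and file path. Rather than writing split files to disk, it streams fixed-size batches of lines through COPY ... FROM STDIN:

import io
import psycopg2  # assumption: psycopg2 rather than the pg module used above

BATCH_ROWS = 1000000  # rows per COPY call

COPY_SQL = ("COPY chunksBase (chunkId, Id, chunk, chunkType) "
            "FROM STDIN WITH (FORMAT csv, DELIMITER ',')")

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

# Caveat: splitting on raw newlines assumes no quoted csv field spans multiple lines.
with open("path-to-csv.csv", "r", encoding="utf-8") as f, conn, conn.cursor() as cur:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) >= BATCH_ROWS:
            cur.copy_expert(COPY_SQL, io.StringIO("".join(batch)))
            del batch[:]
    if batch:
        # flush the final partial batch
        cur.copy_expert(COPY_SQL, io.StringIO("".join(batch)))

Each copy call then fails independently, so a bad batch surfaces a much more specific error than a single copy over the whole 9GB file.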

EDIT1: I'm in the process of verifying that the csv file isn't malformed, by loading each of the split csv files into postgres. If they all load, then the issue is unlikely to be a malformed csv. I've already found a few issues; I'm not sure yet whether they're what's causing the string buffer error on the large file.

Asked Dec 16 '12 by Clayton Stanley

1 Answer

It turned out to be a malformed csv file.

I split the large csv into smaller chunks (each with 1 million rows) and started loading each one into postgres.

I started getting more informative errors:

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY chunksbase, line 15320779

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0xe9 0xae 0x22
CONTEXT:  COPY chunksbase, line 369513

pg.ProgrammingError: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0x80
CONTEXT:  COPY chunksbase, line 16602

There were a total of 5 rows with invalid utf8 byte sequences, out of a few hundred million. After removing those rows, the large 9GB csv loaded just fine.
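(Aside, not from the original answer: one way to find those rows up front is to scan the raw bytes line by line and flag anything that contains a NUL byte, which postgres rejects even though it is technically valid UTF-8, or that fails to decode as UTF-8. The file path below is a placeholder.)

# Sketch: report line numbers of rows that postgres's UTF8 encoding check would reject.
with open("path-to-csv.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if b"\x00" in raw:
            print("line %d: contains a NUL (0x00) byte" % lineno)
            continue
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("line %d: %s" % (lineno, exc))

Rows flagged this way can be dropped or re-encoded before running copy on the full file.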

It would have been nice to get the invalid byte sequence errors when loading the large file initially. But at least they appeared once I started isolating the problem.

Note that the line number in the error from the initial load of the large file had no relation to the encoding errors found when loading the smaller csv subset files. That initial line number was simply the point in the file where exactly 1GB of data had been read, so it was tied to the 1GB buffer allocation error, which in turn had nothing to do with the real problem.

Answered by Clayton Stanley