
Large SQL transaction: runs out of memory on PostgreSQL, yet works on SQL Server

I have decided to move my C# daemon application (using dotConnect as the ADO.NET provider) from SQL Server 2008 R2 to PostgreSQL 9.0.4 x64 (on Windows Server 2008 R2). Accordingly, I slightly modified all queries to match PostgreSQL syntax and... got stuck on behavior that never happened with the same queries on SQL Server (not even on the lowly Express edition).

Let's say the database contains two very simple tables with no relation to each other. They look somewhat like this: ID, Name, Model, ScanDate, Notes. I have a transformation process which reads data over TCP/IP, processes it, starts a transaction and puts the results into the aforementioned two tables using vanilla INSERTs. The tables are initially empty and have no BLOB columns. There are about 500,000 INSERTs on a bad day, all wrapped in a single transaction (which cannot be split into multiple transactions, by the way). No SELECTs, UPDATEs or DELETEs are ever made. An example INSERT (ID is a bigserial and is assigned automatically):

INSERT INTO logs."Incoming" ("Name", "Model", "ScanDate", "Notes")
VALUES('Ford', 'Focus', '2011-06-01 14:12:32', NULL)

SQL Server calmly accepts the load while maintaining a reasonable Working Set of ~200 MB. PostgreSQL, however, takes up an additional 30 MB each second the transaction runs (!) and quickly exhausts system RAM.

I've done my RTFM and tried fiddling with postgresql.conf: setting "work_mem" to the minimum of 64 kB (this slightly slowed the RAM hogging) and reducing "shared_buffers" / "temp_buffers" to the minimum (no difference) - all to no avail. Reducing the transaction isolation level to Read Uncommitted didn't help. There are no indexes except the one on ID BIGSERIAL (PK). SqlCommand.Prepare() makes no difference. No concurrent connections are ever established: the daemon uses the database exclusively.

It may seem that PostgreSQL cannot cope with a mind-numbingly simple INSERT-fest while SQL Server can. Maybe it's a difference between PostgreSQL's snapshot isolation and SQL Server's lock-based isolation? The fact remains: vanilla SQL Server works, while neither vanilla nor tweaked PostgreSQL does.

What can I do to make PostgreSQL's memory consumption remain flat (as is apparently the case with SQL Server) while the INSERT-based transaction runs?

EDIT: I have created an artificial testcase:

DDL:

CREATE TABLE sometable
(
  "ID" bigserial NOT NULL,
  "Name" character varying(255) NOT NULL,
  "Model" character varying(255) NOT NULL,
  "ScanDate" date NOT NULL,
  CONSTRAINT "PK" PRIMARY KEY ("ID")
)
WITH (
  OIDS=FALSE
);

C# (requires Devart.Data.dll & Devart.Data.PostgreSql.dll)

using System;
using System.Data;
using Devart.Data.PostgreSql;

PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);

for (int ii = 0; ii < 300000; ii++)
{
    // a brand-new command is created and Prepare()d for EVERY row -
    // this is what reproduces the memory growth
    PgSqlCommand cmd = conn.CreateCommand();
    cmd.Transaction = tx;
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";
    PgSqlParameter parm = cmd.CreateParameter();
    parm.ParameterName = "@name";
    parm.Value = "SomeName";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@model";
    parm.Value = "SomeModel";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@scanDate";
    parm.PgSqlType = PgSqlType.Date;
    parm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    cmd.Parameters.Add(parm);

    cmd.Prepare();

    long newID = (long)cmd.ExecuteScalar();
}

tx.Commit();

This recreates the memory hogging. HOWEVER, if the 'cmd' variable is created and Prepare()d outside the FOR loop, the memory does not increase! Apparently, preparing multiple PgSqlCommands with IDENTICAL SQL but different parameter values does not result in a single query plan inside PostgreSQL, the way it does in SQL Server.
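
For completeness, here is a sketch of the flat-memory variant (same Devart API as above; the command setup is simply hoisted out of the loop, and in real code only the parameter values would be reassigned per iteration):

PgSqlCommand cmd = conn.CreateCommand();
cmd.Transaction = tx;
cmd.CommandType = CommandType.Text;
cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";

PgSqlParameter nameParm = cmd.CreateParameter();
nameParm.ParameterName = "@name";
nameParm.Value = "SomeName";
cmd.Parameters.Add(nameParm);

PgSqlParameter modelParm = cmd.CreateParameter();
modelParm.ParameterName = "@model";
modelParm.Value = "SomeModel";
cmd.Parameters.Add(modelParm);

PgSqlParameter scanDateParm = cmd.CreateParameter();
scanDateParm.ParameterName = "@scanDate";
scanDateParm.PgSqlType = PgSqlType.Date;
scanDateParm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
cmd.Parameters.Add(scanDateParm);

cmd.Prepare();   // parsed and planned once for all 300000 rows

for (int ii = 0; ii < 300000; ii++)
{
    // in real code only nameParm.Value / modelParm.Value / scanDateParm.Value would change here
    long newID = (long)cmd.ExecuteScalar();
}

tx.Commit();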

The problem remains: if one uses Fowler's Active Record design pattern to insert multiple new objects, sharing a prepared PgSqlCommand instance is not elegant.

Is there a way/option to facilitate query plan reuse across multiple queries that have identical structure but different argument values?
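
One workaround sketch (PreparedCommandCache is a made-up helper, not part of dotConnect): a small per-connection cache keyed by SQL text, so each Active Record class asks for its SQL and gets back an already-prepared command instead of PgSqlCommand instances being passed around:

// Hypothetical helper - requires the System, System.Collections.Generic,
// System.Data and Devart.Data.PostgreSql namespaces.
class PreparedCommandCache
{
    private readonly PgSqlConnection conn;
    private readonly PgSqlTransaction tx;
    private readonly Dictionary<string, PgSqlCommand> cache = new Dictionary<string, PgSqlCommand>();

    public PreparedCommandCache(PgSqlConnection conn, PgSqlTransaction tx)
    {
        this.conn = conn;
        this.tx = tx;
    }

    // 'addParameters' runs only on the first request for a given SQL text;
    // it must add the parameters (names/types) so that Prepare() can succeed.
    public PgSqlCommand Get(string sql, Action<PgSqlCommand> addParameters)
    {
        PgSqlCommand cmd;
        if (!cache.TryGetValue(sql, out cmd))
        {
            cmd = conn.CreateCommand();
            cmd.Transaction = tx;
            cmd.CommandType = CommandType.Text;
            cmd.CommandText = sql;
            addParameters(cmd);
            cmd.Prepare();          // one server-side plan per distinct SQL text
            cache[sql] = cmd;
        }
        return cmd;                 // callers only reassign parameter values
    }
}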

UPDATE

I've decided to look at the simplest possible case - a SQL batch run directly on the DBMS, without ADO.NET (as suggested by Jordani). Surprisingly, PostgreSQL does not compare incoming SQL statements and does not reuse internal compiled plans - even when an incoming statement has exactly the same arguments! For instance, the following batch:

PostgreSQL (via pgAdmin -> Execute query) -- hogs memory

BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

SQL Server (via Management Studio -> Execute) -- keeps memory usage flat

BEGIN TRANSACTION;

INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

and the PostgreSQL log file (thanks, Sayap!) contains:

2011-06-05 16:06:29 EEST LOG:  duration: 0.000 ms  statement: set client_encoding to 'UNICODE'
2011-06-05 16:06:43 EEST LOG:  duration: 15039.000 ms  statement: BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- 99998 lines of the same as above
COMMIT;

Apparently, even when the whole batch is transmitted to the server as-is, the server cannot optimize away the repetition.
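
(As an aside - an untested sketch only: PostgreSQL does have explicit statement-level PREPARE/EXECUTE, which at least makes plan reuse explicit inside such a batch. Generated from C#, reusing 'conn' from the testcase, it would look like this:)

// Untested sketch: the same 100.000-row batch, but with one explicit PREPARE
// and repeated EXECUTEs instead of 100.000 independent INSERT statements.
StringBuilder sql = new StringBuilder();   // using System.Text;
sql.AppendLine("BEGIN;");
sql.AppendLine("PREPARE ins(varchar, varchar, date) AS");
sql.AppendLine("  INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES ($1, $2, $3);");
for (int ii = 0; ii < 100000; ii++)
{
    sql.AppendLine("EXECUTE ins('somename', 'somemodel', '2011-06-01');");
}
sql.AppendLine("DEALLOCATE ins;");
sql.AppendLine("COMMIT;");

PgSqlCommand batch = conn.CreateCommand();
batch.CommandText = sql.ToString();
batch.ExecuteNonQuery();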

ADO.NET driver alternative

As Jordani suggested, I've tried the Npgsql driver instead of dotConnect - with the same (lack of) results. However, the Npgsql source for the .Prepare() method contains these enlightening lines:

planName = m_Connector.NextPlanName();
String portalName = m_Connector.NextPortalName();
parse = new NpgsqlParse(planName, GetParseCommandText(), new Int32[] { });
m_Connector.Parse(parse);

The new content in the log file:

2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  statement: BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  parse npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  execute npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  execute npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan3: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"

The inefficiency is quite obvious in this log excerpt: every single INSERT is parsed, bound and executed as its own plan (npgsqlplan1, npgsqlplan2, npgsqlplan3, ...).

Conclusions (such as they are)

Frank's note about WAL is another awakening: something else to configure that SQL Server hides away from a typical MS developer.

NHibernate (even in its simplest usage) reuses prepared SqlCommands properly... if only it had been used from the start...

It is obvious that an architectural difference exists between SQL Server and PostgreSQL, and code built specifically for SQL Server (and thus blissfully unaware of the 'unable-to-reuse-identical-SQL' possibility) will not work efficiently on PostgreSQL without major refactoring. And refactoring 130+ legacy ActiveRecord classes to reuse prepared SqlCommand objects in a messy multithreaded middleware is not a 'just-replace-dbo-with-public'-type affair.

Unfortunately for my overtime, Eevar's answer is correct :)

Thanks to everyone who pitched in!

Asked by Proglamer on Jun 04 '11



2 Answers

Reducing work_mem and shared_buffers is not a good idea; databases (including PostgreSQL) love RAM.

But this might not be your biggest problem - what about the WAL settings? wal_buffers should be large enough to hold the entire transaction, all 500k INSERTs. What is the current setting? And what about checkpoint_segments?

500k INSERTs should not be a problem; PostgreSQL can handle this without memory problems.

http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html
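
(A quick way to check the values asked about above - assuming the same open connection as in the question's testcase; the same SHOW statements work from psql or pgAdmin:)

// Print the current values of the settings mentioned above.
foreach (string setting in new[] { "wal_buffers", "checkpoint_segments", "work_mem", "shared_buffers" })
{
    PgSqlCommand show = conn.CreateCommand();
    show.CommandText = "SHOW " + setting;
    Console.WriteLine("{0} = {1}", setting, show.ExecuteScalar());
}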

Answered by Frank Heikens


I suspect you figured it out yourself. You're probably creating 500k different prepared statements, query plans and all. Actually, it's worse than that; prepared statements live outside of transaction boundaries and persist until the connection is closed. Abusing them like this will drain plenty of memory.

If you want to execute a query several times but avoid the planning overhead for each execution, create a single prepared statement and reuse that with new parameters.

If your queries are unique and ad hoc, just use Postgres's normal support for bind variables; there's no need for the extra overhead of prepared statements.
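
For illustration, a sketch of the ad-hoc case with the same Devart API used in the question (the prepare-once-and-reuse case is essentially what the corrected testcase in the question shows):

// Ad-hoc statement: parameterized as usual, but Prepare() is never called,
// so no named server-side statement is created on the connection.
PgSqlCommand adhoc = conn.CreateCommand();
adhoc.Transaction = tx;
adhoc.CommandType = CommandType.Text;
adhoc.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate)";

PgSqlParameter p = adhoc.CreateParameter();
p.ParameterName = "@name";
p.Value = "SomeName";
adhoc.Parameters.Add(p);

p = adhoc.CreateParameter();
p.ParameterName = "@model";
p.Value = "SomeModel";
adhoc.Parameters.Add(p);

p = adhoc.CreateParameter();
p.ParameterName = "@scanDate";
p.PgSqlType = PgSqlType.Date;
p.Value = new DateTime(2011, 6, 1, 14, 12, 13);
adhoc.Parameters.Add(p);

adhoc.ExecuteNonQuery();   // no Prepare() for one-off statements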

Answered by eevar