Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is Postgres better than MySql when one needs to add a column to a table with millions of rows?

We're having problems with Mysql. When I search around, I see many people having the same problem.

I have joined up with a product where the database has some tables with as many as 150 million rows. One example of our problem is that one of these tables has over 30 columns and about half of them are no longer used. When trying to remove columns or renaming columns, mysql wants to copy the entire table and rename. With this amount of data, it would take many hours to do this and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. These aren't intended as a regular thing. Just a lot of cleanup I inherited.

I tried searching to see if people have the same problem with Postgres and I find almost nothing in comparison talking about this issue. Is this because Postgres is a lot better at it, or just that less people are using postgres?

like image 663
Andrew Avatar asked Dec 22 '10 22:12

Andrew


2 Answers

In PostgreSQL, adding a new column without default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually added on disk.

like image 83
Peter Eisentraut Avatar answered Oct 10 '22 11:10

Peter Eisentraut


When the only tool you know is a hammer, all your problems look like a nail. For this problem, PostgreSQL is much much better at handling these types of changes. And the fact is, it doesn't matter how well you designed your app, you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, here none of them help. PostgreSQL's very close integration between the various layers means that you can have things like transactional ddl that allow you to roll back anything that isn't an alter / create database / tablespace. Or very very fast alter tables. Or non-impeding create indexes. And so on. It limits PostgreSQL to the things it does well (traditional transactional db load handling is a strong point) and not so great at the things that MySQL often fills in the gaps on, like live networked clustered storage with the ndb engine.

In this case none of the different engines in MySQL allow you to easily solve this problem. The very versatility of multiple storage engines means that the lexer / parser / top layer of the DB cannot be as tightly integrated to the storage engines, and therefore a lot of the cool things pgsql can do here mysql can't.

I've got a 118Gigabyte table in my stats db. It has 1.1 billion rows in it. It really should be partitioned but it's not read a whole lot, and when it is we can wait on it. At 300MB/sec (the speed the array it's on can read) it takes approximately 118*~3seconds to read, or right around 5 minutes. This machine has 32Gigs of RAM, so it cannot hold the table in memory.

When I ran the simple statement on this table:

alter table mytable add test text;

it hung waiting for a vacuum. I killed the vacuum (select pg_cancel_backend(12345) (<-- pid in there) and it finished immediately. A vacuum on this table takes a long time to run btw. Normally it's not a big deal, but when making changes to table structure, you have to wait on vacuums, or kill them.

Dropping a column is just as simple and fast.

Now we come to the problem with postgresql, and that is the in-heap MVCC storage. If you add that column, then do an update table set test='abc' it updates each row, and exactly doubles the size of the table. Unless HOT can update the rows in place, but then you need a 50% fill factor table which is double sized to begin with. The only way to get the space back is to either wait and let vacuum reclaim it over time and reuse it one update at a time, or to run cluster or vacuum full to shrink it back down.

you can get around this by running updates on parts of the table at a time (update where pkid between 1 and 10000000; ...) and running vacuum between each run to reclaim the space.

So, both systems have warts and bumps to deal with.

like image 40
Scott Marlowe Avatar answered Oct 10 '22 11:10

Scott Marlowe