Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours, leaving about 150 million rows in the table. Eight hours is a completely decent number for me; this is a one-time deal.
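For reference, a minimal sketch of the kind of staging table being described (the column names and types are assumptions; the question only says four integer columns with one primary key, and a MEMORY table of this size also needs a raised max_heap_table_size):

-- Sketch of the in-memory staging table; names/types are assumptions.
SET SESSION max_heap_table_size = 12 * 1024 * 1024 * 1024;  -- allow a ~12GB MEMORY table

CREATE TABLE memtable (
    hash BIGINT UNSIGNED NOT NULL,   -- primary key: a hash of the mined record
    a    INT NOT NULL,
    b    INT NOT NULL,
    val  INT NOT NULL,
    PRIMARY KEY (hash)
) ENGINE = MEMORY;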

The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.

ALTER TABLE memtable ENGINE = MyISAM

I let this ALTER TABLE query run for over two days and it still wasn't done, so I've now killed it.

If I create the table as MyISAM from the start, the write speed is terribly poor (especially because the query has to use the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys either: if I did, every raw row would be inserted instead of collapsed, the table would grow to over 1,000 times its size, and I'd then have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.

One of the key constraints to realize: the INSERT query UPDATEs records if the primary key (a hash) already exists in the table.

At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will only get worse.


I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?


Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
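Concretely, each statement looks roughly like the following (a hedged sketch reusing the assumed column names from above; the real columns and increments are whatever the C program emits):

-- Hedged example of the upsert pattern described; names and values are placeholders.
INSERT INTO memtable (hash, a, b, val)
VALUES (123456789, 1, 2, 10)
ON DUPLICATE KEY UPDATE val = val + VALUES(val);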

asked Oct 19 '09 by brianreavis



1 Answer

If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.

You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're judging that when your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
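A rough sketch of that pattern, with assumed table/column names and an auxiliary auto-increment id added purely as a tiebreaker for the dedupe step (note that a plain DELETE keeps one arbitrary row per hash, so it suits a pure dedupe rather than the val=val+... accumulation the question describes):

-- Sketch: load without the constraint, dedupe afterwards, then re-add the key.
ALTER TABLE worktable DROP PRIMARY KEY;   -- or create the table without it

-- ... run the bulk INSERTs here as plain INSERTs ...

-- Remove duplicate hashes, keeping the lowest id for each:
DELETE t1
FROM worktable t1
JOIN worktable t2
  ON t1.hash = t2.hash
 AND t1.id > t2.id;

-- Re-establish the constraint:
ALTER TABLE worktable ADD PRIMARY KEY (hash);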

Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than individual INSERTs, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
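For the first option, a minimal sketch of the load (the file path, delimiters, and column names are assumptions; also note that LOAD DATA can only REPLACE or IGNORE duplicate keys, it cannot accumulate values the way ON DUPLICATE KEY UPDATE does):

-- Minimal sketch of bulk-loading a CSV produced by the C program; path and names are assumptions.
LOAD DATA INFILE '/tmp/mined_rows.csv'
IGNORE INTO TABLE finaltable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash, a, b, val);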

answered Oct 05 '22 by Brenton Alker