The question is aimed at MySQL 5.5 running on Ubuntu 10.04 LTS Server with the default InnoDB table type...
Let's say I have a table "Address" of house addresses with columns "number", "street", "district", "town", "county" and "postcode". I'm going to have many rows with the same values in these columns, and I'm going to index them all individually for searching. Let's say I implement each column as VARCHAR(127) and create 1000 rows all with town='London'. Does that mean I end up with 1000 copies of the string 'London' in my database, or does MySQL do something clever and store the string only once, then reference that single copy from all 1000 rows?
What I've been doing up to now is handling duplicates explicitly: I create a separate table for each of these columns, each with "id" and "value" columns, then use foreign keys in the Address table to reference the unique value in each table. Each time I insert a new Address row I search each table to see if the number, street, district etc already exists. If it does, I use the existing id; if it doesn't, I insert a row in that table and use the new id.
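The look-up-then-insert scheme described above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, covers just two of the columns ("street" and "town"), and the function names create_schema, get_or_create and insert_address are made up for the example.

```python
import sqlite3

# Only two of the question's columns are modelled here, for brevity.
LOOKUP_TABLES = ("street", "town")


def create_schema(conn):
    """Create one id/value lookup table per column, plus the address table."""
    for table in LOOKUP_TABLES:
        conn.execute(
            f"CREATE TABLE {table} ("
            " id INTEGER PRIMARY KEY,"
            " value TEXT NOT NULL UNIQUE)"
        )
    conn.execute(
        "CREATE TABLE address ("
        " id INTEGER PRIMARY KEY,"
        " street_id INTEGER NOT NULL REFERENCES street(id),"
        " town_id INTEGER NOT NULL REFERENCES town(id))"
    )


def get_or_create(conn, table, value):
    """Return the id of `value` in `table`, inserting it first if absent."""
    # Table names cannot be bound as parameters, so only accept known ones.
    if table not in LOOKUP_TABLES:
        raise ValueError(f"unknown lookup table: {table}")
    row = conn.execute(
        f"SELECT id FROM {table} WHERE value = ?", (value,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} (value) VALUES (?)", (value,))
    return cur.lastrowid


def insert_address(conn, street, town):
    """Insert an address row that references the shared street/town values."""
    street_id = get_or_create(conn, "street", street)
    town_id = get_or_create(conn, "town", town)
    cur = conn.execute(
        "INSERT INTO address (street_id, town_id) VALUES (?, ?)",
        (street_id, town_id),
    )
    return cur.lastrowid
```

Inserting 1000 addresses with town 'London' this way leaves a single row in the town table, at the price of an extra lookup per column on every insert.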
Clearly my approach minimises the number of VARCHAR strings stored, as there's only one copy of each duplicate. The question is, does MySQL do something similar (or better!) if I simply declare the columns as VARCHAR and index them?
You will get 1000 copies of "London". In a VARCHAR(127), each copy will consume 1 or 2 bytes for the length, plus 6 bytes for "London". Think of it this way: the overhead for pointing to the single copy, etc, might be bulkier (on average) than the savings.
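As a rough back-of-the-envelope check of those numbers, here is a small Python sketch. It is an estimate under stated assumptions, not a measurement: the length prefix follows the general VARCHAR rule (1 byte if the column can never exceed 255 bytes, otherwise 2), the default of 3 bytes per character assumes MySQL 5.5's utf8, the lookup id is assumed to be a 4-byte INT, and row headers, page overhead and indexes are ignored entirely. The function names are made up for the example.

```python
def varchar_bytes(value, max_chars=127, max_bytes_per_char=3):
    """Approximate storage for one VARCHAR value: length prefix plus data."""
    data = len(value.encode("utf-8"))
    # 1-byte prefix if the column can never exceed 255 bytes, otherwise 2.
    prefix = 1 if max_chars * max_bytes_per_char <= 255 else 2
    return prefix + data


def inline_cost(value, rows, **kwargs):
    """Bytes used when every row stores its own copy of the string."""
    return rows * varchar_bytes(value, **kwargs)


def lookup_cost(value, rows, id_bytes=4, **kwargs):
    """Bytes used when rows store an id and the string is stored once."""
    # One id per address row, plus one lookup row holding id + string.
    return rows * id_bytes + id_bytes + varchar_bytes(value, **kwargs)


if __name__ == "__main__":
    print(inline_cost("London", 1000))   # every row holds the string
    print(lookup_cost("London", 1000))   # rows hold a 4-byte id instead
```

Under these assumptions the difference is only a few bytes per row, and that is before counting the extra table, its indexes and the join needed to read the value back, which is the overhead the answer is pointing at.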
If you are talking about "prefix de-duplication" in indexes, that is not done, but it has been suggested. That's actually a more general way of saving space, but it applies only to index-like structures.
(This answer applies to all versions of MySQL, all common Engines, all CHARACTER SETs.)
If you want storage that does de-duplicate repeated values, look for a "Column Store", such as InfiniDB.
Also, TokuDB, InnoDB with ROW_FORMAT=COMPRESSED, FusionIO, etc, will use compression techniques to decrease disk usage. Those do not de-dup as you described.