I have two files that I'd like to import into MS SQL. The first file is 2.2 GB and the second is 24 GB of data. (If you are curious: this is a poker-related lookup table.)
Importing them into MS SQL is not a problem. Thanks to SqlBulkCopy I was able to import the first file in just 10 minutes. My problem is that I don't know what the actual table schema should look like to allow very fast queries. My first naive attempt looks like this:
CREATE TABLE [dbo].[tblFlopHands](
    [hand_id] [int] IDENTITY(1,1) NOT NULL,
    [flop_index] [smallint] NULL,
    [hand_index] [smallint] NULL,
    [hs1] [real] NULL, [ppot1] [real] NULL,
    [hs2] [real] NULL, [ppot2] [real] NULL,
    [hs3] [real] NULL, [ppot3] [real] NULL,
    [hs4] [real] NULL, [ppot4] [real] NULL,
    [hs5] [real] NULL, [ppot5] [real] NULL,
    [hs6] [real] NULL, [ppot6] [real] NULL,
    [hs7] [real] NULL, [ppot7] [real] NULL,
    [hs8] [real] NULL, [ppot8] [real] NULL,
    [hs9] [real] NULL, [ppot9] [real] NULL,
    CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED ([hand_id] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The flop_index is a value from 1 to 22100 (the first three community cards in Texas Hold'em, 52 choose 3). Each flop_index has a hand_index from 1 to 1176 (49 choose 2), so in total there are 25,989,600 rows in this table.
Doing a query with my above "schema" took approx. 25 seconds. After some googling I found out that SQL Server was doing a table scan, which is obviously a bad thing. I ran the Database Engine Tuning Advisor and it recommended creating an index on the flop_index column (makes sense). After creating the index, the required disk space for the DB exactly doubled (plus the log LDF file grew by 2.6 GB), but after indexing a query took only a couple of milliseconds.
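I didn't keep the advisor's exact script, but I assume it was a covering index along these lines (including every queried column, which would also explain why the disk space doubled):

-- My assumption of what the Tuning Advisor generated; the exact script wasn't saved.
CREATE NONCLUSTERED INDEX [IX_tblFlopHands_flop_index]
ON [dbo].[tblFlopHands] ([flop_index])
INCLUDE ([hand_index], [hs1], [ppot1], [hs2], [ppot2], [hs3], [ppot3],
         [hs4], [ppot4], [hs5], [ppot5], [hs6], [ppot6],
         [hs7], [ppot7], [hs8], [ppot8], [hs9], [ppot9]);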
Now my question is: what is the right way to do this? I've never worked with data this large; the databases I created before were a joke.
Some things to note: after importing the data into MS SQL there will never ever be an insert or update of the data, just SELECTs. So I'm wondering whether I even need a primary key.
EDIT: I'm providing some more info to make my question clearer:
1) I'll never ever use the hand_id. I only put it there because someone told me a long time ago that I should always create a primary key for each table.
2) There will be basically only one query I will use:
SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4, hs5, ppot5,
       hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9
FROM   dbo.tblFlopHands
WHERE  flop_index = @flop_index  -- some value between 1 and 22100
This query will always return 1176 rows with the data I need.
EDIT2: Just to be more specific: yes, this is static data. I have the data in a binary file, and I have written a program that queries this file and returns the data I need in just a few milliseconds. The reason I want this data in a database is that I want to be able to query it from different computers on my network without having to copy 25 GB to each computer.
HS means hand strength: it tells you the current strength of your hole cards combined with the flop or turn cards. ppot means positive potential: the chance that your hand will be ahead once the next community card is dealt. hs1 to hs9 is the hand strength against 1 to 9 opponents, and the same goes for ppot. Calculating ppot on the fly is very CPU intensive and takes a couple of minutes. I want to create a poker analysis program that gives me a list of every possible hole card combination on any given flop/turn with its hs/ppot.
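To give an idea of the kind of lookup the analysis program would do, here is a purely illustrative query (the TOP 10 and ORDER BY are just an example, not part of my actual workload) that ranks the hole card combinations on one flop by their strength against 9 opponents:

-- Illustrative only: rank the 1176 hole card combinations on one flop
-- by hand strength against 9 opponents.
DECLARE @flop_index smallint = 1;   -- any value from 1 to 22100

SELECT TOP (10) hand_index, hs9, ppot9
FROM   dbo.tblFlopHands
WHERE  flop_index = @flop_index
ORDER BY hs9 DESC;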
To answer your question about needing a primary key - with only the information you provided in the question:
Based on your table schema, you might as well keep it there. If you remove that identity column, you'd also be removing your clustered index. The clustered index key (4 bytes here) is stored as the row pointer in each non-clustered index row. By removing that clustered index you'd be leaving the table as a heap, and SQL Server would create an 8-byte RID (row identifier) for each row in the table and use that as the pointer in the non-clustered index instead. So, in your case, based on the schema you've provided in the question, you could potentially INCREASE the size of your non-clustered indexes and in the end slow them down.
With all that said, given the query and usage pattern you describe, making your clustered index something other than an identity column could be worth evaluating as well.
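For example - and this is only a sketch, assuming (flop_index, hand_index) uniquely identifies a row - clustering on the lookup key itself would turn your one query into a clustered index seek, with no identity column and no separate non-clustered index doubling the disk footprint:

-- Sketch only: assumes (flop_index, hand_index) is unique for every row.
CREATE TABLE [dbo].[tblFlopHands](
    [flop_index] [smallint] NOT NULL,
    [hand_index] [smallint] NOT NULL,
    [hs1] [real] NULL, [ppot1] [real] NULL,
    [hs2] [real] NULL, [ppot2] [real] NULL,
    [hs3] [real] NULL, [ppot3] [real] NULL,
    [hs4] [real] NULL, [ppot4] [real] NULL,
    [hs5] [real] NULL, [ppot5] [real] NULL,
    [hs6] [real] NULL, [ppot6] [real] NULL,
    [hs7] [real] NULL, [ppot7] [real] NULL,
    [hs8] [real] NULL, [ppot8] [real] NULL,
    [hs9] [real] NULL, [ppot9] [real] NULL,
    CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED ([flop_index] ASC, [hand_index] ASC)
) ON [PRIMARY];

Since your query filters on flop_index alone and returns all 1176 rows for that flop, a seek on the leading key column reads exactly the pages it needs.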