Very large SQL database: what should the schema look like?

Tags: sql, database, schema

I have two files that I'd like to import into MS SQL. The first file is 2.2 GB and the second is 24 GB. (In case you are curious: this is a poker-related lookup table.)

Importing them into MS SQL is not a problem. Thanks to SqlBulkCopy I was able to import the first file in just 10 minutes. My problem is that I don't know what the actual table schema should look like to allow very fast queries. My first naive attempt looks like this:

CREATE TABLE [dbo].[tblFlopHands](
    [hand_id] [int] IDENTITY(1,1) NOT NULL,
    [flop_index] [smallint] NULL,
    [hand_index] [smallint] NULL,
    [hs1] [real] NULL,
    [ppot1] [real] NULL,
    [hs2] [real] NULL,
    [ppot2] [real] NULL,
    [hs3] [real] NULL,
    [ppot3] [real] NULL,
    [hs4] [real] NULL,
    [ppot4] [real] NULL,
    [hs5] [real] NULL,
    [ppot5] [real] NULL,
    [hs6] [real] NULL,
    [ppot6] [real] NULL,
    [hs7] [real] NULL,
    [ppot7] [real] NULL,
    [hs8] [real] NULL,
    [ppot8] [real] NULL,
    [hs9] [real] NULL,
    [ppot9] [real] NULL,
 CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED 
(
    [hand_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The flop_index is a value from 1 to 22100 (the first 3 community cards in Texas Hold'em, 52 choose 3). Each flop_index has a hand_index from 1 to 1176 (49 choose 2). So in total there are 25,989,600 rows in this table.
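For reference, the row count follows directly from the two binomial coefficients mentioned above:

```latex
\[
\begin{aligned}
\binom{52}{3} &= \frac{52 \cdot 51 \cdot 50}{3 \cdot 2 \cdot 1} = 22{,}100 \\
\binom{49}{2} &= \frac{49 \cdot 48}{2 \cdot 1} = 1{,}176 \\
22{,}100 \times 1{,}176 &= 25{,}989{,}600
\end{aligned}
\]
```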

Running a query against the schema above took approx. 25 seconds. After some googling I found out that SQL Server was doing a table scan, which is obviously a bad thing. I ran the "Database Engine Tuning Advisor" and it recommended creating an index on the flop_index column (which makes sense). After creating the index, the disk space required for the database exactly doubled (and the LDF log file grew by 2.6 GB). But with the index in place, a query took only a couple of milliseconds.
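The question does not show the index the advisor actually generated. As a rough sketch only: an index on flop_index that also carries every column the query reads would be consistent with the database doubling in size, because it duplicates nearly the whole table. The index name below is hypothetical.

```sql
-- Sketch only: the advisor's real recommendation is not shown in the question.
-- IX_tblFlopHands_flop_index is a made-up name.
-- INCLUDE stores the listed columns in the index leaf rows, so the query
-- can be answered from the index alone, at the cost of a second copy of the data.
CREATE NONCLUSTERED INDEX IX_tblFlopHands_flop_index
    ON dbo.tblFlopHands (flop_index)
    INCLUDE (hand_index,
             hs1, ppot1, hs2, ppot2, hs3, ppot3,
             hs4, ppot4, hs5, ppot5, hs6, ppot6,
             hs7, ppot7, hs8, ppot8, hs9, ppot9);
```

An index on flop_index alone, without the INCLUDE list, would be much smaller, but each of the 1176 matching rows would then need a lookup back into the clustered index.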

Now my question is: how should I do this the right way? I've never worked with data this massive; the databases I created before were a joke by comparison.

One thing to note: after the data has been imported into MS SQL there will never be an insert or update, only SELECTs. So I'm wondering whether I even need a primary key.

EDIT: I'm providing some more info to make my question clearer:

1) I'll never use the hand_id. I only put it there because someone told me a long time ago that I should always create a primary key for each table.

2) There is basically only one query I will use:

SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4, hs5, ppot5, hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9
FROM dbo.tblFlopHands
WHERE flop_index = @flop_index  -- any single value from 1 to 22100

This query will always return 1176 rows with the data I need.

EDIT2: Just to be more specific: yes, this is static data. I have this data in a binary file, and I have written a program that pulls the data I need out of this file in just a few milliseconds. The reason I want the data in a database is that I want to be able to query it from different computers on my network without having to copy 25 GB to each computer.

hs means hand strength: the current strength of your hole cards combined with the flop or turn cards. ppot means positive potential: the chance that your hand will be ahead once the next community card is dealt. hs1 to hs9 are the hand strengths against 1 to 9 opponents; the same goes for ppot1 to ppot9. Calculating ppot on the fly is very CPU-intensive and takes a couple of minutes. I want to create a poker analysis program that gives me a list of every possible hole card combination on any given flop/turn with its hs/ppot values.


asked Aug 31 '09 19:08

Simon

1 Answer

To answer your question about needing a primary key - with only the information you provided in the question:

Based on your table schema, you might as well keep it. If you remove that identity column, you also remove your clustered index. The clustered index key (4 bytes here) is stored as the row pointer in every non-clustered index row. Without a clustered index the table becomes a heap, and SQL Server uses an 8-byte RID (row identifier) as the pointer in the non-clustered index instead. So, with the schema in your question, dropping the primary key could actually INCREASE the size of your non-clustered indexes and, in the end, slow them down.

With all that said - depending on the queries you will be running (and their usage patterns), which weren't included in the question - it could also be worth evaluating a clustered index on something other than the identity column.
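To put rough numbers on that pointer overhead, assume one non-clustered index row per table row and ignore page and row overhead:

```latex
\[
\begin{aligned}
25{,}989{,}600 \times 4\ \text{bytes} &= 103{,}958{,}400\ \text{bytes} \approx 104\ \text{MB} \\
25{,}989{,}600 \times 8\ \text{bytes} &= 207{,}916{,}800\ \text{bytes} \approx 208\ \text{MB}
\end{aligned}
\]
```

So under these assumptions a heap would add on the order of 100 MB of pointers to each non-clustered index.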
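As one possible reading of that last suggestion, here is a minimal sketch that clusters the table on (flop_index, hand_index) instead of an identity column. It assumes each (flop_index, hand_index) pair is unique and never NULL, which matches the row counts described in the question. The table and constraint names are hypothetical.

```sql
-- Sketch only: tblFlopHandsByFlop and its constraint name are made up.
-- Clustering on (flop_index, hand_index) stores the 1176 rows of one flop
-- physically together, so the single query becomes a short range seek
-- and no separate non-clustered index is needed.
CREATE TABLE dbo.tblFlopHandsByFlop (
    flop_index smallint NOT NULL,  -- 1 to 22100
    hand_index smallint NOT NULL,  -- 1 to 1176
    hs1 real NULL, ppot1 real NULL,
    hs2 real NULL, ppot2 real NULL,
    hs3 real NULL, ppot3 real NULL,
    hs4 real NULL, ppot4 real NULL,
    hs5 real NULL, ppot5 real NULL,
    hs6 real NULL, ppot6 real NULL,
    hs7 real NULL, ppot7 real NULL,
    hs8 real NULL, ppot8 real NULL,
    hs9 real NULL, ppot9 real NULL,
    CONSTRAINT PK_tblFlopHandsByFlop
        PRIMARY KEY CLUSTERED (flop_index, hand_index)
);

-- The one query from the question, against the sketched table.
DECLARE @flop_index smallint;
SET @flop_index = 1;

SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3,
       hs4, ppot4, hs5, ppot5, hs6, ppot6,
       hs7, ppot7, hs8, ppot8, hs9, ppot9
FROM dbo.tblFlopHandsByFlop
WHERE flop_index = @flop_index
ORDER BY hand_index;
```

The two smallint key columns add up to the same 4 bytes as the int identity, and dropping hand_id saves 4 bytes per row. Whether this beats the existing layout in practice would need to be measured on the real data.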


answered Nov 14 '22 23:11

Scott Ivey
