Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

One 400GB table, One query - Need Tuning Ideas (SQL2005)

I have a single large table which I would like to optimize. I'm using MS-SQL 2005 server. I'll try to describe how it is used and if anyone has any suggestions I would appreciate it very much.

The table is about 400GB, has 100 million rows and 1 million rows are inserted each day. The table has 8 columns, 1 data col and 7 columns used for lookups/ordering.

 k1 k2 k3 k4 k5 k6 k7 d1

where

 k1: varchar(3), primary key - clustered index, 10 possible values
 k2: bigint, primary key - clustered index, total rows/10 possible values
 k3: int, 10 possible values
 k4: money, 100 possible values
 k5: bool
 k6: bool
 k7: DateTime

Only one select query is run which looks like this:

 SELECT TOP(g) d1 FROM table WITH(NOLOCK)
  WHERE k1 = a
  AND k3 = c
  AND k4 = d
  AND k5 = e
  AND k6 = f
  ORDER BY k7

where g = circa 1 million This query us ran about 10 times per day (often while inserts are happening) and takes about 5-30 minutes.

So I currently only have a clustered index on the two primary key columns. My question is: what indexes should I add to improve this query's performance?

Would separate indexes on every column be a good choice? I think a single index would take up about 5-8GB. The DB server has 8GB RAM total.

Please do not say that the best thing is to experiment. This is akin to 'I don't know, work it out your self' :)

Any tips much appreciated!


EDIT by doofledorfer--

You've caused an outbreak of premature optimization here, if not outright suggestions that "the best thing is to experiment". You need to clarify a number of issues if you want useful help.

-- doofledorfer


EDIT: Comments on posts to date are now posted below along with query plan - Mr. Flibble

You are probably I/O bound

Yes, it is not CPU bound. Disk access is high. All available RAM seems to be used. Whether it is used wisely or not remains to be seen.

You say you can't split the data because all the data is used: IMPOSSIBLE

I mean that all data is used at some point - not that all data is used by each user in each query. I can certainly split the data but, so far, I don't understand why partitioning the table is any better than using a clustered index.

Why did you choose these types VARCHAR probably should have been INT as it can only be a few values. The rest are sensible enough, Money represents a money value in real life and bigint is an ID, and the bools are onny, offy type things :)

By any chance we could get have a look the insert statement, or TSQL or the bulkinsert

TSQL. Its basically INSERT INTO table VALUES (k1,k2,k3,k4,k5,k6,d1). The only thing that is in any way interesting is that many duplicate inserts are attempted and the k1 & k2 PK constraint is used to prevent duplicate data entering the database. I believed at design time (and now) that this was as quick a way as any to finter out duplicate data.

Can you tell how often your insert happens Every 10 minutes or so inserts run (ADO.NET) maybe 10K at a time and take a few minutes. I estimate currently a full day's inserts take 40% of the time in the day.

Does the DateTime field contains the date of insert No. There is actually another DateTime column which does but it is not retrieved in any SELECT query so I didn't mention it for the sake of simplicity.

How did you came to this More one man day thinking.

if you're interested only in the last data, deleting/archiving the useless data could make sense (start from scratch every morning)

I am not interested in recent data only. A query may select some of the very first data that was inserted into the table all the way up to data inserted minutes ago. But as the data is filtered this does not mean that all the data in the DB is requested in that query.

if there is only one "inserter" and only one "reader", you may want to switch to a specialised type (hashmap/list/deque/stack) or something more elaborated, in a programming language.

I will probably stick with MSSQL for the moment. It's not broke yet, just a little slow.

liggett78, do you suggest a clustered index on columns k1,k4,k5,k6,k3 or a non-clustered index on those columns?


My main question right now is should I extend the current clustered index to contain k4 also (this is the col with next most possible values) or should I just add a non-clustered index to k4.

Would adding all k1-k6 to a clustered index be an option? Then have a separate non-clustered index on the DateTime column for the ORDER BY? Am I correct in thinking that this would not cause any major increase in DB size but will only affect insert times. Can anyone guesstimate the effect this will have on inserts?

I think that if adding indexes to all the columns will double the DB size then it is not viable without large (ie. hardware) changes.


The following plan was run with an index (non clustered) on the DATE column.

EDIT: Not sure if you can see the XML below so here is a link to it: http://conormccarthy.com/box/queryplan.sqlplan.txt

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.0" Build="9.00.1399.06" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="11111" StatementId="1" StatementOptmLevel="FULL" StatementSubTreeCost="625.754" StatementText="SELECT TOP(11111) d1 FROM hands WITH (NOLOCK) &#xD;&#xA;                                WHERE k4 = '10' &#xD;&#xA;                                AND k6 = 1 &#xD;&#xA;                                AND k5 = 1  &#xD;&#xA;                                AND k1 = 'IPN'  &#xD;&#xA;                                AND k3 BETWEEN 2 AND 10  &#xD;&#xA;                                ORDER BY k7 DESC&#xD;&#xA;&#xD;&#xA;" StatementType="SELECT">
          <StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
          <QueryPlan DegreeOfParallelism="1" CachedPlanSize="36">
            <MissingIndexes>
              <MissingIndexGroup Impact="81.7837">
                <MissingIndex Database="[MYDB]" Schema="[dbo]" Table="[Hands]">
                  <ColumnGroup Usage="EQUALITY">
                    <Column Name="[k1]" ColumnId="1" />
                    <Column Name="[k4]" ColumnId="7" />
                    <Column Name="[k5]" ColumnId="9" />
                    <Column Name="[k6]" ColumnId="10" />
                  </ColumnGroup>
                  <ColumnGroup Usage="INEQUALITY">
                    <Column Name="[k3]" ColumnId="6" />
                  </ColumnGroup>
                  <ColumnGroup Usage="INCLUDE">
                    <Column Name="[d1]" ColumnId="3" />
                    <Column Name="[k7]" ColumnId="4" />
                  </ColumnGroup>
                </MissingIndex>
              </MissingIndexGroup>
            </MissingIndexes>
            <RelOp AvgRowSize="75" EstimateCPU="0.0011111" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="11111" LogicalOp="Top" NodeId="0" Parallel="false" PhysicalOp="Top" EstimatedTotalSubtreeCost="625.754">
              <OutputList>
                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="d1" />
              </OutputList>
              <RunTimeInformation>
                <RunTimeCountersPerThread Thread="0" ActualRows="11111" ActualEndOfScans="1" ActualExecutions="1" />
              </RunTimeInformation>
              <Top RowCount="false" IsPercent="false" WithTies="false">
                <TopExpression>
                  <ScalarOperator ScalarString="(11111)">
                    <Const ConstValue="(11111)" />
                  </ScalarOperator>
                </TopExpression>
                <RelOp AvgRowSize="83" EstimateCPU="135.557" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="11111" LogicalOp="Filter" NodeId="1" Parallel="false" PhysicalOp="Filter" EstimatedTotalSubtreeCost="625.753">
                  <OutputList>
                    <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="d1" />
                    <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k7" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRows="11111" ActualEndOfScans="0" ActualExecutions="1" />
                  </RunTimeInformation>
                  <Filter StartupExpression="false">
                    <RelOp AvgRowSize="96" EstimateCPU="318.331" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="195691" LogicalOp="Inner Join" NodeId="2" Parallel="false" PhysicalOp="Nested Loops" EstimatedTotalSubtreeCost="625.404">
                      <OutputList>
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="d1" />
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k7" />
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k3" />
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k4" />
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k5" />
                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k6" />
                      </OutputList>
                      <RunTimeInformation>
                        <RunTimeCountersPerThread Thread="0" ActualRows="341958" ActualEndOfScans="0" ActualExecutions="1" />
                      </RunTimeInformation>
                      <NestedLoops Optimized="false" WithOrderedPrefetch="true">
                        <OuterReferences>
                          <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                          <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="HandId" />
                          <ColumnReference Column="Expr1003" />
                        </OuterReferences>
                        <RelOp AvgRowSize="32" EstimateCPU="330.366" EstimateIO="790.88" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="195691" LogicalOp="Index Scan" NodeId="4" Parallel="false" PhysicalOp="Index Scan" EstimatedTotalSubtreeCost="2.88444">
                          <OutputList>
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="HandId" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k7" />
                          </OutputList>
                          <RunTimeInformation>
                            <RunTimeCountersPerThread Thread="0" ActualRows="341958" ActualEndOfScans="0" ActualExecutions="1" />
                          </RunTimeInformation>
                          <IndexScan Ordered="true" ScanDirection="BACKWARD" ForcedIndex="false" NoExpandHint="false">
                            <DefinedValues>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="HandId" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k7" />
                              </DefinedValue>
                            </DefinedValues>
                            <Object Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Index="[ix_dateplayed]" />
                            <Predicate>
                              <ScalarOperator ScalarString="[MYDB].[dbo].[Hands].[k1]=N'IPN'">
                                <Compare CompareOp="EQ">
                                  <ScalarOperator>
                                    <Identifier>
                                      <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                                    </Identifier>
                                  </ScalarOperator>
                                  <ScalarOperator>
                                    <Const ConstValue="N'IPN'" />
                                  </ScalarOperator>
                                </Compare>
                              </ScalarOperator>
                            </Predicate>
                          </IndexScan>
                        </RelOp>
                        <RelOp AvgRowSize="88" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="195691" EstimateRewinds="0" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="6" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="621.331">
                          <OutputList>
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="d1" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k3" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k4" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k5" />
                            <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k6" />
                          </OutputList>
                          <RunTimeInformation>
                            <RunTimeCountersPerThread Thread="0" ActualRows="341958" ActualEndOfScans="0" ActualExecutions="341958" />
                          </RunTimeInformation>
                          <IndexScan Lookup="true" Ordered="true" ScanDirection="FORWARD" ForcedIndex="false" NoExpandHint="false">
                            <DefinedValues>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="d1" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k3" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k4" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k5" />
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k6" />
                              </DefinedValue>
                            </DefinedValues>
                            <Object Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Index="[PK_Hands]" TableReferenceId="-1" />
                            <SeekPredicates>
                              <SeekPredicate>
                                <Prefix ScanType="EQ">
                                  <RangeColumns>
                                    <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                                    <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="HandId" />
                                  </RangeColumns>
                                  <RangeExpressions>
                                    <ScalarOperator ScalarString="[MYDB].[dbo].[Hands].[k1]">
                                      <Identifier>
                                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k1" />
                                      </Identifier>
                                    </ScalarOperator>
                                    <ScalarOperator ScalarString="[MYDB].[dbo].[Hands].[HandId]">
                                      <Identifier>
                                        <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="HandId" />
                                      </Identifier>
                                    </ScalarOperator>
                                  </RangeExpressions>
                                </Prefix>
                              </SeekPredicate>
                            </SeekPredicates>
                          </IndexScan>
                        </RelOp>
                      </NestedLoops>
                    </RelOp>
                    <Predicate>
                      <ScalarOperator ScalarString="[MYDB].[dbo].[Hands].[k4]=($10.0000) AND [MYDB].[dbo].[Hands].[k6]=(1) AND [MYDB].[dbo].[Hands].[k5]=(1) AND [MYDB].[dbo].[Hands].[k3]&gt;=(2) AND [MYDB].[dbo].[Hands].[k3]&lt;=(10)">
                        <Logical Operation="AND">
                          <ScalarOperator>
                            <Compare CompareOp="EQ">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k4" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="($10.0000)" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                          <ScalarOperator>
                            <Compare CompareOp="EQ">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k6" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="(1)" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                          <ScalarOperator>
                            <Compare CompareOp="EQ">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k5" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="(1)" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                          <ScalarOperator>
                            <Compare CompareOp="GE">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k3" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="(2)" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                          <ScalarOperator>
                            <Compare CompareOp="LE">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[MYDB]" Schema="[dbo]" Table="[Hands]" Column="k3" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="(10)" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                        </Logical>
                      </ScalarOperator>
                    </Predicate>
                  </Filter>
                </RelOp>
              </Top>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>
like image 513
Mr. Flibble Avatar asked Dec 08 '08 18:12

Mr. Flibble


1 Answers

As I hinted in a comment, I have done this with a single Oracle table approaching 8 TB consisting of over two billion rows growing at the rate of forty million rows per day. However, in my case, the users were two million (and growing) customers accessing this data over the web, 24x7, and literally ANY of the rows was subject to being accessed. Oh, and new rows had to be added within two minutes of real-time.

You are probably I/O bound, not CPU or memory bound, so optimizing the disk access is critical. Your RAM is fine--more than adequate. Using multiple cores would be helpful, but limited if the I/O is not parallelized.

Several people have suggested splitting up the data, which should be taken seriously since it is far better and more effective than any other solution (nothing is faster than not touching the data at all).

You say you can't split the data because all the data is used: IMPOSSIBLE! There is no way that your users are paging through one million rows per day or one hundred million rows total. So, get to know how your users are ACTUALLY using the data--look at every query in this case.

More importantly, we are not saying that you should DELETE the data, we are saying to SPLIT the data. Clone the table structure into multiple, similarly-named tables, probably based on time (one month per table, perhaps). Copy the data into the relevant tables and delete the original table. Create a view that performs a union over the new tables, with the same name as the original table. Change your insert processing to target the newest table (assuming that it is appropriate), and your queries should still work against the new view.

Your savvy users can now start to issue their queries against a subset of the tables, perhaps even the newest one only. Your unsavvy users can continue to use the view over all the tables.

You now have a data management strategy in the form of archiving the oldest table and deleting it (update the view definition, of course). Likewise, you will need to create a new table periodically and update the view definition for that end of the data as well.

Expect to not be able to use unique indexes: they don't scale beyond about one-to-two million rows. You may also have to modify some other tactics/advice as well. At one hundred million rows and 400 GB, you have entered another realm of processing.

Beyond that, use the other suggestions--analyze the actual performance using the many tools already available in SQL Server and the OS. Apply the many well-known tuning techniques that are readily available on the web or in books.

However, do NOT experiment! With that much data, you don't have time for experiments and the risk is too great. Study carefully the available techniques and your actual performance details, then choose one step at a time and give each one a few hours to days to reveal its impact.

like image 149
Rob Williams Avatar answered Oct 24 '22 10:10

Rob Williams