Sunday, July 13, 2014

[MSSQL] Why Primary Keys can be important on large tables

By Ben Kubicek, 2014/06/30
An easy answer to this question, why primary keys can be important, is: because if you don't have a primary key / clustered index and your table does inserts, updates and deletes, SQL Server will fragment the table, and it will take up a lot of extra space.  Now, are primary keys required?  Are they always needed?  Only your design and implementation can answer those questions for your situation.
If you don't have a primary key / clustered index then you have a heap table.  Note that a primary key and a clustered index are not the same thing.  The primary key has to be unique, but a clustered index can have duplicates.  In general, if your table is doing inserts, update and deletes, it is a bad idea to have a table without a clustered index.
I inherited a database from a previous employee.  This database was used as a staging area to load data into the live database.  Records are inserted, processed, and then deleted after a period of time.  The processes worked fine, so I didn't think much about reviewing the tables inside of this staging db.  As time went on the database continued to grow, which didn't make sense since the number of records didn't match the growth I was seeing.  When I took a closer look at the tables I realized none of the tables had primary keys or clustered indexes.
Let's take just a moment to talk about what it takes for a column to be a primary key.  First the column needs to be set as not null.  Second the column needs to be unique.  So an identity column can be a great choice for a primary key.  Here is a funny thing about those tables that didn't have a primary key:  They already had an identity column.
So I altered each table and made the existing identity column the primary key / clustered index.  The effect was the total size of the database dropped significantly.  Here is an example of one table:
Count    TableSize      DataSpace IndexSpace UnusedSpace
1310149  3134048 KB     462464 KB 24 KB      2671560 KB
After adding a primary key:
Count    TableSize      DataSpace  IndexSpace UnusedSpace
1310167  163976 KB      163536 KB  384 KB     56 KB
If you notice before there was a primary key / clustered index, the data space used and unused space is quite large.  After a primary key was added the data space and the unused space dropped and the index space used didn't go up that much.  I had a 2GBb database that reduced down to 1GB after adding a primary key / clustered index to the tables.
Now we get to the part of the article that might help you if you want to add a primary key to a table that doesn't have one. There are two easy ways you can make an existing column into a primary key.  First you can right click the table in Management Studio and click Design. 

Then select the identity column or the column(s) you want to be the primary key and click the key button.

You should see a little key next to the column(s) you selected and then you can click the save button.

Once the change saves you will have a primary key that is the clustered index.
If there are a lot of rows in the table you are trying to add a primary key to, you might get a time out error when you click the save button if you use the right click design table method.  This leads to the next method, which is using T-SQL.  You can write a simple alter table statement to add a primary key / clustered index.
ALTER TABLE dbo.tableName ADD PRIMARY KEY 
CLUSTERED (columnname)

You would replace dbo.tableName with your table and columnname with the column you want to make the primary key / clustered index.
Now if you don't have a good column you can use as a primary key, you will have to add a column. The simplest thing to do would be to add a primary key identity column to the table. The T-SQL would look something like this:
ALTER TABLE dbo.tableName 
ADD [id] INT IDENTITY(1,1) NOT NULL PRIMARY KEY
Note if you have more than two billion rows in your table you would want to use a bigint instead of an int.  It is important to note that there are cases when the primary key should not be the clustered index since the table may benefit from other column(s) as the clustered index.
If you need help finding tables that are missing primary keys you can run this SQL:
SELECT
    s.name
  , t.name
  FROM
    sys.tables t
  JOIN
    sys.schemas s
    ON t.schema_id = s.schema_id
  WHERE
    t.type = 'U'
    AND NOT EXISTS ( SELECT
                        k.name
                      FROM
                        sys.key_constraints k
                      WHERE
                        k.parent_object_id = t.object_id
                        AND k.schema_id = s.schema_id
                        AND k.type = 'PK' )
  ORDER BY
    t.name
I hope this article has shown the importance of adding primary keys to large tables to reduce the overall size of the database.

No comments:

Post a Comment