Let's say I have a User which has a status and the user's status can be 'active', 'suspended' or 'inactive'.
Now, when creating the database, I was wondering... would it be better to have a column with the string value (with an enum type, or rule applied) so it's easier to both query and know the current user status or are joins better and I should join in a UserStatuses table which contains the possible user statuses?
Assuming, of course statuses can not be created by the application user.
Edit: Some clarification
On most systems it makes little or no difference to performance. Personally I'd use a short string for clarity and join that to a table with more detail as you suggest.
create table intLookup
(
pk integer primary key,
value varchar(20) not null
)
insert into intLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table stringLookup
(
pk varchar(4) primary key,
value varchar(20) not null
)
insert into stringLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table masterData
(
stuff varchar(50),
fkInt integer references intLookup(pk),
fkString varchar(4)references stringLookup(pk)
)
create index i on masterData(fkInt)
create index s on masterData(fkString)
insert into masterData
(stuff, fkInt, fkString)
select COLUMN_NAME, (ORDINAL_POSITION %4)+1,(ORDINAL_POSITION %4)+1 from INFORMATION_SCHEMA.COLUMNS
go 1000
This results in 300K rows.
select
*
from masterData m inner join intLookup i on m.fkInt=i.pk
select
*
from masterData m inner join stringLookup s on m.fkString=s.pk
On my system (SQL Server) - the query plans, I/O and CPU are identical - execution times are identical. - The lookup table is read and processed once (in either query)
There is NO difference using an int or a string.
I think, as a whole, everyone has hit on important components of the answer to your question. However, they all have good points which should be taken together, rather than separately.
As logixologist mentioned, a healthy amount of Normalization is generally considered to increase performance. However, in contrast to logixologist, I think your situation is the perfect time for normalization. Your problem seems to be one of normalization. In this case, using a numeric key as Santhosh suggested which then leads back to a code table containing the decodes for the statuses will result in less data being stored per record. This difference wouldn't show in a small Access database, but it would likely show in a table with millions of records, each with a status.
As David Aldridge suggested, you might find that normalizing this particular data point will result in a more controlled end-user experience. Normalizing the status field will also allow you to edit the status flag at a later date in one location and have that change perpetuated throughout the database. If your boss is like mine, then you might have to change the Status of Inactive to Closed (and then back again next week!), which would be more work if the status field was not normalized. By normalizing, it's also easier to enforce referential integrity. If a status key is not in the Status code table, then it can't be added to your main table.
If you're concerned about the performance when querying in the future, then there are some different things to consider. To pull back status, if it's normalized, you'll be adding a join to your query. That join will probably not hurt you in any sized recordset but I believe it will help in larger recordsets by limiting the amount of raw text that must be handled. If your primary concern is performance when querying the data, here's a great resource on how to optimize queries: http://www.sql-server-performance.com/2007/t-sql-where/ and I think you'll find that a lot of the rules discussed here will also apply to any inclusion criteria you enforce in the join itself.
Hope this helps!
Christopher
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With