Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database design : preferred field length for file paths

Tags:

I have to store file paths in a DB field (/tmp/aaa/bbb, C:\temp\xxx\yyy, etc.). I can't really tell how long they could be.

Given this http://en.wikipedia.org/wiki/Comparison_of_file_systems and that http://msdn.microsoft.com/en-us/library/aa365247.aspx, depending on the file system there could be theoretically no length limit for a path.

I guess that defining this field as a LONGBLOB or VARCHAR(very high value) wouldn't be wise. I've thought about something like VARCHAR(1024) which should be suitable for most frequent (even if not all) cases, and not too big as a DB field. What would you recommend ?

Thanks.

like image 933
Frosty Z Avatar asked Dec 07 '10 14:12

Frosty Z


1 Answers

Use the appropriate length for the data you intend to support. Since you're using SQL Server you should use nvarchar(260) as the upper limit for storing path names, since that is the specification limit for typical Windows machines. Under certain circumstances you can create paths longer than that, however Windows Explorer will tend to have problems handling them. SQL Server cannot handle filenames longer than 260 characters. This includes SQL Server on Linux.

I can prove SQL Server uses an nvarchar(260) column internally to store SQL Server Database filenames, with the path included. Checking the definition of the sys.master_files view, we see the following T-SQL:

 CREATE VIEW sys.master_files AS
    SELECT
        database_id     = f.dbid,
        file_id         = f.fileid,
        file_guid       = f.fileguid,
        type            = f.filetype,
        type_desc       = ft.name,
        data_space_id   = f.grpid,
        name            = f.lname,
        physical_name   = f.pname,
        state           = convert(tinyint, case f.filestate     -- Map enum EMDFileState to AvailablityStates
                                when 0 then 0 when 10 then 0    -- ONLINE
                                when 4 then 7   -- DEFUNCT
                                when 5 then 3 when 9 then 3 -- RECOVERY_PENDING
                                when 7 then 1 when 8 then 1 when 11 then 1  -- RESTORING
                                when 12 then 4  -- SUSPECT
                                else 6 end),    -- OFFLINE
        state_desc      = st.name,
        f.size,
        max_size            = f.maxsize,
        f.growth,
        is_media_read_only  = sysconv(bit, f.status & 8),       -- FIL_READONLY_MEDIA
        is_read_only            = sysconv(bit, f.status & 16),  -- FIL_READONLY
        is_sparse           = sysconv(bit, f.status & 256), -- FIL_SPARSE_FILE
        is_percent_growth   = sysconv(bit, f.status & 32),  -- FIL_PERCENT_GROWTH
        is_name_reserved        = sysconv(bit, case f.filestate when 3 then 1 else 0 end), -- x_efs_DroppedReusePending
        create_lsn          = GetNumericLsn(f.createlsn),
        drop_lsn                = GetNumericLsn(f.droplsn),
        read_only_lsn           = GetNumericLsn(f.readonlylsn),
        read_write_lsn      = GetNumericLsn(f.readwritelsn),
        differential_base_lsn   = GetNumericLsn(f.diffbaselsn),
        differential_base_guid  = f.diffbaseguid,
        differential_base_time  = nullif(f.diffbasetime, 0),
        redo_start_lsn          = GetNumericLsn(f.redostartlsn),
        redo_start_fork_guid    = f.redostartforkguid,
        redo_target_lsn     = GetNumericLsn(f.redotargetlsn),
        redo_target_fork_guid   = f.forkguid,
        backup_lsn          = GetNumericLsn(f.backuplsn),
        credential_id       = cr.credential_id
    FROM sys.sysbrickfiles f
    LEFT JOIN sys.syspalvalues st ON st.class = 'DBFS' AND st.value = f.filestate
    LEFT JOIN sys.syspalvalues ft ON ft.class = 'DBFT' AND ft.value = f.filetype
    LEFT JOIN sys.credentials cr ON f.pname LIKE cr.name + N'%' COLLATE database_default
    WHERE f.dbid < 0x7fff -- consistent with sys.databases
        AND f.pruid = 0
        AND f.filestate NOT IN (1, 2)   -- x_efs_Dummy, x_efs_Dropped
        AND has_access('MF', 1) = 1

Microsoft Docs for sys.master_files says this about the physical_name column:

physical_name nvarchar(260) Operating-system file name.

But let's not trust that. We see the physical file name is referenced as physical_name = f.pname. And the table alias "f" points to FROM sys.sysbrickfiles f. Therefore, SQL Server stores the filename in sys.sysbrickfiles, which is an internal table that is only visible from the Dedicated Administrator Connection, or DAC as its frequently known. Connecting to the DAC, and generating a temp table from the output of sys.sysbrickfiles, we see the following:

CREATE TABLE #sysbrickfiles
(
      brickid           int              NOT NULL
    , dbid              int              NOT NULL
    , pruid             int              NOT NULL
    , fileid            int              NOT NULL
    , grpid             int              NOT NULL
    , status            int              NOT NULL
    , filetype          tinyint          NOT NULL
    , filestate         tinyint          NOT NULL
    , size              int              NOT NULL
    , maxsize           int              NOT NULL
    , growth            int              NOT NULL
    , lname             nvarchar(128)    NOT NULL
    , pname             nvarchar(260)    NOT NULL
    , createlsn         binary(10)       NULL
    , droplsn           binary(10)       NULL
    , fileguid          uniqueidentifier NULL
    , internalstatus    int              NOT NULL
    , readonlylsn       binary(10)       NULL
    , readwritelsn      binary(10)       NULL
    , readonlybaselsn   binary(10)       NULL
    , firstupdatelsn    binary(10)       NULL
    , lastupdatelsn     binary(10)       NULL
    , backuplsn         binary(10)       NULL
    , diffbaselsn       binary(10)       NULL
    , diffbaseguid      uniqueidentifier NULL
    , diffbasetime      datetime         NOT NULL
    , diffbaseseclsn    binary(10)       NULL
    , redostartlsn      binary(10)       NULL
    , redotargetlsn     binary(10)       NULL
    , forkguid          uniqueidentifier NULL
    , forklsn           binary(10)       NULL
    , forkvc            bigint           NOT NULL
    , redostartforkguid uniqueidentifier NULL
);

As you can see, the pname column is indeed defined as nvarchar(260).

Also, if we attempt to create a database using a filename that is longer than 260 characters, we see an error is returned:

Msg 103, Level 15, State 3, Line 7
The file that starts with 'F:\AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAARGH.mdf' is too long. Maximum length is 259.

Using anything other than an nvarchar(260) column to store filenames in SQL Server is both wasteful, and creates technical debt.

The length of a column is extremely important performance-wise. Column length directly affects:

  • memory grants for queries against the column. When the query processor creates a query plan, it uses the size of each column present in the query as a basis for the amount of memory required to run the query. It doesn't use the actual size of the data present in each column, instead it "guesses" that the average size of the data will be 50% of the maximum length of the column.
  • Ability to index the column efficiently. Larger columns create significantly larger indexes. Larger indexes require the more memory and disk throughput than smaller indexes. SQL Server has a maximum key length of 1700 bytes for non-clustered indexes (as of SQL Server 2016) and a maximum key length of 900 bytes for clustered indexes. If you attempt to create an index on columns larger than those maximum amounts, you get errors, and possibly not until run-time when it can be very costly to fix.
  • character-based primary/foreign key performance is severely affected by larger columns lengths. When primary keys are referenced via foreign keys, the size requirements for memory, disk, and I/O are duplicated for each foreign key. Take for example a Customer table where the key is the CustomerName column, defined as varchar(500). Every table that references customers will now require a 500-byte CustomerName column. If that column was defined as a varchar(100) instead, every query referencing those columns will save 200 bytes per row in memory and disk I/O.
  • Erik Darling shows that Predicate Pushdown does not work for (MAX) data types, which can severely limit performance.
like image 128
Hannah Vernon Avatar answered Sep 27 '22 20:09

Hannah Vernon