Saturday, 28 June 2014

SSIS and Minimally Logged Inserts

The SQL Server engine offers numerous methods to perform minimally logged inserts into an empty clustered table (http://technet.microsoft.com/en-us/library/ms191244%28v=sql.105%29.aspx) if we meet the prerequisites (http://technet.microsoft.com/en-us/library/ms190422%28v=sql.105%29.aspx), but what about when using SSIS? There’s quite a few places online that mention that it’s supported as an aside but don’t give away too much info on what the prerequisites are.

Going from the prerequisites listed for the other bulk inserts you’d think that a a sorted insert into an OleDb destination with Table lock turned on, Rows per Batch and Maximum Insert Commit Size left to the defaults, and an ORDER hint on the clustered column(s) would be enough, but is it?

Usually, no. Inserts into empty clustered tables (B-trees) via SSIS are minimally logged if the following is true:

  • The Bulk Insert task is used with a flat file sorted in clustered key order and the following options are set:
    • Table lock: checked
    • Batch size: 0
    • SortedData has the clustered key order specified
    only useful if your source is a ordered flat file. Not much use when moving data from another database table.
    • The SQL Destination task is used with a source sorted in clustered key order and the following options are set:
      • Table lock: checked
      • Order columns has the clustered key order specified

    only useful when the SSIS package is executed on the same server as the SQL instance.

    • The OleDb destination task is used with a source sorted in clustered key order and the following options are set:
      • Table lock: checked
      • Rows per batch: blank (default)
      • Maximum insert commit size: 0
      • FastLoadOptions has order hint added: TABLOCK,ORDER(<Clustered Columns,..>)

    And there you have it! Keep reading if you’re interested in the “why” and an explanation of the warning message you’ll get when setting the insert commit size to 0, but all the important info is above.

    The Investigation

    First, we need some data.

    create table dbo.source_data (id int primary key, data char(100) not null)
    go
    create table dbo.destination_pk (id int primary key, data char(100) not null)
    go
    insert into dbo.source_data (id,data)
    select top 850000 row_number() over (order by sysdatetime()), 'blah blah' from master..spt_values a, master..spt_values b, master..spt_values c
    go
    sp_spaceused 'dbo.source_data'
    go

    The above code will generate about 95 meg of data. I’ve tried to stay under 100 meg for an important reason that’s related to the internal SSIS buffers (but more on that later).

    Next, we need some way to see what’s happening in the logs. For that we can use the undocumented fn_dblog() function (http://www.confio.com/logicalread/sql-server-dbcc-log-command-details-and-example/).

    -- Log records and record size
    SELECT COUNT(*)AS numrecords,
      CAST((COALESCE(SUM([Log Record LENGTH]), 0))
        / 1024. / 1024. AS NUMERIC(12, 2)) AS size_mb
    FROM sys.fn_dblog(NULL, NULL) AS D
    WHERE AllocUnitName = 'dbo.destination_pk' OR AllocUnitName LIKE 'dbo.destination_pk.%';

    -- Breakdown of Log Record Types
    SELECT Operation, Context,
      AVG([Log Record LENGTH]) AS AvgLen, COUNT(*) AS Cnt
    FROM sys.fn_dblog(NULL, NULL) AS D
    WHERE AllocUnitName = 'dbo.destination_pk' OR AllocUnitName LIKE 'dbo.destination_pk.%'
    GROUP BY Operation, Context, ROUND([Log Record LENGTH], -2)
    ORDER BY AvgLen, Operation, Context;

    (The above SQL for retrieving the details for these tests was ruthlessly stolen from: http://sqlmag.com/t-sql/minimally-logged-inserts)

    Baseline

    We need to perform a minimally logged insert into our destination table to have something to compare SSIS against. I exported our test data into a text file in native format using bcp  (c:\> bcp "SELECT id,data from dbo.source_data order by id" queryout c:\myout.txt -T -S. -dtest -n)

    I then imported it back using BULK INSERT

    bulk insert test.dbo.destination_pk from 'c:\myout.txt'
    with(
    DATAFILETYPE = 'native'
    ,TABLOCK
    ,ORDER(ID))

    The results from fn_dblog() are

    numrecords size_mb    
    4716 0.31    
           
           
    Operation Context AvgLen Cnt
    LOP_SET_BITS LCX_GAM 60 1529
    LOP_SET_BITS LCX_IAM 60 1529
    LOP_FORMAT_PAGE LCX_IAM 84 1
    LOP_MODIFY_ROW LCX_PFS 88 1649

    LOP_MODIFY_ROW

    LCX_IAM 100 8

    This tells us that there were 4176 log entries that take up 0.31 of a megabyte. Quite a bit less than the 850,000 rows and 95meg of data, so we can be fairly sure the insert was minimally logged.

    SSIS Scenario 1 – Default Data Flow

    A simple data flow like the one described in the opening paragraphs. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

    • Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
    • Rows per batch: blank (default)
    • Maximum insert commit size: 2147483647 (default)
    • FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

    OleDb Scenario 1

    The data flow DefaultBufferMaxRows and DefaultBufferSize are left at the default of 10000 and 10485760 respectively.

    The results from fn_dblog() are

    numrecords size_mb    
    963726 165.54    
           
           
    Operation Context AvgLen Cnt
    LOP_DELETE_SPLIT LCX_INDEX_INTERIOR 60 43
    LOP_SET_BITS LCX_GAM 60 1530
    LOP_SET_BITS LCX_IAM 60 1530
    LOP_INSYSXACT LCX_CLUSTERED 62 47332
    LOP_INSYSXACT LCX_INDEX_INTERIOR 62 23928
    LOP_MODIFY_ROW LCX_PFS 81 13511
    LOP_FORMAT_PAGE LCX_HEAP 84 11833
    LOP_FORMAT_PAGE LCX_IAM 84 1
    LOP_FORMAT_PAGE LCX_INDEX_INTERIOR 84 44
    LOP_INSERT_ROWS LCX_INDEX_INTERIOR 84 11877
    LOP_MODIFY_HEADER LCX_HEAP 84 11833
    LOP_MODIFY_HEADER LCX_INDEX_INTERIOR 84 43
    LOP_MODIFY_ROW LCX_IAM 100 8
    LOP_INSERT_ROWS LCX_CLUSTERED 196 840170
    LOP_INSERT_ROWS LCX_INDEX_INTERIOR 5012 43

    Not minimally logged, then? More log entries than we have rows, and over 50% more data. It’s also interesting that there are 840,170 individual inserts into the clustered index even though we had 850,000 rows. So that leaves 9,830 rows that appear to have been minimally logged. That looks suspiciously like the ‘average’ size we see for an SSIS buffer (you know how the rows in the dataflow seem to increase by approximately 10,0000 rows per refresh?). In fact if we turn on a data viewer we see that each full buffer contains 9,830 rows.

    Buffer Scenario 1

    What happens when we increase the data flow buffers from the default 10,000 rows and 10 megabytes?

    SSIS Scenario 2 – Maximum Data Flow Buffers

    A simple data flow like the one described in scenario 1. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

    • Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
    • Rows per batch: blank (default)
    • Maximum insert commit size: 2147483647 (default)
    • FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

    But this time we also change the DefaultBufferMaxRows and DefaultBufferSize to 1000000 and 104857600 respectively. 100 megabytes is a hard limit in SSIS for the buffer size (remember how our source data table size is just under 100 meg?)

    Data Flow Scenario 2

    The results from fn_dblog() are

    numrecords size_mb    
    4716 0.31    
           
           
    Operation Context AvgLen Cnt
    LOP_SET_BITS LCX_GAM 60 1529
    LOP_SET_BITS LCX_IAM 60 1529
    LOP_FORMAT_PAGE LCX_IAM 84 1
    LOP_MODIFY_ROW LCX_PFS 88 1649

    LOP_MODIFY_ROW

    LCX_IAM 100 8

    The same as our baseline! So does this mean that we can only minimally log the first 100 megabytes of a data flow? But more importantly why is this happening?

    According to the OleDb Destination page on TechNet (http://msdn.microsoft.com/en-us/library/ms188439.aspx) we have the following two settings:

    • Rows per batch
      • Specify the number of rows in a batch. The default value of this property is –1, which indicates that no value has been assigned.
      • Clear the text box in the OLE DB Destination Editor to indicate that you do not want to assign a custom value for this property.
    • Maximum insert commit size
      • Specify the batch size that the OLE DB destination tries to commit during fast load operations. The value of 0 indicates that all data is committed in a single batch after all rows have been processed.
      • A value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To prevent the package from stopping, set the Maximum insert commit size option to 2147483647.
      • If you provide a value for this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size, or (b) the remaining rows in the buffer that is currently being processed.

    The above description reads as if 0 and 2147483647 are interchangeable for the Maximum insert commit size, and as non-zero value will avoid the issue with locking it should be the one to use.

    But a non-zero value for the Maximum insert size is treated by SSIS as “just another number” whether it’s the default 2147483647 or not, and then the last bolded point comes into play. A batch for our minimally logged insert scenario is the smallest of:

    • Rows per batch. Blank means it’s ignored.
    • Maximum insert commit size: 2147483647 rows
    • DefaultBufferMaxRows: 10000 by default. Max of 2147483647
    • DefaultBufferSize, 10485760 bytes by default. Max of 104857600

    It’s the data flow properties in most cases that end up dictating the insert batch size and that’s what we saw in scenario 1. Batches are very important when inserting into an empty B-tree because only the first batch is minimally logged, after then you’re inserting into a non-empty B-tree and are fully logged. Only the first 10meg batch was minimally logged in scenario 1 and the remaining data inserted was fully logged. The buffer defaults and maximums are documented here http://msdn.microsoft.com/en-us/library/ms141031%28v=sql.105%29.aspx

    What about when we set the Maximum insert commit size to 0, warnings be damned?

    SSIS Scenario 3 – Maximum Inset Commit Size Set to 0

    A simple data flow like the one described in the above. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

    • Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
    • Rows per batch: blank (default)
    • Maximum insert commit size: 0
    • FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

    OleDb Scenario 3

    The data flow DefaultBufferMaxRows and DefaultBufferSize are left at the default of 10000 and 10485760 respectively.

    he results from fn_dblog() are

    numrecords size_mb    
    4716 0.31    
           
           
    Operation Context AvgLen Cnt
    LOP_SET_BITS LCX_GAM 60 1529
    LOP_SET_BITS LCX_IAM 60 1529
    LOP_FORMAT_PAGE LCX_IAM 84 1
    LOP_MODIFY_ROW LCX_PFS 88 1649

    LOP_MODIFY_ROW

    LCX_IAM 100 8

    The same as our baseline and therefore minimally logged. The benefit of this setup is that it applies to any volume of data, whereas scenario 2 only helps with data volumes below 100meg. Moving less than 100 meg is a few second job fully logged or unlogged so leaving the Maximum insert commit size at the default and maxing out the DefaultBufferSize is kind of useless. But what about the informational warning we get from the OleDb destination about the package potentially not responding?

    [OLE DB Destination [2]] Information: The Maximum insert commit size property of the OLE DB destination "OLE DB Destination" is set to 0. This property setting can cause the running package to stop responding. For more information, see the F1 Help topic for OLE DB Destination Editor (Connection Manager Page).

    The help menu (and the excerpt from TechNet pasted in the SSIS Scenario 2 section above) states: A value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To prevent the package from stopping, set the Maximum insert commit size option to 2147483647.

    It only applies when multiple processes are inserting into the same table. We aren’t trying to have multiple processes insert data into our destination table concurrently, in fact we have a table lock enabled which would prevent this anyway. So this warning even doesn’t apply to minimally logged insert scenarios.

    Epilogue – Trace Flag 610

    Trace flag 610 allows minimally logged inserts into non-empty indexed tables, and turning it on actually does greatly reduce the log activity in scenario 1 (http://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx).

    numrecords size_mb    
    67144 5.68    
           
           
    Operation Context AvgLen Cnt
    LOP_DELETE_SPLIT LCX_INDEX_INTERIOR 60 37
    LOP_INSYSXACT LCX_CLUSTERED 62 284
    LOP_INSYSXACT LCX_INDEX_INTERIOR 62 364
    LOP_SET_BITS LCX_GAM 62 2234
    LOP_SET_BITS LCX_IAM 62 2234
    LOP_MODIFY_HEADER LCX_BULK_OPERATION_PAGE 76 9793
    LOP_FORMAT_PAGE LCX_BULK_OPERATION_PAGE 84 14160
    LOP_FORMAT_PAGE LCX_INDEX_INTERIOR 84 37
    LOP_INSERT_ROWS LCX_INDEX_INTERIOR 84 9793
    LOP_MODIFY_HEADER LCX_HEAP 84 19512
    LOP_MODIFY_HEADER LCX_INDEX_INTERIOR 84 37
    LOP_MODIFY_ROW LCX_PFS 90 6137
    LOP_INSERT_ROWS LCX_CLUSTERED 196 2485
    LOP_INSERT_ROWS LCX_INDEX_INTERIOR 5012 37

    But just changing the Maximum insert commit size to 0 will result in a true minimally logged insert, so definitely go with that if you want a minimally logged insert into an empty indexed table.

    Trace flag 610 is certainly something to look into after the initial load/migration of say large fact tables as it will usually result in a performance increase in subsequent incremental loads.