SQL Server Business Intelligence and Data Warehousing

Friday, 4 September 2015

SSIS - SSAS as Data Source - MDX Query Slow (Because it’s Executed Twice!)

Anyone that’s tried using SSAS as a data source (whether in an ETL tool or a data dump using TSQL openrowset and/or linked servers) might have noticed that longer queries seem to run for twice as long when compared to MDX executed directly against the cube in SSMS.

The reason for this is the database engine and SSIS need to know the ‘shape’ of the dataset. This is fine for most datasets because a relational query has its schema defined by the underlying data and a ‘get schema’ command to the data source returns the dataset metadata only.

The nature of mdx is such that the schema of the resultset could be driven by the data itself (eg NON EMPTY could result in columns being excluded) and is not known until the query is returned. So when the SSAS server gets a connection and query that’s only requesting the metadata it executes the query (to define the metadata) and returns that. A split second later the command for the data comes in and SSAS obliges and runs the query again.

Using a linked server or SSIS (with an oledb datasource pointing to an SSAS cube) to execute a query against SSAS results in the query running twice. Once to return the metadata header and again to return the data.

From the Profiler:

This is totally unnecessary for SSIS because the data flow already has the schema defined, and if the returned dataset is different then the Data Flow Task will fail during execution anyway. Setting Delay Validation = true and ValidateExternalMetadata=false makes no difference. The MDX query is still executed twice.

In most cases this is fine as queries are answered fairly quickly, and caching etc would probably help with the same query running in quick succession. But in some cases (like mine! - when the query is a very complex calculation driven leaf level query that can take minutes to hours to execute), the query taking twice as long just isn’t going to cut it.

Connecting to SSAS

There are two ways to connect to SSAS from SSIS using the built in data flow source connectors - OLEDB and ADO.NET

What’s Actually Happening?

OLEDB

The ‘Query Start’ event of SSIS executing a query against SSAS has some interesting additional properties. Comparing the two query executions (metadata and data) shows these differences:

<Content>MetaData</Content> and <ExecutionMode>Prepare</ExecutionMode> are the culprits here.

Is there any way to stop this first one from running or forcing it to behave like the second one (so that it also returns the data)? No, none that I could find :(

I tried to use the Extended Properties area of the connection in SSIS to modify these two properties but had no luck. Setting Content to SchemaData resulted in an error because data isn’t compatible with execution mode prepare. There is no doco on these two connection properties anywhere and I’m not a good enough coder to dive into the DLLs to figure out what the valid options are.

Maybe ADO.NET will work better?

ADO.NET

The slower and less functional cousin of OLEDB (no fast load, native query from variable support) also suffers from firing the same query twice. Probably because it’s simply the same OLEDB SSAS driver wrapped up in ADO.NET (whatever that means!)

It however sends slightly different properties:

Notice how the metadata query is has fewer properties. The data query adds on <Content>SchemaData</Content> and <Format>Tabular</Format>.

Could simply adding Content=SchemaData and Format=Tabular to the extended properties of the ADO.NET connection be enough?

Yes! It is!

Conclusion

Unfortunately the easier to use OLEDB data flow connection double dips when extracting data. ADO.NET is the only other option if time is a factor.

Simply add the following to your ADO.NET Connection Manager under All|Advanced|Extended Properties to stop it from querying the cube twice

Content=SchemaData;Format=Tabular

Enjoy!

Tuesday, 26 May 2015

Self Referencing Linked Servers and Linked Servers with a Logical Name Different to Physical Instance Name

Linked servers + stored procs used to be the go to ETL method in the good old days. But linked servers fell out of favour as ETL tools became more powerful, and physical data marts/warehouses became the flavour of the decade.

With the advent of in-memory analytics solutions they’ve made a bit of a comeback. The reason being, if the data is simple enough you can transform it into a star schema with views over the top of the operational data. These views are used as the source objects for loading into the in-memory analytics tool of choice (Power Pivot/Tabular cubes, Tableau, etc).

It’s a much more rapid dev/agile approach and delivers ROI within days/weeks as opposed to months (or longer!). There’s no monolithic ETL to manage, or need to acquire and manage a high performance server that only does stuff for an hour or two at night, then is only used for weekly or end of month reporting.

Database servers nowadays are fast enough to deliver data from views (you’d be doing the same transformation logic work, probably a few times over, to move the data from the OPS system to staging in your DM to merging into the final dim/facts, to loading into your cube or analytics package), and most organisations have at least one if not more data marts available already. Let the existing infrastructure do the work for you.

One scenario recently involved a number of horizontally partitioned databases. The schemas were identical, but the databases were partitioned by region. Each region had its own version of the database, some on different servers. And different functions were again separated (operational OLTP databases and customer details in the SAP databases)

I needed to transform the data into a star schema using a number views to UNION ALL the tables. But all the data needs to be visible from the same server instance, hence the linked servers.

There’s a few restrictions with the GUI tool that didn’t let me do what I wanted to and I couldn’t find much in the way of recent online details on how to do weird stuff with linked servers, such as:-

Giving a linked server to SQL Server a ‘logical’ name that differed from the physical instance name (e.g. so I could move my view code from dev to test to prod without needing to refactor my linked server queries)
Linking a server back to itself for situations where the infrastructure in dev didn’t match that in prod (e.g. all dev databases for the data that I’d be querying were on the same server, but were spread over multiple servers in production)

Here’s some snippets that’ll let you create circular linked servers (not allowed via the SSMS GUI), and also linked servers to MSSQL with a ‘logical’ name that differs from the server name (again, something not allowed via the SSMS GUI)

Linked Server with Different Logical Name

Say you want to create a linked server to instance [SQLDev\test1] and call it CustomerServer. You can put anything you want into the @datasrc parameter, including the server that you’re creating the linked server on.

Create the Linked Server

EXEC master.dbo.sp_addlinkedserver

@server = N'CustomerServer',

@srvproduct=N'MSSQL',

@provider=N'SQLNCLI',

@datasrc=N'SQLDev\test1'

Create the Credentials

Now you just need to add the credentials. Here are two examples, one that leverages the AD credentials of the user trying to use the linked server (user needs to exist on destination server and have permissions to the objects they’re trying to access), and another using an SQL service account with username customerdata and password customerdata (SQL user needs to exist on the destination server and have the required permissions to access the requested objects)

AD pass-through Authentication - needs Kerberos when double hopping

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'CustomerServer',

@useself=N'True',

@locallogin=NULL,

@rmtuser= NULL,

@rmtpassword= NULL

SQL Service Account - don’t forget to turn on mixed authentication on the target server

EXEC master.dbo.sp_addlinkedsrvlogin

@rmtsrvname=N'CustomerServer',

@useself=N'False',

@locallogin=NULL,

@rmtuser= N'customerdata',

@rmtpassword= N'customerdata'

Caution

Attempting to modify this linked server via the SSMS GUI will probably throw all kinds of weird errors. Just delete them and re-create if you need to change anything.

Friday, 21 November 2014

Excel PowerPivot/Tabular Error - The dimension Cardinality was not found in the cube

Quick fix/resolution for something that I noticed when populating a PowerPivot model with data from an OData feed via PowerQuery. The feed had nested XML tables that I pivoted into the main resultset using the PowerQuery “expand”. The resulting column name by default is ParentField.ChildField

I left the column names as default, added it to a PowerPivot model and tried to join it using one of these fields (Job[Client.ID] => Client[ID]) and received the following error

============================
Error Message:
============================

Query (1, 183) The dimension '[Client.ID_Cardinality]' was not found in the cube when the string, [Client.ID_Cardinality], was parsed.

It seems power pivot doesn’t quote identifiers. Quick fix is to rename the pivoted column (Job[Client.ID]) to one that doesn’t include a period in the column name (Job[ClientID])

An example!

Imagine I’ve imported the following data into excel

and added it into a data model

Now I try to relate the Job table to the Client table using [Job]Client.ID => Client[ID]

Quite an error message.

But if in the Job table I rename Client.ID to ClientID

Success!

Something to keep in mind, especially when expanding nested XML tables using PowerQuery. Strip those periods (and probably other special characters)!

Wednesday, 29 October 2014

SSISaaS (Integration Services as a Service) - Azure Data Factory

Microsoft recently announced the public preview of Azure Data Factory. The blurb makes it sound like cloud SSIS:

Compose data storage, movement, and processing services into streamlined data pipelines

Drive business decisions with trusted information. Tie your unstructured and structured data ecosystems together across on-premises and cloud sources using Hadoop processing. Monitor your pipelines and service health using the Azure portal. Consume trusted data with BI and analytics tools and applications.
http://azure.microsoft.com/en-us/services/data-factory/

But don’t be fooled!

Unfortunately in reality it’s a weird hack that requires a collection of hand written json scripts that hardcode the source,transform and destination metadata and need to be registered from your local machine using powershell.

The ADF documentation includes a simple example that copies the contents of a two column, two row CSV into an azure database table. http://azure.microsoft.com/en-us/documentation/articles/data-factory-get-started/

Setup tasks aside, I’ve shown the equivalent development objects for this side by side below (SSIS on the left, ADF on the right)

Recreating a simple CSV->DB table SSIS data flow requires 100ish lines of manually written json scripts (the azure data factory quickstart guide suggests using "notepad") split across multiple files and then locally executed azure powershell cmdlets to 'register' the json with the data factory.

Yes, that scripting needs to be freehanded. Yes, you need to manually upload the individual files to your azure blob storage. Yes, you need to manually execute azure powershell cmdlets locally to ‘register’ your uploaded json to the data factory. No, there is no pipeline/json authoring webapp in the ADF console that has templates, snippets, intellisense or some kind of syntax validation.

I don’t know whether I’m looking at this wrong, but moving even a trivial ETL workflow into this would be a significant effort. And debugging that json…………….. No thanks. I haven’t tried messing up the CSV or introducing a breaking change to the destination schema to see what kind of useful errors it spits out yet but scripting error messages tend to be fairly cryptic and I don’t see why this would be any different. the error messages for the trivial schema issues I introduced were pretty good actually…

The cost looks to be about a dollar a run for the above workflow (3 activities: read, copy, load) unless I’m reading it wrong (I think I am, as I’ve left the simple example experiment running indefinitely and it’s only showing as 0.58 activities). I might execute an SSIS project 20-50 times a day during development. A few (dozen) ADF dev runs to iron out some bugs might burn through the funds a bit too quickly.

I’m not sure what the purpose of this is. It sounds like there’s some cool stuff in there in regards to data monitoring and such, but not at the cost of basic ETL usability. Is it meant to be an ETL or (meta)data analysis tool for web devs, hence the reliance on json?

I can’t see myself using or recommending this in its current state for any kind of BI or data analytics ETL/ELT/ELTL work. It doesn’t do anything unique and the lack of a development gui for common tasks is a deal-breaker.

If you have a large volume of data in the cloud then it might be more worthwhile to spin up in the same affinity group as your blob storage a dedicated VM with SSIS installed and process it there. Use the new Azure Automation feature to spin this VM up and down on a schedule. A beefy VM will set you back a dollar a two an hour. And you can do your dev locally for free*.

This is a shame, and a bit of a shock really. I’ve been playing around in Azure Machine Learning, another azure preview offering, and the webapp gui in that is brilliant. It’s probably less buggy than the current SSDT 2012 SSIS gui! You can drag/drop to build experiments to train your models and publish them from within the gui.

(the ML GUI)

Wednesday, 1 October 2014

Generate Master Package from msdb.syspackages using BIML

Just a quick one – here’s some code that will generate a master package from the data in your msdb.syspackages table. It’s very generic and doesn’t use any custom metadata tables or objects. It’s standalone code that should successfully execute as is.

I usually use a modular approach for loading a data warehouse. I construct one package per target object and aim to have them as independent as possible. This allows me to re-use the same package for multiple workflows. I can use many methods to control the workflow, but usually it takes the form of a master package that is predominantly made up of Execute Package tasks.
It doesn’t take long for for the number of packages in even a small project to start to add up. And manually creating the master package is not a good use of anyone’s time!

The code has been tested in 2008R2 and 2012 (package deployment model only) and assumes you’ve used the SQL Server as your package store. It parses the msdb SSIS Packages table and creates a sequence container for the specified root and every child folder.

An Execute Package task is created in parallel mode in every sequence container that points to the packages that exist in that folder in your SSIS repository. There are two variables that need to be changed. The sServer variable which refers to the SQL server instance that the SSIS packages are deployed to, and the sRootFolder which refers to the SSIS repository folder that holds the packages you want in your master package. It includes all child folders and adds any packages they contain.

<# String sServer = @".\sql2k12";#>      
<# String sRootFolder =@"";#>

<#@ template language="C#" hostspecific="true"#>
<#@ import namespace="System.Data" #>
<# String sConn = String.Format("Provider=SQLNCLI10;Server={0};Initial Catalog=msdb;Integrated Security=SSPI",sServer); #>
<# String sSQL = String.Format(@"
;with cte as (
    SELECT pf.foldername AS RootFolderName
    ,       pf.folderid AS FolderID
    ,       convert(nvarchar(4000),pf.FolderName+'\') AS FolderPath
    ,       pf.FolderName AS FolderName
    ,       cast(null as sysname) as ParentFolder
    from        msdb..sysssispackagefolders pf
    WHERE   pf.foldername = '{0}'
    union all
    SELECT cte.RootFolderName AS RootFolderName
    ,       pf.folderid AS folderid
    ,       convert(nvarchar(4000),cte.FolderPath+pf.foldername+'\') AS FolderPath
    ,       pf.FolderName as FolderName
    ,       cte.FolderName AS ParentFolder
    from        msdb..sysssispackagefolders pf
    INNER join cte
    on              cte.folderid = pf.parentfolderid
)
select RootFolderName AS RootFolderName
,       pf.FolderPath AS FolderPath
,       pf.ParentFolder AS ParentFolder
,       pf.FolderName AS FolderName
,       sp.name as PackageName
,       ROW_NUMBER() OVER (PARTITION BY pf.FolderPath ORDER BY sp.Name ASC) AS StartId
,       ROW_NUMBER() OVER (PARTITION BY pf.FolderPath ORDER BY sp.Name DESC) AS EndId
from        cte pf
left join   msdb..sysssispackages sp
on              sp.folderid = pf.folderid
Order By FolderPath, PackageName
",sRootFolder); #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
       <Connections>
<Connection Name="SSIS Package Store" ConnectionString="Data Source=<#=sServer#>;Initial Catalog=msdb;Provider=SQLNCLI10.1;Integrated Security=SSPI;Auto Translate=False;Application Name=SSIS Package Store;" />
       </Connections>
       <Packages>
             <Package Name="Master Package" ConstraintMode="Parallel" AutoCreateConfigurationsType="None">
                   
                   <Tasks>
                        <# DataTable tblPackages = ExternalDataAccess.GetDataTable(sConn,sSQL); #>
                        <# foreach (DataRow row in tblPackages.Rows){ #>
                            <# if(row["StartId"].ToString() == "1") { #>
                                <# if(row["FolderName"]=="") { #>
                                    <Container Name="\" ConstraintMode="Parallel">
                                <# } else { #>
                                    <Container Name="<#=row["FolderPath"]#>" ConstraintMode="Parallel">
                                <#}#>
                                <# if(!row.IsNull("PackageName")) { #>
                                    <Tasks>
                                <# }
                            }    #>
                            <# if(!row.IsNull("PackageName")) { #>
                                  <ExecutePackage Name="Execute <#=row["PackageName"]#>">
                                     <SqlServer ConnectionName="SSIS Package Store" PackagePath="\<#=row["FolderPath"]#><#=row["PackageName"]#>" />
                                </ExecutePackage>
                            <#}#>

                            <# if(row["EndId"].ToString() == "1") { #>
                                <# if(!row.IsNull("PackageName")) { #>
                                    </Tasks>
                                <# } #>

                                </Container>
                            <#}    #>
                    <# }#>
                   </Tasks>
             </Package>
       </Packages>
</Biml>

Create a new BIML file in your solution, paste the code in and just specify the connection string to your MSDB database and the root folder (inclusive) in your SSIS package store of the packages you want to be part of the master package.

An Execute Package task is created in parallel mode in every sequence container that points to the packages that exist in that folder in your SSIS repository. It should create a package with a separate sequence container for each folder and every package placed in parallel mode in the relevant sequence container. There’ll also be a connection to your ssis package store.

It doesn’t do nested folders but feel free to add it in – it’d need either a recursive TSQL function or a recursive .NET function hacked into the BIML code. BIML has no .Net debugger and intellisense doesn’t work, so writing any kind of complex .net code is just asking for trouble I think.

This is the type of task that I see myself using BIML for a lot more; automating the tedious, simple parts of SSIS package and solutions. Manually creating a execute package task takes probably close to a dozen clicks and mouse drags that simply pick things off a list. This is what scripting like this excels at. More complex logic like precedence constraints or automating SSIS creation…. It doesn’t take long for the helper scripts to become more complex, harder to debug and take longer to maintain than the files they create.

Say I have a project I’ve deployed to a MasterDataLoad directory which I specify as my sRootFolder. The BIML code will query the following SSIS repository:

and output the following package:

Now all I’d need to do is rearrange it , rename or remove any sequence containers I don’t need and add any precedence constraints specific to the workflow I’m creating.

Thursday, 24 July 2014

SSIS: A deadlock was detected while trying to lock variable

Occasionally, especially when using an SSIS framework with an orchestration table of some kind and event handling, you might come across “Error: A deadlock was detected while trying to lock variable "User::xxxxxxx" for read access. A lock could not be acquired after 16 attempts and timed out.”

What this means is that two processes are attempting to read/write to the same SSIS variable at the same time.

The usual workaround is to explicitly lock/release your variables in your script task or transformation as described here: http://toddmcdermid.blogspot.com/2010/08/use-readonlyvariables-and.html, but what if it’s the out of the box tasks (like the Execute SQL Task that uses parameters and/or saves a resultset to variables) that are locking? Does this mean you need to code your own version in .NET as a script task and handle the variable locking better?

Execute SQL Task and Event Handlers

Imagine setting up a metadata or queue driven ETL framework similar to this:

The “Get Id” task populates some package level variables for this specific ETL package. The work is done in the “Do Work” container, then the ETL run is flagged as complete in some orchestration table.

Now, imagine the “Get Id” task fails and your custom OnError package level event handler uses one of the variables referenced in the “Get Id” task .

You’ll be greeted with the following error after 30-120 seconds (depending on your timeout settings).

Put simply this happens because the variable is locked in the “Get Id” task and the OnError event is fired asynchronously and bubbled up to the package level event handler before the “Get Id” task is finished. The “Log Failure” task in the event handler cannot get access to the variable, and it fires an error.

While it is a problem, it’s not the root cause of the failure, and it could leave some poor application support guy chasing dead ends for a package that locks up and dies intermittently. The actual issue is that the “Get Id” task expects a single row result but the query isn’t returning anything. Wouldn’t it be nice if we could ignore this error from showing up and just focus on the root cause?

Break the Execution Chain

There is a way to break the execution chain. That is, to separate the “Get Id” task and prevent it from getting to the package level OnError handler while still causing your package to fail. This will allow the “Get Id” task to complete before the package level OnError event handler is fired.

Create an OnError event handler on the deadlocking task (“Get Id”) and set the system Propagate flag to False

You can leave the OnError event handler empty or have a custom event for an error that occurs at the initiation step of your package. The Propagate flag set to false means an error on the “Get Id” task doesn’t get bubbled up.
Set the FailParentOnFailure and FailPackageOnFailure to true for the “Get Id” task. Not doing this will result in your package either continuing to execute after the Get Id failure or reporting execution success at the package level.

That’s it! Now when the package executes we get no deadlocks

Disclaimer

The above example is just a quick and simple mock up that shows the deadlocking in action, and probably breaks quite a few design and best practise rules. There are multiple ways to code defensively to prevent a null result where a row is expected from breaking your package, but that’s a story for another day!

edit 31 July 2014: The above method won't fire the Start container's and package level OnError event handlers (since it's not being propagated). One way around this is to place a script task in the Get Id's OnError handler and just change the Dts.TaskResult = (int)ScriptResults.Success to Dts.TaskResult = (int)ScriptResults.Failure in the script. This script task will fail and propagate up to your higher level OnError handlers.

Saturday, 28 June 2014

SSIS and Minimally Logged Inserts

The SQL Server engine offers numerous methods to perform minimally logged inserts into an empty clustered table (http://technet.microsoft.com/en-us/library/ms191244%28v=sql.105%29.aspx) if we meet the prerequisites (http://technet.microsoft.com/en-us/library/ms190422%28v=sql.105%29.aspx), but what about when using SSIS? There’s quite a few places online that mention that it’s supported as an aside but don’t give away too much info on what the prerequisites are.

Going from the prerequisites listed for the other bulk inserts you’d think that a a sorted insert into an OleDb destination with Table lock turned on, Rows per Batch and Maximum Insert Commit Size left to the defaults, and an ORDER hint on the clustered column(s) would be enough, but is it?

Usually, no. Inserts into empty clustered tables (B-trees) via SSIS are minimally logged if the following is true:

The Bulk Insert task is used with a flat file sorted in clustered key order and the following options are set:
- Table lock: checked
- Batch size: 0
- SortedData has the clustered key order specified

only useful if your source is a ordered flat file. Not much use when moving data from another database table.

The SQL Destination task is used with a source sorted in clustered key order and the following options are set:
- Table lock: checked
- Order columns has the clustered key order specified

only useful when the SSIS package is executed on the same server as the SQL instance.

The OleDb destination task is used with a source sorted in clustered key order and the following options are set:
- Table lock: checked
- Rows per batch: blank (default)
- Maximum insert commit size: 0
- FastLoadOptions has order hint added: TABLOCK,ORDER(<Clustered Columns,..>)

And there you have it! Keep reading if you’re interested in the “why” and an explanation of the warning message you’ll get when setting the insert commit size to 0, but all the important info is above.

The Investigation

First, we need some data.

create table dbo.source_data (id int primary key, data char(100) not null)
go
create table dbo.destination_pk (id int primary key, data char(100) not null)
go
insert into dbo.source_data (id,data)
select top 850000 row_number() over (order by sysdatetime()), 'blah blah' from master..spt_values a, master..spt_values b, master..spt_values c
go
sp_spaceused 'dbo.source_data'
go

The above code will generate about 95 meg of data. I’ve tried to stay under 100 meg for an important reason that’s related to the internal SSIS buffers (but more on that later).

Next, we need some way to see what’s happening in the logs. For that we can use the undocumented fn_dblog() function (http://www.confio.com/logicalread/sql-server-dbcc-log-command-details-and-example/).

-- Log records and record size
SELECT COUNT(*)AS numrecords,
CAST((COALESCE(SUM([Log Record LENGTH]), 0))
/ 1024. / 1024. AS NUMERIC(12, 2)) AS size_mb
FROM sys.fn_dblog(NULL, NULL) AS D
WHERE AllocUnitName = 'dbo.destination_pk' OR AllocUnitName LIKE 'dbo.destination_pk.%';

-- Breakdown of Log Record Types
SELECT Operation, Context,
AVG([Log Record LENGTH]) AS AvgLen, COUNT(*) AS Cnt
FROM sys.fn_dblog(NULL, NULL) AS D
WHERE AllocUnitName = 'dbo.destination_pk' OR AllocUnitName LIKE 'dbo.destination_pk.%'
GROUP BY Operation, Context, ROUND([Log Record LENGTH], -2)
ORDER BY AvgLen, Operation, Context;

(The above SQL for retrieving the details for these tests was ruthlessly stolen from: http://sqlmag.com/t-sql/minimally-logged-inserts)

Baseline

We need to perform a minimally logged insert into our destination table to have something to compare SSIS against. I exported our test data into a text file in native format using bcp (c:\> bcp "SELECT id,data from dbo.source_data order by id" queryout c:\myout.txt -T -S. -dtest -n)

I then imported it back using BULK INSERT

bulk insert test.dbo.destination_pk from 'c:\myout.txt'
with(
DATAFILETYPE = 'native'
,TABLOCK
,ORDER(ID))

The results from fn_dblog() are

numrecords	size_mb
4716	0.31


Operation	Context	AvgLen	Cnt
LOP_SET_BITS	LCX_GAM	60	1529
LOP_SET_BITS	LCX_IAM	60	1529
LOP_FORMAT_PAGE	LCX_IAM	84	1
LOP_MODIFY_ROW	LCX_PFS	88	1649
LOP_MODIFY_ROW	LCX_IAM	100	8

This tells us that there were 4176 log entries that take up 0.31 of a megabyte. Quite a bit less than the 850,000 rows and 95meg of data, so we can be fairly sure the insert was minimally logged.

SSIS Scenario 1 – Default Data Flow

A simple data flow like the one described in the opening paragraphs. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
Rows per batch: blank (default)
Maximum insert commit size: 2147483647 (default)
FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

OleDb Scenario 1

The data flow DefaultBufferMaxRows and DefaultBufferSize are left at the default of 10000 and 10485760 respectively.

The results from fn_dblog() are

numrecords	size_mb
963726	165.54


Operation	Context	AvgLen	Cnt
LOP_DELETE_SPLIT	LCX_INDEX_INTERIOR	60	43
LOP_SET_BITS	LCX_GAM	60	1530
LOP_SET_BITS	LCX_IAM	60	1530
LOP_INSYSXACT	LCX_CLUSTERED	62	47332
LOP_INSYSXACT	LCX_INDEX_INTERIOR	62	23928
LOP_MODIFY_ROW	LCX_PFS	81	13511
LOP_FORMAT_PAGE	LCX_HEAP	84	11833
LOP_FORMAT_PAGE	LCX_IAM	84	1
LOP_FORMAT_PAGE	LCX_INDEX_INTERIOR	84	44
LOP_INSERT_ROWS	LCX_INDEX_INTERIOR	84	11877
LOP_MODIFY_HEADER	LCX_HEAP	84	11833
LOP_MODIFY_HEADER	LCX_INDEX_INTERIOR	84	43
LOP_MODIFY_ROW	LCX_IAM	100	8
LOP_INSERT_ROWS	LCX_CLUSTERED	196	840170
LOP_INSERT_ROWS	LCX_INDEX_INTERIOR	5012	43

Not minimally logged, then? More log entries than we have rows, and over 50% more data. It’s also interesting that there are 840,170 individual inserts into the clustered index even though we had 850,000 rows. So that leaves 9,830 rows that appear to have been minimally logged. That looks suspiciously like the ‘average’ size we see for an SSIS buffer (you know how the rows in the dataflow seem to increase by approximately 10,0000 rows per refresh?). In fact if we turn on a data viewer we see that each full buffer contains 9,830 rows.

Buffer Scenario 1

What happens when we increase the data flow buffers from the default 10,000 rows and 10 megabytes?

SSIS Scenario 2 – Maximum Data Flow Buffers

A simple data flow like the one described in scenario 1. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
Rows per batch: blank (default)
Maximum insert commit size: 2147483647 (default)
FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

But this time we also change the DefaultBufferMaxRows and DefaultBufferSize to 1000000 and 104857600 respectively. 100 megabytes is a hard limit in SSIS for the buffer size (remember how our source data table size is just under 100 meg?)

Data Flow Scenario 2

The results from fn_dblog() are

numrecords	size_mb
4716	0.31


Operation	Context	AvgLen	Cnt
LOP_SET_BITS	LCX_GAM	60	1529
LOP_SET_BITS	LCX_IAM	60	1529
LOP_FORMAT_PAGE	LCX_IAM	84	1
LOP_MODIFY_ROW	LCX_PFS	88	1649
LOP_MODIFY_ROW	LCX_IAM	100	8

The same as our baseline! So does this mean that we can only minimally log the first 100 megabytes of a data flow? But more importantly why is this happening?

According to the OleDb Destination page on TechNet (http://msdn.microsoft.com/en-us/library/ms188439.aspx) we have the following two settings:

Rows per batch
- Specify the number of rows in a batch. The default value of this property is –1, which indicates that no value has been assigned.
- Clear the text box in the OLE DB Destination Editor to indicate that you do not want to assign a custom value for this property.
Maximum insert commit size
- Specify the batch size that the OLE DB destination tries to commit during fast load operations. The value of 0 indicates that all data is committed in a single batch after all rows have been processed.
- A value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To prevent the package from stopping, set the Maximum insert commit size option to 2147483647.
- If you provide a value for this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size, or (b) the remaining rows in the buffer that is currently being processed.

The above description reads as if 0 and 2147483647 are interchangeable for the Maximum insert commit size, and as non-zero value will avoid the issue with locking it should be the one to use.

But a non-zero value for the Maximum insert size is treated by SSIS as “just another number” whether it’s the default 2147483647 or not, and then the last bolded point comes into play. A batch for our minimally logged insert scenario is the smallest of:

Rows per batch. Blank means it’s ignored.
Maximum insert commit size: 2147483647 rows
DefaultBufferMaxRows: 10000 by default. Max of 2147483647
DefaultBufferSize, 10485760 bytes by default. Max of 104857600

It’s the data flow properties in most cases that end up dictating the insert batch size and that’s what we saw in scenario 1. Batches are very important when inserting into an empty B-tree because only the first batch is minimally logged, after then you’re inserting into a non-empty B-tree and are fully logged. Only the first 10meg batch was minimally logged in scenario 1 and the remaining data inserted was fully logged. The buffer defaults and maximums are documented here http://msdn.microsoft.com/en-us/library/ms141031%28v=sql.105%29.aspx

What about when we set the Maximum insert commit size to 0, warnings be damned?

SSIS Scenario 3 – Maximum Inset Commit Size Set to 0

A simple data flow like the one described in the above. A default package with a data flow OleDb source (select id,data from dbo.source_data order by id) into an OleDb destination with the following settings:

Table lock: checked. Keep identity, Keep nulls and Check constraints unchecked
Rows per batch: blank (default)
Maximum insert commit size: 0
FastLoadOptions has order hint added: TABLOCK,ORDER(ID)

OleDb Scenario 3

The data flow DefaultBufferMaxRows and DefaultBufferSize are left at the default of 10000 and 10485760 respectively.

he results from fn_dblog() are

numrecords	size_mb
4716	0.31


Operation	Context	AvgLen	Cnt
LOP_SET_BITS	LCX_GAM	60	1529
LOP_SET_BITS	LCX_IAM	60	1529
LOP_FORMAT_PAGE	LCX_IAM	84	1
LOP_MODIFY_ROW	LCX_PFS	88	1649
LOP_MODIFY_ROW	LCX_IAM	100	8

The same as our baseline and therefore minimally logged. The benefit of this setup is that it applies to any volume of data, whereas scenario 2 only helps with data volumes below 100meg. Moving less than 100 meg is a few second job fully logged or unlogged so leaving the Maximum insert commit size at the default and maxing out the DefaultBufferSize is kind of useless. But what about the informational warning we get from the OleDb destination about the package potentially not responding?

[OLE DB Destination [2]] Information: The Maximum insert commit size property of the OLE DB destination "OLE DB Destination" is set to 0. This property setting can cause the running package to stop responding. For more information, see the F1 Help topic for OLE DB Destination Editor (Connection Manager Page).

The help menu (and the excerpt from TechNet pasted in the SSIS Scenario 2 section above) states: A value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To prevent the package from stopping, set the Maximum insert commit size option to 2147483647.

It only applies when multiple processes are inserting into the same table. We aren’t trying to have multiple processes insert data into our destination table concurrently, in fact we have a table lock enabled which would prevent this anyway. So this warning even doesn’t apply to minimally logged insert scenarios.

Epilogue – Trace Flag 610

Trace flag 610 allows minimally logged inserts into non-empty indexed tables, and turning it on actually does greatly reduce the log activity in scenario 1 (http://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx).

numrecords	size_mb
67144	5.68


Operation	Context	AvgLen	Cnt
LOP_DELETE_SPLIT	LCX_INDEX_INTERIOR	60	37
LOP_INSYSXACT	LCX_CLUSTERED	62	284
LOP_INSYSXACT	LCX_INDEX_INTERIOR	62	364
LOP_SET_BITS	LCX_GAM	62	2234
LOP_SET_BITS	LCX_IAM	62	2234
LOP_MODIFY_HEADER	LCX_BULK_OPERATION_PAGE	76	9793
LOP_FORMAT_PAGE	LCX_BULK_OPERATION_PAGE	84	14160
LOP_FORMAT_PAGE	LCX_INDEX_INTERIOR	84	37
LOP_INSERT_ROWS	LCX_INDEX_INTERIOR	84	9793
LOP_MODIFY_HEADER	LCX_HEAP	84	19512
LOP_MODIFY_HEADER	LCX_INDEX_INTERIOR	84	37
LOP_MODIFY_ROW	LCX_PFS	90	6137
LOP_INSERT_ROWS	LCX_CLUSTERED	196	2485
LOP_INSERT_ROWS	LCX_INDEX_INTERIOR	5012	37

But just changing the Maximum insert commit size to 0 will result in a true minimally logged insert, so definitely go with that if you want a minimally logged insert into an empty indexed table.

Trace flag 610 is certainly something to look into after the initial load/migration of say large fact tables as it will usually result in a performance increase in subsequent incremental loads.