Wednesday 29 October 2014

SSISaaS (Integration Services as a Service) - Azure Data Factory

Microsoft recently announced the public preview of Azure Data Factory. The blurb makes it sound like cloud SSIS:

Compose data storage, movement, and processing services into streamlined data pipelines

Drive business decisions with trusted information. Tie your unstructured and structured data ecosystems together across on-premises and cloud sources using Hadoop processing. Monitor your pipelines and service health using the Azure portal. Consume trusted data with BI and analytics tools and applications.
http://azure.microsoft.com/en-us/services/data-factory/

But don’t be fooled!

Unfortunately, in reality it’s a weird hack that requires a collection of hand-written json files that hardcode the source, transform and destination metadata, and that need to be registered from your local machine using PowerShell.

The ADF documentation includes a simple example that copies the contents of a two-column, two-row CSV into an Azure SQL Database table: http://azure.microsoft.com/en-us/documentation/articles/data-factory-get-started/
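For flavour, the hand-written json for one of the objects in that example – the "linked service" pointing at the blob storage account holding the CSV – looks roughly like this (reproduced from memory of the walkthrough, so treat the exact property names as indicative):

{
    "name": "StorageLinkedService",
    "properties":
    {
        "type": "AzureStorageLinkedService",
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<yourAccount>;AccountKey=<yourKey>"
    }
}

And you need one of these files for every linked service, table and pipeline in the workflow.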

Setup tasks aside, I’ve shown the equivalent development objects for this side by side below (SSIS on the left, ADF on the right):

[screenshot: the SSIS data flow and the equivalent ADF json, side by side]

Recreating a simple CSV-to-database-table SSIS data flow requires 100-ish lines of manually written json (the Azure Data Factory quickstart guide suggests using "notepad"), split across multiple files, plus locally executed Azure PowerShell cmdlets to 'register' the json with the data factory.

Yes, that scripting needs to be freehanded. Yes, you need to manually upload the individual files to your Azure blob storage. Yes, you need to manually execute Azure PowerShell cmdlets locally to ‘register’ your uploaded json with the data factory (see the sketch below). No, there is no pipeline/json authoring webapp in the ADF portal with templates, snippets, intellisense or any kind of syntax validation.
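The registration dance goes something like this – cmdlet and file names are as I remember them from the October 2014 preview module and the quickstart, so treat this as a sketch rather than gospel:

# Sketch only - cmdlet names as of the October 2014 Azure PowerShell preview.
# Each of the json files referenced below has to be hand-written first.
Switch-AzureMode AzureResourceManager

New-AzureDataFactory -ResourceGroupName ADFTutorial -Name MyFirstDataFactory -Location "West US"
New-AzureDataFactoryLinkedService -ResourceGroupName ADFTutorial -DataFactoryName MyFirstDataFactory -File .\StorageLinkedService.json
New-AzureDataFactoryTable -ResourceGroupName ADFTutorial -DataFactoryName MyFirstDataFactory -File .\EmpTableFromBlob.json
New-AzureDataFactoryPipeline -ResourceGroupName ADFTutorial -DataFactoryName MyFirstDataFactory -File .\ADFTutorialPipeline.json

# Nothing actually runs until you set the pipeline's active period.
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName ADFTutorial -DataFactoryName MyFirstDataFactory -Name ADFTutorialPipeline -StartDateTime 2014-10-27 -EndDateTime 2014-10-28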

I don’t know whether I’m looking at this wrong, but moving even a trivial ETL workflow into this would be a significant effort. And debugging that json… no thanks. I expected the error messages to be fairly cryptic – scripting error messages usually are – but the error messages for the trivial schema issues I introduced were actually pretty good.

The cost looks to be about a dollar a run for the above workflow (3 activities: read, copy, load), unless I’m reading it wrong (I think I am, as I’ve left the simple example running indefinitely and it’s only showing as 0.58 activities). I might execute an SSIS project 20–50 times a day during development; a few (dozen) ADF dev runs to iron out some bugs might burn through the funds a bit too quickly.

I’m not sure what the purpose of this is. It sounds like there’s some cool stuff in there regarding data monitoring and such, but not at the cost of basic ETL usability. Is it meant to be an ETL or (meta)data analysis tool for web devs, hence the reliance on json?

I can’t see myself using or recommending this in its current state for any kind of BI or data analytics ETL/ELT/ELTL work. It doesn’t do anything unique, and the lack of a development GUI for common tasks is a deal-breaker.

If you have a large volume of data in the cloud, it might be more worthwhile to spin up a dedicated VM with SSIS installed in the same affinity group as your blob storage and process the data there. Use the new Azure Automation feature to spin this VM up and down on a schedule, as sketched below. A beefy VM will set you back a dollar or two an hour, and you can do your dev locally for free*.
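Something like this runbook sketch, paired with a matching Stop-AzureVM runbook and two Automation schedules, would do it (untested; the credential asset, subscription and VM names are placeholders):

workflow Start-SsisVm
{
    # Credential asset stored in the Automation account (placeholder name)
    $cred = Get-AutomationPSCredential -Name "AutomationUser"
    Add-AzureAccount -Credential $cred
    Select-AzureSubscription -SubscriptionName "MySubscription"

    # Spin up the dedicated SSIS VM sitting next to the blob storage
    Start-AzureVM -ServiceName "my-ssis-svc" -Name "my-ssis-vm"
}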

This is a shame, and a bit of a shock really. I’ve been playing around in Azure Machine Learning, another Azure preview offering, and the webapp GUI in that is brilliant – probably less buggy than the current SSDT 2012 SSIS GUI! You can drag and drop to build experiments, train your models and publish them, all from within the GUI.

[screenshot: the Azure ML GUI]

Wednesday 1 October 2014

Generate Master Package from msdb..sysssispackages using BIML

Just a quick one – here’s some code that will generate a master package from the data in your msdb..sysssispackages table. It’s very generic and doesn’t use any custom metadata tables or objects. It’s standalone code that should execute successfully as-is.

I usually use a modular approach for loading a data warehouse: I construct one package per target object and aim to keep them as independent as possible. This allows me to re-use the same package in multiple workflows. The workflow can be controlled many ways, but usually it takes the form of a master package predominantly made up of Execute Package tasks.
It doesn’t take long for the number of packages in even a small project to add up, and manually creating the master package is not a good use of anyone’s time!

The code has been tested on 2008R2 and 2012 (package deployment model only) and assumes you’ve used SQL Server as your package store. It walks the msdb package tables (sysssispackages and sysssispackagefolders) and creates a sequence container for the specified root folder and every child folder.

Inside each sequence container, an Execute Package task is created, in parallel mode, for every package that exists in the corresponding folder of your SSIS repository. There are two variables that need to be changed: sServer, which refers to the SQL Server instance that the SSIS packages are deployed to, and sRootFolder, which refers to the SSIS repository folder holding the packages you want in your master package. All child folders are included, and any packages they contain are added.

<# String sServer = @".\sql2k12"; #>       <!-- Change this to the server holding the MSDB database -->
<# String sRootFolder = @""; #>            <!-- Change this to the root SSIS folder for which to generate the master package.
                                                Packages in this folder and all child folders will be included in the master package -->

<#@ template language="C#" hostspecific="true"#>
<#@ import namespace="System.Data" #>   
<# String sConn = String.Format("Provider=SQLNCLI10;Server={0};Initial Catalog=msdb;Integrated Security=SSPI",sServer); #>
<# String sSQL = String.Format(@"
-- recursive CTE: walk the package-store folder tree down from the specified root folder
;with cte as (
    SELECT  pf.foldername AS RootFolderName
    ,       pf.folderid AS FolderID
    ,       convert(nvarchar(4000),pf.FolderName+'\') AS FolderPath
    ,       pf.FolderName AS FolderName
    ,       cast(null as sysname) as ParentFolder
    from        msdb..sysssispackagefolders pf
    WHERE   pf.foldername = '{0}'
    union all
    SELECT  cte.RootFolderName AS RootFolderName
    ,       pf.folderid AS folderid
    ,       convert(nvarchar(4000),cte.FolderPath+pf.foldername+'\') AS FolderPath
    ,       pf.FolderName as FolderName
    ,       cte.FolderName AS ParentFolder
    from        msdb..sysssispackagefolders pf
    INNER join  cte
    on              cte.folderid = pf.parentfolderid
)
select  RootFolderName AS RootFolderName
,       pf.FolderPath AS FolderPath
,       pf.ParentFolder AS ParentFolder
,       pf.FolderName AS FolderName
,       sp.name as PackageName
,       ROW_NUMBER() OVER (PARTITION BY pf.FolderPath ORDER BY sp.Name ASC) AS StartId  -- 1 = first row for a folder (open its container)
,       ROW_NUMBER() OVER (PARTITION BY pf.FolderPath ORDER BY sp.Name DESC) AS EndId   -- 1 = last row for a folder (close its container)
from        cte pf
left join   msdb..sysssispackages sp
on              sp.folderid = pf.folderid
Order By FolderPath, PackageName
",sRootFolder); #>
<Biml xmlns="
http://schemas.varigence.com/biml.xsd">
       <Connections>
  <Connection Name="SSIS Package Store" ConnectionString="Data Source=<#=sServer#>;Initial Catalog=msdb;Provider=SQLNCLI10.1;Integrated Security=SSPI;Auto Translate=False;Application Name=SSIS Package Store;" /> 
       </Connections>
       <Packages>
             <Package Name="Master Package" ConstraintMode="Parallel" AutoCreateConfigurationsType="None">
                   <!--Tasks-->
                   <Tasks>
                        <# DataTable tblPackages = ExternalDataAccess.GetDataTable(sConn,sSQL); #>
                        <# foreach (DataRow row in tblPackages.Rows){ #>
                            <# if(row["StartId"].ToString() == "1") { #>
                                <# if(row["FolderName"]=="") { #>
                                    <Container Name="\" ConstraintMode="Parallel">
                                <# } else { #>
                                    <Container Name="<#=row["FolderPath"]#>" ConstraintMode="Parallel">
                                <#}#>
                                <# if(!row.IsNull("PackageName")) { #>
                                    <Tasks>
                                <# }
                            }    #>
                            <# if(!row.IsNull("PackageName")) { #>
                                  <ExecutePackage Name="Execute <#=row["PackageName"]#>">
                                      <SqlServer ConnectionName="SSIS Package Store" PackagePath="\<#=row["FolderPath"]#><#=row["PackageName"]#>" />
                                  </ExecutePackage>
                            <#}#>       

                            <# if(row["EndId"].ToString() == "1") { #>
                                <# if(!row.IsNull("PackageName")) { #>
                                    </Tasks>
                                <# } #>

                                </Container>
                            <#}    #>
                    <# }#>
                   </Tasks>
             </Package>
       </Packages>
</Biml>

Create a new BIML file in your solution, paste the code in, and specify the connection string to your MSDB database and the root folder (inclusive) in your SSIS package store that holds the packages you want in the master package.

It should create a package with a separate sequence container for each folder, the relevant Execute Package tasks sitting in parallel mode inside each container, and a connection to your SSIS package store.

It doesn’t nest the sequence containers to mirror the folder hierarchy – every folder gets its own flat container – but feel free to add that in. It’d need either a recursive TSQL function or a recursive .NET function hacked into the BIML code; a rough sketch of the latter follows. BIML has no .NET debugger and intellisense doesn’t work, so writing any kind of complex .NET code is just asking for trouble, I think.
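For what it’s worth, the recursive .NET route might look something like the BimlScript class nugget below. This is a completely untested sketch: the Folder type is hypothetical, and you’d have to populate it yourself from the results of the same msdb query.

<#@ import namespace="System.Collections.Generic" #>
<#+
// Untested sketch. 'Folder' is a hypothetical type built from the msdb query above.
public class Folder
{
    public string Path;
    public List<string> Packages = new List<string>();
    public List<Folder> Children = new List<Folder>();
}

// Emit a container per folder, nesting each child folder's container inside
// its parent's, with an Execute Package task per package in the folder.
public string EmitContainer(Folder f)
{
    var sb = new System.Text.StringBuilder();
    sb.AppendFormat("<Container Name=\"{0}\" ConstraintMode=\"Parallel\"><Tasks>", f.Path);
    foreach (var p in f.Packages)
    {
        sb.AppendFormat("<ExecutePackage Name=\"Execute {0}\">", p);
        sb.AppendFormat("<SqlServer ConnectionName=\"SSIS Package Store\" PackagePath=\"\\{0}{1}\" />", f.Path, p);
        sb.Append("</ExecutePackage>");
    }
    foreach (var c in f.Children)
        sb.Append(EmitContainer(c));    // recurse into child folders
    sb.Append("</Tasks></Container>");
    return sb.ToString();
}
#>

You’d then replace the row-by-row container loop with a single <#=EmitContainer(rootFolder)#> call, where rootFolder is built from the query results.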

This is the type of task I see myself using BIML for a lot more: automating the tedious, simple parts of SSIS packages and solutions. Manually creating an Execute Package task takes close to a dozen clicks and mouse drags that simply pick things off a list – exactly what scripting like this excels at. For more complex logic like precedence constraints, or automating SSIS creation wholesale… it doesn’t take long for the helper scripts to become more complex, harder to debug and take longer to maintain than the files they create.

Say I have a project I’ve deployed to a MasterDataLoad directory which I specify as my sRootFolder. The BIML code will query the following SSIS repository:
[screenshot: the MasterDataLoad folder and its child folders and packages in the SSIS repository]
and output the following package:
[screenshot: the generated master package, one sequence container per folder]
Now all I’d need to do is rearrange it, rename or remove any sequence containers I don’t need, and add any precedence constraints specific to the workflow I’m creating.