SQL Server Business Intelligence and Data Warehousing: 2015

Anyone that’s tried using SSAS as a data source (whether in an ETL tool or a data dump using TSQL openrowset and/or linked servers) might have noticed that longer queries seem to run for twice as long when compared to MDX executed directly against the cube in SSMS.

The reason for this is the database engine and SSIS need to know the ‘shape’ of the dataset. This is fine for most datasets because a relational query has its schema defined by the underlying data and a ‘get schema’ command to the data source returns the dataset metadata only.

The nature of mdx is such that the schema of the resultset could be driven by the data itself (eg NON EMPTY could result in columns being excluded) and is not known until the query is returned. So when the SSAS server gets a connection and query that’s only requesting the metadata it executes the query (to define the metadata) and returns that. A split second later the command for the data comes in and SSAS obliges and runs the query again.

Using a linked server or SSIS (with an oledb datasource pointing to an SSAS cube) to execute a query against SSAS results in the query running twice. Once to return the metadata header and again to return the data.

From the Profiler:

This is totally unnecessary for SSIS because the data flow already has the schema defined, and if the returned dataset is different then the Data Flow Task will fail during execution anyway. Setting Delay Validation = true and ValidateExternalMetadata=false makes no difference. The MDX query is still executed twice.

In most cases this is fine as queries are answered fairly quickly, and caching etc would probably help with the same query running in quick succession. But in some cases (like mine! - when the query is a very complex calculation driven leaf level query that can take minutes to hours to execute), the query taking twice as long just isn’t going to cut it.

Connecting to SSAS

There are two ways to connect to SSAS from SSIS using the built in data flow source connectors - OLEDB and ADO.NET

What’s Actually Happening?

OLEDB

The ‘Query Start’ event of SSIS executing a query against SSAS has some interesting additional properties. Comparing the two query executions (metadata and data) shows these differences:

<Content>MetaData</Content> and <ExecutionMode>Prepare</ExecutionMode> are the culprits here.

Is there any way to stop this first one from running or forcing it to behave like the second one (so that it also returns the data)? No, none that I could find :(

I tried to use the Extended Properties area of the connection in SSIS to modify these two properties but had no luck. Setting Content to SchemaData resulted in an error because data isn’t compatible with execution mode prepare. There is no doco on these two connection properties anywhere and I’m not a good enough coder to dive into the DLLs to figure out what the valid options are.

Maybe ADO.NET will work better?

ADO.NET

The slower and less functional cousin of OLEDB (no fast load, native query from variable support) also suffers from firing the same query twice. Probably because it’s simply the same OLEDB SSAS driver wrapped up in ADO.NET (whatever that means!)

It however sends slightly different properties:

Notice how the metadata query is has fewer properties. The data query adds on <Content>SchemaData</Content> and <Format>Tabular</Format>.

Could simply adding Content=SchemaData and Format=Tabular to the extended properties of the ADO.NET connection be enough?

Yes! It is!

Conclusion

Unfortunately the easier to use OLEDB data flow connection double dips when extracting data. ADO.NET is the only other option if time is a factor.

Simply add the following to your ADO.NET Connection Manager under All|Advanced|Extended Properties to stop it from querying the cube twice

Content=SchemaData;Format=Tabular

Enjoy!

Linked servers + stored procs used to be the go to ETL method in the good old days. But linked servers fell out of favour as ETL tools became more powerful, and physical data marts/warehouses became the flavour of the decade.

With the advent of in-memory analytics solutions they’ve made a bit of a comeback. The reason being, if the data is simple enough you can transform it into a star schema with views over the top of the operational data. These views are used as the source objects for loading into the in-memory analytics tool of choice (Power Pivot/Tabular cubes, Tableau, etc).

It’s a much more rapid dev/agile approach and delivers ROI within days/weeks as opposed to months (or longer!). There’s no monolithic ETL to manage, or need to acquire and manage a high performance server that only does stuff for an hour or two at night, then is only used for weekly or end of month reporting.

Database servers nowadays are fast enough to deliver data from views (you’d be doing the same transformation logic work, probably a few times over, to move the data from the OPS system to staging in your DM to merging into the final dim/facts, to loading into your cube or analytics package), and most organisations have at least one if not more data marts available already. Let the existing infrastructure do the work for you.

One scenario recently involved a number of horizontally partitioned databases. The schemas were identical, but the databases were partitioned by region. Each region had its own version of the database, some on different servers. And different functions were again separated (operational OLTP databases and customer details in the SAP databases)

I needed to transform the data into a star schema using a number views to UNION ALL the tables. But all the data needs to be visible from the same server instance, hence the linked servers.

There’s a few restrictions with the GUI tool that didn’t let me do what I wanted to and I couldn’t find much in the way of recent online details on how to do weird stuff with linked servers, such as:-

Giving a linked server to SQL Server a ‘logical’ name that differed from the physical instance name (e.g. so I could move my view code from dev to test to prod without needing to refactor my linked server queries)
Linking a server back to itself for situations where the infrastructure in dev didn’t match that in prod (e.g. all dev databases for the data that I’d be querying were on the same server, but were spread over multiple servers in production)

Here’s some snippets that’ll let you create circular linked servers (not allowed via the SSMS GUI), and also linked servers to MSSQL with a ‘logical’ name that differs from the server name (again, something not allowed via the SSMS GUI)

Linked Server with Different Logical Name

Say you want to create a linked server to instance [SQLDev\test1] and call it CustomerServer. You can put anything you want into the @datasrc parameter, including the server that you’re creating the linked server on.

Create the Linked Server

EXEC master.dbo.sp_addlinkedserver

@server = N'CustomerServer',

@srvproduct=N'MSSQL',

@provider=N'SQLNCLI',

@datasrc=N'SQLDev\test1'

Create the Credentials

Now you just need to add the credentials. Here are two examples, one that leverages the AD credentials of the user trying to use the linked server (user needs to exist on destination server and have permissions to the objects they’re trying to access), and another using an SQL service account with username customerdata and password customerdata (SQL user needs to exist on the destination server and have the required permissions to access the requested objects)

AD pass-through Authentication - needs Kerberos when double hopping

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'CustomerServer',

@useself=N'True',

@locallogin=NULL,

@rmtuser= NULL,

@rmtpassword= NULL

SQL Service Account - don’t forget to turn on mixed authentication on the target server

EXEC master.dbo.sp_addlinkedsrvlogin

@rmtsrvname=N'CustomerServer',

@useself=N'False',

@locallogin=NULL,

@rmtuser= N'customerdata',

@rmtpassword= N'customerdata'

Caution

Attempting to modify this linked server via the SSMS GUI will probably throw all kinds of weird errors. Just delete them and re-create if you need to change anything.

SQL Server Business Intelligence and Data Warehousing

Friday, 4 September 2015

SSIS - SSAS as Data Source - MDX Query Slow (Because it’s Executed Twice!)