Use Lookups the Right Way

SSIS Lookup tasks are all too easy to abuse because they are easy to set up without much thought. Most commonly, developers drag the task in, point it at a reference table, map a column or two, add a column to be returned, and click OK. While that will work, no doubt, it can cause a performance nightmare. So let's set up a scenario and then break down why you should think twice before taking the easy way out.

Scenario:

Let's use our favorite friend AdventureWorksDW and say we want to load the FactInternetSales table, which has 8 dimension keys pointing to the following dimensions:

  1. DimProduct
  2. DimDate (OrderDate)
  3. DimDate (DueDate)
  4. DimDate (ShipDate)
  5. DimCustomer
  6. DimPromotion
  7. DimCurrency
  8. DimSalesTerritory

So if we created an SSIS package to load this table using the easy way described earlier, we would have a source and 8 Lookups just to handle the dimension-key portion of the package. If all I did was create my lookups as table references, I would have a data pull that looked something like this:

TableName         | Number of Rows | Number of Times Queried | Row Size (KB) | Total Data Size (KB)
DimProduct        | 606            | 1                       | 7.79          | 4720.74
DimDate           | 2191           | 3                       | .14           | 920.22
DimCustomer       | 18484          | 1                       | 1.84          | 34010.56
DimCurrency       | 105            | 1                       | .11           | 11.55
DimPromotion      | 16             | 1                       | 2.12          | 33.92
DimSalesTerritory | 11             | 1                       | .30           | 3.3

However, we don't need all the fields from these tables; each lookup only needs the field to join on and the dimension key it returns. Each of the lookup queries should look more like this (a sample query follows the list):

  1. DimProduct (ProductKey, ProductAlternateKey)
  2. DimDate (DateKey, FullDateAlternateKey)
  3. DimCustomer (CustomerKey, CustomerAlternateKey)
  4. DimPromotion (PromotionKey, PromotionAlternateKey)
  5. DimCurrency (CurrencyKey, CurrencyAlternateKey)
  6. DimSalesTerritory (SalesTerritoryKey, SalesTerritoryAlternateKey)
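
For example, the DimProduct lookup stops being a table reference and becomes a small query like the sketch below; the other lookups follow the same pattern with their own key and alternate key columns.

-- Minimal lookup query: return only the surrogate key and the business key used to join
SELECT ProductKey, ProductAlternateKey
FROM dbo.DimProduct;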

Now the data footprint will be the following:

TableName         | Number of Rows | Number of Times Queried | Row Size (KB) | Total Data Size (KB) | Reduction in Size %
DimProduct        | 606            | 1                       | .05           | 30.3                 | 99.36%
DimDate           | 2191           | 3                       | .007          | 15.337               | 98.33%
DimCustomer       | 18484          | 1                       | .03           | 554.52               | 98.37%
DimCurrency       | 105            | 1                       | .0098         | .049                 | 99.56%
DimPromotion      | 16             | 1                       | .0078         | .1248                | 99.63%
DimSalesTerritory | 11             | 1                       | .0078         | .0858                | 97.4%

As you can see, the reduction in size is significant, and it pays off in three areas: the database engine, the network, and the SSIS engine all have to deal with less data now. So never choose a table as your lookup source; take the time to write the query, save yourself performance headaches down the road, and maybe make your DBA happy too.

SSIS Dynamic Connections Part 1

One of the common questions I see is, "How do I make a connection dynamic?" There are two scenarios where I typically see the need for dynamic connections: the first is handling different environments, and the second is programmatically changing the connection to load multiple files.

The first one is rather simple: the best way to implement different connection properties for each environment is to use configuration files (I'm not covering the SQL 2012 options in this post). Typically what I recommend is:

  1. Create Variables to store ServerName, DatabaseName, Username and Password
  2. Enable Configuration files for your package
  3. Map the Variables to the Configuration Files

First, to set up this example, I created a package called ConnDemo with one connection in the connection manager called DemoSQLConnection.

To create variables, first make sure you can see the Variables pane. You can do that by right-clicking on the background of the Control Flow tab and selecting Variables.


Alternatively, you can go to the menu bar and select View > Other Windows > Variables.


Once you can see the Variables window, let's create the variables for our connection:

  1. DemoSQLServer – Used to store the server name to connect to
  2. DemoSQLDatabase – Used to store the database name to connect to
  3. DemoSQLUserName – Used to store the User Name for the connection
  4. DemoSQLPassword – Used to store the Password for the connection

Make sure your variables are always in the correct scope and have the proper data type; for this example the scope should be at the package level and the type should be String. I have also put in some default values for the variables.


If you are using Windows authentication for your connection, you can skip variables 3 and 4. (Remember, the Windows user running the package must have access to the database.)

Now that I have created the variables, I can enable package configurations and map the variables. To do this, right-click on the background and select Package Configurations, or go to the menu and select SSIS > Package Configurations.



Once selected, a window should open; check the box for Enable package configurations and click Add.


For this example I am going to add an XML configuration file, name it democonfig, and place it on the C: drive.


Click Next and then add your variables to the configuration file. You might wonder why I use variables rather than just putting the connection string in the config; I do this for two reasons. First, sometimes you create different connection types to the same server, such as an OLE DB connection and a SQL Server destination, and it's easier to map the variables into both connection strings and define the values once than it is to modify both connection strings. Second, if the password is in a variable you can encrypt that variable (I won't be discussing how to do that in this post).


Click Next, name the configuration, and click Finish.


The next step is to map the variables into the connection string. I prefer to do this with a variable. Create a variable called DemoSQLConnectionString, and in the Properties window change EvaluateAsExpression to True and click the icon in the Expression property.


Set the Expression to

"Data Source=" + @[User::DemoSQLServer] + ";User ID=" + @[User::DemoSQLUserName] + ";Provider=SQLNCLI10.1;Initial Catalog=" + @[User::DemoSQLDatabase] + ";Password=" + @[User::DemoSQLPassword]

And then press Evaluate Expression to test it.


One last step: set the connection string of the connection to the variable. To do this, select the connection in the connection manager, edit the Expressions property, choose the ConnectionString property, and set it to the @[User::DemoSQLConnectionString] variable.


Now you can test this by creating a Data Flow that uses this connection. If you want to make sure it's all working correctly, change the password or server in the configuration file to a bad value and the connection should fail.

As for the second way, programmatically setting connections at run time, that will be Part 2, as this post has already gotten a little long.


SSIS Dynamic Connections Part 2

SSIS Dynamic Connections Part 1

If you didn't see Part 1, I suggest you click the link above, as I will be continuing from there.

In Part 2 we are going to show how to create a connection and then change its properties at run time to process multiple files. To set this up, I created a table called FilesToProcess in a database called ConnDemo.

CREATE TABLE FilesToProcess
(
    FileName nvarchar(50),
    FileLocation nvarchar(150)
)

I then populated it with the name and location of each file to process (the sample insert below shows the kind of rows I mean).
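
For illustration only: the file names and paths below are hypothetical placeholders, not the actual values from my table.

INSERT INTO FilesToProcess (FileName, FileLocation)
VALUES ('Customer1.txt', 'C:\ImportFiles\'),
       ('Customer2.txt', 'C:\ImportFiles\'),
       ('Customer3.txt', 'C:\ImportFiles\');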

I also created a table called StgImportFiles to import the files into.

CREATE TABLE StgImportFiles
(
    Col1 NVARCHAR(50),
    Col2 NVARCHAR(50),
    Col3 NVARCHAR(50)
)

What we are going to do is query the FilesToProcess table to get all the files to process, then loop over them and insert each file into the same staging table. Remember, for this to work all the files have to have the same format.

I am going to expand on the package from Part 1, changing the database name in the configuration file to ConnDemo, and then I will do the following:

  1. Add variables to store the file location and name, as well as the result set of the query.
  2. Add an Execute SQL Task to get the values from the FilesToProcess table.
  3. Add a Foreach Loop Container to process the files.
  4. Add a Data Flow task to import the data.

The variables I am going to add are

  1. FileLocation – stores the location of the file
  2. FileName – stores the name of the file
  3. FileListResults – stores the result set of the query; note that this variable must have a data type of Object

Now I add an Execute SQL Task, name it "Get Files To Process", set the connection to DemoSQLConnection, set the ResultSet property to Full result set, and add the following SQL statement.

SELECT FileLocation, FileName
FROM FilesToProcess

Then add a result set and map it to the FileListResults variable. Your Execute SQL Task should look like the screenshot below. You must set the Result Name to 0, as that is the position of the result set.

Now we add the Foreach Loop Container, set the collection enumerator to "Foreach ADO Enumerator", and map it to the FileListResults variable. Then add the variable mappings for FileLocation and FileName; it should look like the screenshots below.

Next we add a Data Flow inside the Foreach Loop and create a file connection manager called ImportFiles. We initially point it to one of the files we want to import so that we can properly set the file format.

Now that we have the source file defined, we can set the ConnectionString expression of the file connection to @[User::FileLocation] + @[User::FileName]. This is what changes the path of the file as each row returned by the query goes through the Foreach Loop.

The last step is to add to the Data Flow a source pointing to the ImportFiles connection manager and a destination pointing to the DemoSQLConnection with the StgImportFiles table selected. Your package should now look like the screenshot below.

If you run the package, the data from the files should now be in the StgImportFiles table.

This process can be very useful when you receive multiple files in the same format and don't want to write a separate package for each file. I have used this approach where we needed to import files for over 200 customers; the layout was a standard one we had created, and a single package imported all 200 files.

What should an SSIS Framework have?

Recently I was talking with a customer about what an ETL framework is and what a good one should have. I had to pause for a second and really think about this. Over time I have created, or been involved in creating, frameworks for several projects, and as with any code base you keep tweaking them, but here is my initial list of what a good framework should have.

  1. Flexible Execution Order
  2. Ability to restart, either from the beginning or from the last failure point
  3. Logging of the following
    1. Row Counts
    2. Variables
    3. Errors
    4. Duration
  4. Easy to implement
  5. Easy to maintain
  6. Ability to send alerts

So what does each of these things actually mean? Let me give a brief explanation of each one. A flexible execution order means that the order in which packages are executed should not be hardcoded or rigid; you should be able to change it with nothing more than a configuration setting (a purely illustrative sketch of that idea follows).
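
One simple way to picture config-driven execution order (purely illustrative, with placeholder names, and not the framework design I'll cover in the next post) is a control table that a master package reads to decide which packages run and in what sequence:

-- Hypothetical control table; change the order or disable a package by updating a row
CREATE TABLE dbo.PackageExecutionControl
(
    PackageName    NVARCHAR(255) NOT NULL,
    ExecutionOrder INT           NOT NULL,
    IsEnabled      BIT           NOT NULL DEFAULT 1
)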

The ability to restart the ETL process is quite important. For example, if an error occurs while loading, say, a fact table, would you want to rerun the whole ETL? You shouldn't have to go through the dimension table loads again when they loaded without issue. Another example: what if you were loading three fact tables simultaneously and only one of them hit an error? There would be no need to reload the other two fact tables, only the one that failed.

Logging, to me, is self-explanatory, as you always want insight into what is going on in your ETL process. Row counts help when there is a performance issue: you can see whether data volume increased, and a sudden drop in row counts can reveal an issue with the source data. Knowing variable values helps when you're trying to debug an issue and replicate what was happening in production at that time. Errors, well, you always need the details of your errors to debug properly. Finally, duration helps you track performance over time and create the performance reports that management always likes to see. (A sketch of what such logging might capture follows.)
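
As a rough illustration of the kind of logging I mean (again, just a sketch with placeholder names, not the design from the upcoming post), a single table can capture duration, row counts, variable values, and errors per package execution:

-- Hypothetical logging table; one row per package execution
CREATE TABLE dbo.PackageExecutionLog
(
    LogID          INT IDENTITY(1,1) PRIMARY KEY,
    PackageName    NVARCHAR(255) NOT NULL,
    StartTime      DATETIME      NOT NULL,
    EndTime        DATETIME      NULL,           -- duration = EndTime minus StartTime
    RowsExtracted  INT           NULL,
    RowsLoaded     INT           NULL,
    VariableValues NVARCHAR(MAX) NULL,           -- snapshot of key variable values
    ErrorMessage   NVARCHAR(MAX) NULL
)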

Being easy to implement and maintain is vital to ensuring that the framework has a long shelf life and that shortcuts aren't taken to avoid using it. We all know developers are always looking for an excuse to cut corners and do things their own way.

The ability to send alerts is important because we all know we love getting woken up at 2 AM when the ETL has encountered an error. More importantly, if there is an issue, the last thing you want is to be the last to know and walk into work in the morning to someone asking you, "Why isn't the data loaded?"

So the next question is: how do you design a framework to handle all of this? Well, that will be my next post! Can't give away everything at once; I have to make you want to come back.

Business Intelligence and Financial Services

A frequent question for enterprise executives is: why do I need BI?  Occasionally, I have to put elaborate presentations together showing what BI is and how it fits each company. However, when it comes to financial services, and more specifically the banking industry, there are only two lines that are really needed:

  1. Regulatory Reporting    
  2. Risk Management    

Anyone familiar with either of these and the concepts of BI might just stop reading right now.  However, since that would make for a short blog, let me continue to explain why I think it’s so simple.

Let's start with the basics of BI which, to me, are Reporting (Operational and Ad-hoc), Analytics, Dashboards, Discovery, and then Learn and Repeat. Each is a separate step, and together they should be treated as an iterative approach.

Business Intelligence and Financial Service Illustration

Regulatory reporting requirements from the FDIC, NCUA, SEC, SOX, HIPAA, etc. all require various types of reporting. Specifically with the FDIC, NCUA and SEC, there are monthly, quarterly and yearly reporting requirements ranging across all areas, from deposits, securities, safe deposit and CDs to secured and unsecured loans. All these reports could be automated and easily generated through the operational reporting piece of BI, not to mention all the other reporting a financial organization needs to handle. Ad-hoc reporting is great for spontaneous or last-minute reports. Maybe someone asks about the status of delinquent loans 20 minutes before a meeting. Maybe you're working on 13-month rolling audits and the auditors ask a question they need answered by end of business. You can easily use Report Builder or Power View in SQL 2012 to produce what they need.

Risk Management encompasses Analytics, Dashboards, and Discovery. It’s important to learn from that discovery, and monitor what you’ve learned. The best way to do this is to go back to the basics and leverage operational reporting.

Let's take a look at this scenario: you have noticed that delinquencies on your auto loans have increased by 50% over the past six months. First, you open Power View in SQL 2012, connected to a loan performance Analysis Services cube. You start to analyze your loans and find that 90% of the increase in delinquencies comes from borrowers who had FICO scores of 650 and below, had held the loan for one year, and had a payment-to-income (PTI) ratio greater than 20%. Now you have analyzed and discovered the issue. You decide to change your lending practices to charge higher interest rates for borrowers in this category, which in turn reduces your risk of loss on the loan. (A rough relational sketch of this kind of slice is below.)
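
The analysis above happens in Power View against the cube, but as a purely illustrative sketch, the same slice could be expressed against a relational loan table. The table and column names below (FactLoanPerformance, FICOScore, PaymentToIncome, IsDelinquent, LoanType) are hypothetical.

-- Group auto loans by FICO band and PTI band to see where delinquencies concentrate
SELECT
    CASE WHEN FICOScore <= 650 THEN 'FICO <= 650' ELSE 'FICO > 650' END AS FICOBand,
    CASE WHEN PaymentToIncome > 0.20 THEN 'PTI > 20%' ELSE 'PTI <= 20%' END AS PTIBand,
    COUNT(*) AS LoanCount,
    SUM(CASE WHEN IsDelinquent = 1 THEN 1 ELSE 0 END) AS DelinquentLoans
FROM dbo.FactLoanPerformance
WHERE LoanType = 'Auto'
GROUP BY
    CASE WHEN FICOScore <= 650 THEN 'FICO <= 650' ELSE 'FICO > 650' END,
    CASE WHEN PaymentToIncome > 0.20 THEN 'PTI > 20%' ELSE 'PTI <= 20%' END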

So are we done now?  Of course not.

We need to see if this change in lending practices actually made a difference. There are a couple of ways we can do this. First, we could create a report that monitors loans generated after the decision date; since we learned borrowers only started becoming delinquent after a year, it should take about 12 months to see the results. Next, we could create a dashboard that monitors loans from both before and after the decision and put that dashboard into the daily view of the CFO and the head of lending. We could even create a KPI with a delinquency goal and monitor against it. All of these could easily be created within SharePoint 2010 using Reporting Services and/or PerformancePoint.

One last quick note: the FDIC requirement for stress testing could also be handled easily with an Analysis Services cube with writeback enabled for what-if scenarios.

Rather than ask how business intelligence can help your organization, take some time to think about the increased power you could have in backing your decisions with the help of BI. The more insight into each decision, the better your company will be positioned for the future. Business intelligence and financial services go hand-in-hand.

PASS Summit 2012

So another PASS Summit has come and gone. I had a great time; not only was it technically informative, I also got to meet some great people. My company had a great presence as well, with 7 attendees this year and a speaker presenting two sessions, so congratulations go out to Steve Hughes, www.dataonwheels.com, for his efforts. For those of you who could not attend, I challenge you to do all you can to attend next year in Charlotte, NC. Here are some of the highlights of this year's conference.

Wednesday morning started off with the first keynote, delivered by Ted Kummert, where several important announcements were made:

  1. Project "Hekaton" was announced (more on this below). Oracle has already made some responses: Oracle Response
  2. SQL 2012 SP1 was launched
  3. PASS Business Analytics Conference, April 10 – 13 in Chicago. http://www.passbaconference.com/
  4. Columnstore Indexes to be updatable and clustered
  5. Next version of SQL PDW will be available in 1st quarter of 2013
  6. PolyBase, a new tool for allowing traditional T-SQL statements to be executed against Hadoop, will be released by June 2013
  7. Power View is native to Excel 2013 and includes the ability to execute DAX queries against traditional MOLAP cubes.

While all these updates are individually important, the most important ones to me are Project "Hekaton" and Power View being native to Excel 2013. Hekaton is the next major enhancement to in-memory technology in the SQL stack, bringing OLTP into the in-memory space. During the Hekaton demo, several examples were shown. The first showed the performance increase from just putting the table into memory, which achieved roughly a 13x improvement. They followed that up by putting both the table and the stored procedure in memory, achieving around a 60x improvement. While this was a demo, and real-world scenarios are always the truer test, it was still impressive to see this advancement and what it means for the future of SQL Server in competing with other database platforms.

Now, as for Power View being native to Excel 2013: this is one that really changes how businesses can bring analytical reporting to their end users. The requirement for SharePoint in order to use Power View is gone, which in my opinion was a big roadblock to its adoption. With it now native to Excel, you have a robust reporting engine inside a tool users are already familiar with; combine that with support for traditional multidimensional cubes and you have a true reporting tool that I think businesses will adopt very quickly.