Recently I was talking with a customer and we were discussing what an ETL Framework was and what should a good one have. I had to pause for a second and really think about this. Overtime I have been either created or been involved with creating Frameworks for several projects and as with any code base you keep tweaking it over time but here is my initial list of what a good Framework should have.
- Flexible Execution Order
- Ability to Restart Either from the beginning or last failure point
Logging of the following
- Row Counts
- Easy to implement
- Easy to maintain
- Ability to send alerts
So what do each of these things actually mean? Well let me give a brief explanation of each one. A flexible execution order means that the order packages are executed should not be hardcoded or rigid but rather be able, with just changing a configuration setting, to change.
The ability to restart the ETL process is one that is quite important. For example if an error occurs in loading say a Fact Table would you want to have to re run the whole ETL? You shouldn’t have to go through the dimension table loads again when there was no issue loading them. Another example is what if you were loading 3 fact tables simultaneously and only one of the fact tables got an error? There would be no need to reload the other two fact tables only the one.
Logging to me is self-explanatory as you always want insight into what is going on in your ETL process. Row counts help when there is a performance issue and you can see if there was an increase in data volume size or sudden drop in row counts can show an issue with the source data. Knowing variable values help when you’re trying to debug and issue and replicate what was happening in production at that time. Errors well you always need to know that details of your errors to properly debug. Finally, duration helps you track performance over time and create performance reports that management always likes to see.
Easy to implement and maintain are vital to ensuring that the Framework will have a long shelf life and shortcuts won’t be taken to avoid using it. We all know developers are always looking for an excuse to cut corners and do things their own way.
Ability to send alerts is important because we all know we love getting woken up at 2 AM when the ETL has encountered an error. More importantly if there is an issue last thing you want is to be the last to know and walk into work in the morning with someone asking you, “why isn’t the data loaded?”
So the next question is how do you design a framework to handle all of this? Well that will be my next post! Can’t give away everything at once have to make you want to come back.