Posted by Ray Rashid
March 18, 2016
Have you embarked on a journey to create the most insightful dashboard of all time just to realize you have disparate data?
Whether you are a new analyst entering the work force or a seasoned business intelligence specialist, chances are you have or will encounter the need to combine data from different sources to properly answer your business questions. The reality is visualization software offers limited capabilities in data blending.
In the late 80’s Barry Devlin coined the term, “Business Data Warehouse”. This centralized repository would store data for all interested business information systems. However, unlike a traditional warehouse, data is migrated from the production environment to this analytical destination. Similar to a traditional warehouse this centralized location would house information from one or more production systems. The methodology of moving data from a production environment to an offline storage is known as ETL; Extract, Transform, Load. Each letter in the acronym stands for an individual phase of the process.
Debatably the most important part of ETL, this initial phase begins with extracting data from the source systems and ends when the required inputs are centralized in a common environment.
In the second phase you employ business rules, use functions, and cleanse your data.
Lastly you move data out of the staging environment and into a data warehouse for reporting and analytics.
The data contained in a data warehouse is not meant to be a replica of the production database. In fact, where your production database may be well structured and normalized, it is typical and usually recommended that your data warehouse be the opposite. De-normalization allows your data warehouse to handle complex queries better than a production database by reducing the number of complex joins and number of aggregations.
If you are reading this article, you are probably well aware of the numerous ETL tools and countless tutorials. Although ETL tools may have strategic applications, just about any tool and tutorial combination can get you started. This data migration methodology, once started as a formal IT process, has grown to become a very loose term among business professionals. Many set off to design and implement their first data warehouse project. What seems like low hanging fruit, can quickly turn into a quagmire.
Now that we have cleared up common ambiguity with ETL and Data Warehouses, we will try to alleviate some of your current or future struggles by providing “4 Winning Strategies” to solve common data preparation challenges when preparing data for dashboards.
I am by no means someone who likes to spend countless hours composing documents. I’d rather jump in and get started. However, if we do not calm our urges and do the necessary preparation, we set ourselves up for difficulties, if not failure, in the near future. At the minimum, you should have two supporting documents.
This document will contain your project scope, your systems of record (SOR) and deliverables. This may be the single most import document you create. Clients, stakeholders and the like are notorious for always wanting more, and changing their minds. Small changes can and should be encouraged. However, what may seem a minor change could have a heavy impact. The BRD should detail the unequivocal expectations between both parties.
This diagram is pivotal in the initial stages of any ETL project. It will demonstrate which data source inputs relate to one another and the method used to create the relationships. Further, it will also show the types of mappings, whether it be through direct, surrogates or fuzzy matches.
This visual representation will demonstrate how the source systems relate to each other, cardinality between tables and contained columns within each entity. This entity relationship diagram (ERD) will allow you to reference your proposed data model and put it into practice.
Any time you join data from different systems, databases or even schemas you will likely run into an issue where there is no direct key mapping. Fortunately, we can still create a relationship between these disparate sources. Fuzzy Matching, although not with absolute certainty, can create a record linkage between two entities. How does it work? You choose common columns that have high cardinality between two or more entities.
Next you choose an algorithm based on the kind of data you have. Lastly you choose a threshold limit. Remember, we said fuzzy matching is not 100% although it can be quite precise. If your mean match percentage is not acceptable, think about using another algorithm, or a conjunction of many algorithms and score which model(s) works the best. Although it is unlikely to get an exact match between your data systems, fuzzy matching can provide a solution with a high degree of certainty.
Inefficient processing can turn your hard work into a disappointment very quickly. No one likes to wait, and we should do all we can to eliminate any inefficiencies. There are two ways to increase script throughput, and in most cases they require minor additional effort: parallel and serial processing. If two or more processes have no dependencies on one another, it is senseless to have them execute one after the other (i.e., serially). This multiplies the time it would take if they were run in parallel.
Serial processing comes with a fairly heavy performance impact. Processing in parallel can considerably reduce execution times, but make sure you’ve identified any dependencies between scripts so your scripts don’t trample each other.
If you are parallel processing and still faced with performance issues, you should turn your attention to differential processing. Should we drop an entire table for the insertion of a single new record, or how about an update to a single record? Would it not make sense to only process the difference between the parent and child entities? It does!
Let’s take a look at an example. If Table A takes 10 minutes to load 10 records, each individual record takes one minute. If we are given one net new record to add, we can either drop Table A and reload with now 11 records, or insert the one net new record to the already existing Table A. If we take our premise that one record takes one minute to load, the drop and recreate method will take us 11 minutes to load, whereas our insertion method will only take one minute. Although the example ratio I’ve provided is embellished, the underlying premise is true. There are significant time saving advantages with the use of differential processing.
Although there are many more common data preparation/ETL challenges, I’ve identified strategies to solve the 4 biggest ETL challenges based on our experiences at Altair BI. If you want to create the most insightful dashboard of all time, follow the Top 7 Data Visualization Best Practices, and if you run into difficulty with your data warehouse project, Altair BI can help.