Tuesday, 22 August 2017

Type 1 SCD for Hadoop (Free Script)

What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCD) - Dimensions that change slowly over time, rather than changing frequently or on regular schedule, time-basis. In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimension's attribute value for given date. Example of such dimensions are: Customer, Address, Employee.

Below is a list of the Slowly Changing Dimensions in use today:

  • Type 0 - The passive method
  • Type 1 - Overwrites the old value, no changes are tracked
  • Type 2 - Creates a new additional record to track changes
  • Type 3 - Adds a new column to track changes
  • Type 4 - Uses a separate historical table to track changes
  • Type 6 - Combines Type 1,2,3 approaches (1+2+3=6)
In this article we will discuss how to ingest data into Hadoop Big Data environment using the Type 1 Slowly Changing Dimension approach. At the end of this article you will also find a link to download the free script (a complete framework) that implements this.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In this example, in the Customer Dimension, we have the following record:

Customer Key Name State
5001 Peter Washington

After Peter moved from Washington to Maine, the new information replaces the original. In this case, Washington is replaced with Maine, and after the data ingestion Customer Dimension looks like this:

Customer Key Name State
5001 Peter Maine

Advantages:

This approach is easy to implement, since there is no need to keep track of the old information.

Disadvantages:

All history is lost. By applying this approach, it is not possible to trace back in history. For example, in this case, you do not have the ability to know that Peter lived in Washington before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary to keep track of historical changes in your Data Warehouse.

How to Ingest Data using SCD Type 1 Framework

We at Proden Technologies have built a series of Scripts to ingest several data patterns such as slowly changing dimension Type 1, Type 2 into Hadoop Big Data environment. Follow the instructions below to download the SCD Type 1 Data Ingestion Framework and load data:
  1. Download the Data Ingestion Framework for SCD Type 1.
  2. Follow the instructions to copy the scripts to your Hadoop Big Data environment.
  3. Change Directory to the location where you have copied the scripts
  4. Use the sample command line below to ingest data into Hadoop.

Note

Ensure Sqoop is installed and configured correctly. 
Use bash shell to execute the shell scripts in this Framework.

Command

./Type1.sh Source_SQL="Source Data as SQL statement" Target_Table="Table where data should be loaded" Target_Columns="Target Columns to load data" Mapped_Columns="Source to Target Column Map" Data_Base_Name="Hadoop Database Name to load data" SQL_File_Location="SQL File Location" Log_File="Log File Name to log the script execution details"

No Key Name Description
1 Target_Table Table where data should be loaded
2 Join_Columns It joins the source and target columns.
3 Mapped_Columns source and target columns are mapped to load the data.
4 Source_SQL Enter Source SQL query, which has new and updated records.
5 Target_Columns It loads the comma separated list in target column.
6 Data_Base_Name It names the Hadoop database where the target table is located.
7 SQL_File_Location Enter the path from where the Type_1.sql file is downloaded.
8 Log_File It shows the script execution details.

Example

./Type_1.sh Target_Table="FRAUD_DEPT_2" Source_SQL="Select * From dept_1" Mapped_Columns="DEPARTMENT_ID,DEPARTMENT_ID,DEPARTMENT_NAME,DEPARTMENT_NAME,MANAGER_ID,MANAGER_ID,LOCATION_ID,LOCATION_ID,PHONE_NUMBER,PHONE_NUMBER" Join_Columns="DEPARTMENT_ID,DEPARTMENT_ID" Target_Columns="DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID,PHONE_NUMBER,DEPT_PRIOR" Data_Base_Name="default" SQL_File_Location="/home/cloudera/Desktop/Type_1" Log_File="/home/cloudera/Desktop/Type_1/Log_File.txt"

Disclaimer

Please ensure you read and understand the following general disclaimer:

IMPORTANT:

THIS SOFTWARE END USER LICENSE AGREEMENT (“EULA”) IS A LEGAL AGREEMENT BETWEEN YOU AND PRODEN TECHNOLOGIES, INC. READ IT CAREFULLY BEFORE COMPLETING THE INSTALLATION PROCESS AND USING THE SOFTWARE. IT PROVIDES A LICENSE TO USE THE SOFTWARE AND CONTAINS WARRANTY INFORMATION AND LIABILITY DISCLAIMERS. BY INSTALLING AND USING THE SOFTWARE, YOU ARE CONFIRMING YOUR ACCEPTANCE OF THE SOFTWARE AND AGREEING TO BECOME BOUND BY THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE TO BE BOUND BY THESE TERMS, THEN SELECT THE "CANCEL" BUTTON. DO NOT PROCEED TO REGISTER & INSTALL THE SOFTWARE. LIABILITY DISCLAIMER•THE accel<>DS PROGRAM IS DISTRIBUTED "AS IS". NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE IT AT YOUR OWN RISK. NEITHER THE AUTHORS NOR PRODEN TECHNOLOGIES, INC. WILL BE LIABLE FOR DATA LOSS, DAMAGES AND LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING OR MISUSING THIS SOFTWARE.

RESTRICTIONS:

You may not use, copy, emulate, clone, rent, lease, sell, modify, decompile, disassemble, otherwise reverse engineer, or transfer any version of the Software, or any subset of it, except as provided for in this agreement. Any such unauthorized use shall result in immediate and automatic termination of this license and may result in criminal and/or civil prosecution.

TERMS:

This license is effective until terminated. You may terminate it by destroying the program, the documentation and copies thereof. This license will also terminate if you fail to comply with any terms or conditions of this agreement. You agree upon such termination to destroy all copies of the program and of the documentation, or return them to the author.
Download Data Ingestion Framework (Free)
SCD Type 1 Data Ingestion Scripts for Hadoop

No comments:

Post a Comment