Data Warehousing Assignment #2 ETL
Fall 2020
By: Fullwin Liang
Building an ETL Transformation to load and manage the Customer data as a Type 2 SCD
Successfully running the Type 2 SCD transformation.
Using the Database Explorer features to explore the data
Database Explorer Example (Query time >= 1 hour, using Oracle Cloud Connection)
SQLDeveloper Example (Query time < 3 seconds)
Testing the Slowly Changing Dimension Management
Successfully running the Type 2 SCD transformation.
Viewing my data in SQLDeveloper to make sure there are two versions for customer “Elia” by filtering the data.
The query below confirms that the type 2 SCD is active and working.
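A minimal sketch of that check, assuming the Pentaho default technical fields (version, date_from, date_to) and a cust_first_name column, both of which may differ from the actual customer_dim column names:

-- Both versions of the customer should come back, with non-overlapping date ranges.
-- cust_first_name is assumed; substitute the actual name column in customer_dim.
select customer_dim_id, cust_first_name, version, date_from, date_to
from customer_dim
where cust_first_name = 'Elia'
order by version;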
Exercise 1 Importing and Transforming Sales Representatives as Type 2 SCD
Successfully ran the Type 2 SCD transformation. For this exercise, I first created a CSV file
input step for the Sales Representative CSV file and then added a Dimension lookup/update step that
inserts a new row whenever the incoming data changes, which is Type 2 SCD behavior. When I queried the data in
SQLDeveloper, the new columns such as version and salesrep_dim_id were populated successfully.
Viewing my data in SQLDeveloper to make sure the data is transformed properly and successfully loaded in.
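A query of roughly this shape shows the surrogate key and technical columns added by the Dimension lookup/update step (version is the Pentaho default technical field name and is an assumption; the other columns appear in the combined view later in this document):

select salesrep_dim_id, salesrep_id, first_name, last_name, version
from salesrep_dim
order by salesrep_id, version;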
Implementing Type 1 Slowly Changing Dimension Update in Place
Successfully running the Type 1 SCD transformation.
After updating the Products CSV file and running the transformation again, I used SQLDeveloper to query
product_id = 2302 to check that the Type 1 SCD is active and working. As we can see, the product status has
been updated to “obsolete” while the row remains at version 1 and no additional rows were inserted. This shows
that the transformation is working.
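The check was roughly the following; version is assumed to be the default Pentaho technical field name:

-- Type 1 SCD updates in place, so a single row at version 1 should come back
-- with the new 'obsolete' status.
select products_dim_id, product_id, product_status, version
from products_dim
where product_id = 2302;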
Exercise 2 Importing and Transforming Promotions as Type 1 SCD
For this exercise, I followed the same steps as in Exercise 1, but instead of “insert” I set the update type to
“punch through” to implement the Type 1 SCD, which updates the existing row in place instead of inserting a new row.
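A similar spot check against the promotion dimension looks like this (version is again assumed to be the default technical field; the other column names are taken from the combined view below):

-- After a punch-through update there should still be only one row per promotion,
-- carrying the new values.
select promotion_dim_id, promo_id, promo_name, version
from promotion_dim
order by promo_id;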
Generating a Date Dimension table
Exercise 3 Embellishing the Date Dimension with Additional Date information
For this exercise, I embellished the date dimension by adding Sales Day of Week, Sales Day of Week
Name, and Sales Day of the Month. I followed the format shown in the tutorial and added the new fields
to the Calculator step of the transformation.
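The new fields can be spot checked with a quick sample query against the date dimension (column names as they appear in the combined view below):

select sales_date, sales_day_of_week, sales_day_of_week_name, sales_day_of_the_month
from date_dim
where rownum <= 10;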
Exercise 4 Embellishing the Date Dimension with Holiday information
For this exercise, I imported the Microsoft Excel holiday file and added a Stream lookup step.
I then updated the Dimension lookup/update with the holiday fields and matched the Sales_Date
field against the HOLIDAY_DATE field in the Stream lookup.
I queried the date dimension, filtering for the dates that are holidays, to ensure that the embellished
date dimension is updated and active.
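The filter was along these lines; rather than assuming what value the IS_A_HOLIDAY flag stores, this sketch filters on a non-empty HOLIDAY_DESCRIPTION:

select sales_date, is_a_holiday, holiday_description
from date_dim
where holiday_description is not null
order by sales_date;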
Developing the Dimensional Lookup transformations for loading Fact Tables
For this part, I imported the Orders CSV file and created five dimension lookups, one for each dimension,
with the update option disabled so that they only return the surrogate keys, along with the calculation of
the total sales. With this we have our fact table that connects all our dimensions together.
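A quick sanity check that every fact row resolved its dimension keys and carries the calculated total can be run afterwards; this is only a verification sketch, not part of the transformation itself:

-- count(column) counts non-null values, so each with_ column should equal fact_rows.
select count(*) as fact_rows,
       count(customer_dim_id) as with_customer,
       count(products_dim_id) as with_product,
       count(promotion_dim_id) as with_promotion,
       count(salesrep_dim_id) as with_salesrep,
       count(date_dim_id) as with_date,
       sum(dollars_sold) as total_dollars_sold
from sales_fact;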
Exercise 5 Importing new Sales Data
For this exercise, I changed the CSV input to the new “orders_export_new.csv” and ran the
transformation to import the 18 new records into the sales fact table.
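A simple way to verify the load is to compare the fact table's row count before and after running the transformation; the difference should be the 18 new records:

select count(*) as fact_rows
from sales_fact;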
SQL Code (Combined view with no duplicated columns)
create view dataset as
select
c.*,
d.SALES_MONTH_NAME,
d.SALES_DAY_OF_WEEK_NAME,
d.HOLIDAY_DESCRIPTION,
d.IS_A_HOLIDAY,
d.SALES_DATE,
d.SALES_DAY_OF_YEAR,
d.SALES_MONTH,
d.SALES_YEAR,
d.SALES_QUARTER,
d.SALES_DAY_OF_WEEK,
d.SALES_DAY_OF_THE_MONTH,
s.ORDER_ID,
s.SALESREP_DIM_ID,
s.PRODUCTS_DIM_ID,
s.PROMOTION_DIM_ID,
s.DATE_DIM_ID,
s.DOLLARS_SOLD,
s.AMOUNT_SOLD,
p.PRODUCT_NAME,
p.LANGUAGE_ID,
p.PRODUCT_STATUS,
p.WARRANTY_PERIOD,
p.PRODUCT_DESCRIPTION,
p.CATALOG_URL,
p.SUB_CATEGORY_NAME,
p.SUB_CATEGORY_DESCRIPTION,
p.CATEGORY_NAME,
p.PRODUCT_ID,
p.MIN_PRICE,
p.LIST_PRICE,
p.SUPPLIER_ID,
p.WEIGHT_CLASS,
p.CATEGORY_ID,
p.PARENT_CATEGORY_ID,
pr.PROMO_NAME,
pr.PROMOTION_ID,
pr.PROMO_ID,
sr.FIRST_NAME,
sr.LAST_NAME,
sr.EMAIL,
sr.PHONE_NUMBER,
sr.JOB_ID,
sr.COMMISSION_PCT,
sr.SALESREP_ID,
sr.EMPLOYEE_ID,
sr.HIRE_DATE,
sr.SALARY,
sr.MANAGER_ID,
sr.DEPARTMENT_ID
from sales_fact s
INNER JOIN customer_dim c on s.customer_dim_id = c.customer_dim_id
INNER JOIN date_dim d on s.date_dim_id = d.date_dim_id
INNER JOIN products_dim p on s.products_dim_id = p.products_dim_id
INNER JOIN promotion_dim pr on s.promotion_dim_id = pr.promotion_dim_id
INNER JOIN salesrep_dim sr on s.salesrep_dim_id = sr.salesrep_dim_id;
Data Cleaning with Python
The dataset I exported using SQL seemed to have a lot of rows, so I wanted to do some
data profiling and cleaning. I imported the data into a Jupyter Notebook using the pandas library
in Python, checked for duplicated rows, and found that the dataset had 402 duplicate rows. I
dropped these duplicate rows and then exported the data as a new CSV file to later
import into Tableau for analysis.
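For reference, the same duplicate count could also be confirmed in SQLDeveloper before exporting by comparing the view's total row count to its distinct row count; this SQL sketch is only a complementary check, not the pandas code used in the notebook:

-- total_rows minus distinct_rows gives the number of duplicate rows in the view.
select
  (select count(*) from dataset) as total_rows,
  (select count(*) from (select distinct * from dataset)) as distinct_rows
from dual;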
Tableau
For my Tableau KPI analysis, I created two visualizations and combined them into one
dashboard. The first KPI is total sales broken down by product category and state. With this we can
identify which category has the highest revenue and which state or province generates the most
revenue within each product category. For example, this visualization shows that the
hardware category has the highest revenue, and that Minnesota has the lowest sales within
each category.
For the second graph, I created a time-series line graph visualization. This graph
shows the total sales, credit limit, and amount sold over time. The line itself represents total
sales, and its thickness indicates the credit limit over time. We can see trends such as the credit
limit generally increasing as total sales increase or start to increase. I also
added a bar graph to this chart using the dual-axis feature in Tableau. Here we can see
patterns such as the quantity (amount sold) rising when total sales rise. We can conclude
that these three variables are positively correlated, since they tend to increase or
decrease together.
Conclusion
This assignment took me approximately 21 to 24 hours to complete, from reading the
homework outline and downloading the software to the actual work itself. I was able to finish it
by starting early and putting in around 2-3 hours every day. The most
difficult part of the assignment for me was the ETL process, because it was
something completely new to me. At first, I had trouble connecting my Oracle ATP database to
Pentaho, but after much trial and error, I now know how to connect to it effectively and
efficiently. I also made many small mistakes during my ETL transformations; however, after
multiple revisions and debugging sessions, I feel that I have gotten more used to how the transformation
process should be implemented. I was comfortable with exporting the data with SQL, cleaning
the data with Python, and analyzing it with Tableau. Overall, I feel that this homework assignment
was a very helpful reference for the ETL part of the group project, and I gained a lot of knowledge that
I can apply to my group project.