Data Warehousing Assignment #2 ETL
Fall 2020
By: Fullwin Liang
Building an ETL Transformation to load and manage the Customer data as a Type 2 SCD
Successfully running the Type 2 SCD transformation.
Using the Database Explorer features to explore the data
Database Explorer Example (Query time >= 1 hour, using Oracle Cloud Connection)
SQLDeveloper Example (Query time < 3 seconds)
Testing the Slowly Changing Dimension Management
Successfully running the Type 2 SCD transformation.
Viewing my data in SQLDeveloper to make sure there are two versions for customer “Elia” by filtering the data.
The query below confirms that the type 2 SCD is active and working.
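A minimal sketch of that check, assuming the Pentaho default technical fields (version, date_from, date_to) and a cust_first_name column, both of which may differ from the actual customer_dim column names:

-- Both versions of the customer should come back, with non-overlapping date ranges.
-- cust_first_name is assumed; substitute the actual name column in customer_dim.
select customer_dim_id, cust_first_name, version, date_from, date_to
from customer_dim
where cust_first_name = 'Elia'
order by version;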
Exercise 1 Importing and Transforming Sales Representatives as Type 2 SCD
Successfully ran the Type 2 SCD transformation. For this exercise, I first created a CSV file
input step for the Sales Representative CSV file and then added a Dimension lookup/update step that
inserts a new row whenever the incoming data changes, which is Type 2 SCD behavior. When I queried the data in
SQLDeveloper, the new columns such as version and salesrep_dim_id were populated successfully.
Viewing my data in SQLDeveloper to make sure the data is transformed properly and successfully loaded in.
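A query of roughly this shape shows the surrogate key and technical columns added by the Dimension lookup/update step (version is the Pentaho default technical field name and is an assumption; the other columns appear in the combined view later in this document):

select salesrep_dim_id, salesrep_id, first_name, last_name, version
from salesrep_dim
order by salesrep_id, version;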
Implementing Type 1 Slowly Changing Dimension Update in Place
Successfully running the Type 1 SCD transformation.
After updating the Products CSV file and running the transformation again, I used SQLDeveloper to query
product_id = 2302 to check that the Type 1 SCD is active and working. As we can see, the product status has
been updated to “obsolete” while the row remains at version 1 and no additional rows were inserted. This shows
that the transformation is working.
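The check was roughly the following; version is assumed to be the default Pentaho technical field name:

-- Type 1 SCD updates in place, so a single row at version 1 should come back
-- with the new 'obsolete' status.
select products_dim_id, product_id, product_status, version
from products_dim
where product_id = 2302;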
Exercise 2 Importing and Transforming Promotions as Type 1 SCD
For this exercise, I followed the same steps as in Exercise 1, but instead of “insert” I set the update type to
“punch through” to implement the Type 1 SCD, which updates the existing row in place instead of inserting a new row.
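A similar spot check against the promotion dimension looks like this (version is again assumed to be the default technical field; the other column names are taken from the combined view below):

-- After a punch-through update there should still be only one row per promotion,
-- carrying the new values.
select promotion_dim_id, promo_id, promo_name, version
from promotion_dim
order by promo_id;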
Generating a Date Dimension table
Exercise 3 Embellishing the Date Dimension with Additional Date information
For this exercise, I embellished the date dimension by adding Sales Day of Week, Sales Day of Week
Name, and Sales Day of the Month. I followed the format shown in the tutorial and added the new fields
to the Calculator step of the transformation.
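The new fields can be spot checked with a quick sample query against the date dimension (column names as they appear in the combined view below):

select sales_date, sales_day_of_week, sales_day_of_week_name, sales_day_of_the_month
from date_dim
where rownum <= 10;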
Exercise 4 Embellishing the Date Dimension with Holiday information
For this exercise, I imported the Microsoft Excel holiday file and added a Stream lookup step.
I then updated the Dimension lookup/update with the holiday fields and matched the Sales_Date
field against the HOLIDAY_DATE field in the Stream lookup.
I queried the date dimension, filtering for the dates that are holidays, to ensure that the embellished
date dimension is updated and active.
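The filter was along these lines; rather than assuming what value the IS_A_HOLIDAY flag stores, this sketch filters on a non-empty HOLIDAY_DESCRIPTION:

select sales_date, is_a_holiday, holiday_description
from date_dim
where holiday_description is not null
order by sales_date;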
Developing the Dimensional Lookup transformations for loading Fact Tables
For this part, I imported the Orders CSV file and created five dimension lookups, one for each dimension,
with the update option disabled so that they only return the surrogate keys, along with the calculation of
the total sales. With this we have our fact table that connects all our dimensions together.
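A quick sanity check that every fact row resolved its dimension keys and carries the calculated total can be run afterwards; this is only a verification sketch, not part of the transformation itself:

-- count(column) counts non-null values, so each with_ column should equal fact_rows.
select count(*) as fact_rows,
       count(customer_dim_id) as with_customer,
       count(products_dim_id) as with_product,
       count(promotion_dim_id) as with_promotion,
       count(salesrep_dim_id) as with_salesrep,
       count(date_dim_id) as with_date,
       sum(dollars_sold) as total_dollars_sold
from sales_fact;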
Exercise 5 Importing new Sales Data
For this exercise, I changed the CSV input to the new “orders_export_new.csv” and ran the
transformation to import the 18 new records into the sales fact table.
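A simple way to verify the load is to compare the fact table's row count before and after running the transformation; the difference should be the 18 new records:

select count(*) as fact_rows
from sales_fact;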
SQL Code (Combined view with no duplicated columns)
create view dataset as
select
c.*,
d.SALES_MONTH_NAME,
d.SALES_DAY_OF_WEEK_NAME,
d.HOLIDAY_DESCRIPTION,
d.IS_A_HOLIDAY,
d.SALES_DATE,
d.SALES_DAY_OF_YEAR,
d.SALES_MONTH,
d.SALES_YEAR,
d.SALES_QUARTER,
d.SALES_DAY_OF_WEEK,
d.SALES_DAY_OF_THE_MONTH,
s.ORDER_ID,
s.SALESREP_DIM_ID,
s.PRODUCTS_DIM_ID,
s.PROMOTION_DIM_ID,
s.DATE_DIM_ID,
s.DOLLARS_SOLD,
s.AMOUNT_SOLD,
p.PRODUCT_NAME,
p.LANGUAGE_ID,
p.PRODUCT_STATUS,
p.WARRANTY_PERIOD,
p.PRODUCT_DESCRIPTION,
p.CATALOG_URL,
p.SUB_CATEGORY_NAME,
p.SUB_CATEGORY_DESCRIPTION,
p.CATEGORY_NAME,
p.PRODUCT_ID,
p.MIN_PRICE,
p.LIST_PRICE,
p.SUPPLIER_ID,
p.WEIGHT_CLASS,
p.CATEGORY_ID,
p.PARENT_CATEGORY_ID,
pr.PROMO_NAME,
pr.PROMOTION_ID,
pr.PROMO_ID,
sr.FIRST_NAME,
sr.LAST_NAME,
sr.EMAIL,
sr.PHONE_NUMBER,
sr.JOB_ID,
sr.COMMISSION_PCT,
sr.SALESREP_ID,
sr.EMPLOYEE_ID,
sr.HIRE_DATE,
sr.SALARY,
sr.MANAGER_ID,
sr.DEPARTMENT_ID
from sales_fact s
INNER JOIN customer_dim c on s.customer_dim_id = c.customer_dim_id
INNER JOIN date_dim d on s.date_dim_id = d.date_dim_id
INNER JOIN products_dim p on s.products_dim_id = p.products_dim_id
INNER JOIN promotion_dim pr on s.promotion_dim_id = pr.promotion_dim_id
INNER JOIN salesrep_dim sr on s.salesrep_dim_id = sr.salesrep_dim_id;
Data Cleaning with Python
The dataset I exported using SQL seemed to have a lot of rows, so I wanted to do some
data profiling and cleaning. I imported the data into a Jupyter Notebook using the pandas library
in Python, checked for duplicated rows, and found that the dataset had 402 duplicate rows. I
dropped these duplicate rows and then exported the data as a new CSV file to later
import into Tableau for analysis.
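For reference, the same duplicate count could also be confirmed in SQLDeveloper before exporting by comparing the view's total row count to its distinct row count; this SQL sketch is only a complementary check, not the pandas code used in the notebook:

-- total_rows minus distinct_rows gives the number of duplicate rows in the view.
select
  (select count(*) from dataset) as total_rows,
  (select count(*) from (select distinct * from dataset)) as distinct_rows
from dual;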
Tableau
For my Tableau KPI analysis, I created two visualizations and combined them into one
dashboard. The first KPI is total sales broken down by product category and state. With this we can
identify which category has the highest revenue and which state or province generates the most
revenue within each product category. For example, this visualization shows that the
hardware category has the highest revenue, and that Minnesota has the lowest sales within
each category.
For the second graph, I created a time-series line graph visualization. This graph
shows the total sales, credit limit, and amount sold over time. The line itself represents total
sales, and its thickness indicates the credit limit over time. We can see trends such as the credit
limit generally increasing as total sales increase or start to increase. I also
added a bar graph to this chart using the dual-axis feature in Tableau. Here we can see
patterns such as the quantity (amount sold) rising when total sales rise. We can conclude
that these three variables are positively correlated, since they tend to increase or
decrease together.
Conclusion
This assignment took me approximately 21 to 24 hours to complete, from reading the
homework outline and downloading the software to the actual work itself. I was able to finish it
by starting early and putting in around 2-3 hours every day. The most
difficult part of the assignment for me was the ETL process, because it was
something completely new to me. At first, I had trouble connecting my Oracle ATP database to
Pentaho, but after much trial and error, I now know how to connect to it effectively and
efficiently. I also made many small mistakes during my ETL transformations; however, after
multiple revisions and debugging sessions, I feel that I have gotten more used to how the transformation
process should be implemented. I was comfortable with exporting the data with SQL, cleaning
the data with Python, and analyzing it with Tableau. Overall, I feel that this homework assignment
was a very helpful reference for the ETL part of the group project, and I gained a lot of knowledge that
I can apply to my group project.