CIS 4400 Data Warehousing for
Analytics Project Fall 2020
By: Fullwin Liang
Course Number: 36156
Course Section: CIS4400-CMWA
INTRODUCTION
Our organization specializes in big data analysis and data science. Our company gathers
data from various open sources and analyzes historical data to find trends and patterns,
gaining insights that help us forecast the future. The data used for this analysis is the NYC 311
Complaints dataset, and our focus within it is specifically “Illegal Parking”. Throughout NYC,
countless tickets are given to cars that park illegally. Looking at the NYC 311 complaints from
the beginning of 2020 through September 1st, we can see that there were 113,687 complaints
regarding illegal parking.
We designed a data warehouse and used a database management system, accessed through
Oracle SQL Developer, to analyze the NYC illegal parking complaints. We query the data and
aggregate variables such as location, agency, and descriptor to answer questions such as: which
borough has the most complaints about illegal parking, and how were the incidents reported? We
also display data visualizations and KPI metrics using business intelligence tools such as
Tableau to find interesting trends in our dataset.
DATA SOURCE
NYC 311 Complaint: https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
PYTHON DATA CLEANING
The columns ["Address Type","Facility Type","Due Date","Vehicle Type","Taxi Company
Borough","Taxi Pick Up Location","Bridge Highway Name","Bridge Highway Direction","Road
Ramp","Bridge Highway Segment"] have all rows 113,687 null values. As we can see from the
bar chart visualization, they have 0 records, so we have decided to drop these columns from our
dataset.
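A minimal sketch of this cleaning step (assuming the raw export has been read into a DataFrame named df, as in the notebook excerpts below; the file name is illustrative):

import pandas as pd

# Load the raw 311 export (file name is illustrative).
df = pd.read_csv("311_illegal_parking_complaints.csv")

# Identify the columns that are null in every one of the 113,687 rows,
# then drop them from the dataset.
all_null_cols = [col for col in df.columns if df[col].isna().all()]
df = df.drop(columns=all_null_cols)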
DIMENSIONAL MODEL DIAGRAM (Lucid Chart)
Exporting Separate CSV Files with Python for the ETL Process
In [16]: status = df[["Unique Key", "Status"]]
         status

Out[16]:
        Unique Key  Status
0         43671448  Closed
1         45840022  Closed
2         45844614  Closed
3         45846273  Closed
4         45847260  Closed
...            ...     ...
113682    47179474  Closed
113683    47182556  Closed
113684    47185590  Closed
113685    47185636  Closed
113686    47189565  Closed

113687 rows x 2 columns
In [17]: status.to_csv('status.csv', index=False)  # Exporting csv
In [18]: illegal_parking_fact = df[["Unique Key", "Created Date", "Closed Date"]]
         illegal_parking_fact

Out[18]:
        Unique Key            Created Date             Closed Date
0         43671448  04/29/2020 02:45:50 PM  04/29/2020 03:05:04 PM
1         45840022  03/17/2020 12:39:38 AM  03/17/2020 02:03:48 AM
2         45844614  03/17/2020 08:28:06 PM  03/18/2020 08:36:07 AM
3         45846273  03/17/2020 10:47:26 PM  03/18/2020 09:03:29 AM
...            ...                     ...                     ...
113682    47179474  08/09/2020 02:48:36 PM  08/09/2020 03:06:09 PM
113683    47182556  08/09/2020 08:39:06 AM  08/09/2020 12:26:28 PM
113684    47185590  08/09/2020 05:53:02 AM  08/09/2020 06:35:14 AM
113685    47185636  08/09/2020 09:16:30 PM  08/09/2020 09:34:56 PM
113686    47189565  08/10/2020 04:26:15 PM  08/10/2020 05:17:29 PM

113687 rows x 3 columns
In [19]: illegal_parking_fact.to_csv('illegal_parking_fact.csv', index=False)  # Exporting csv
ETL PROCESS
First, we import our dataset as a CSV file. During the Select Values step, we keep only the fields
used in the dimension. For the Agency dimension lookup/update, we select “Agency” and
“Agency Name” and rename “Agency Name” to “Agency_Name” to remove the space, which could
cause problems later on. We load the data into Oracle by connecting to Oracle Cloud, then name
our target table “agency_dim” and create a surrogate key called “agency_dim_id” in the
Dimension Lookup/Update step.
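We performed this in Pentaho’s GUI, but a rough pandas equivalent of the transformation (the sequential surrogate key is an illustrative stand-in for the one Pentaho generates) would be:

# Select, rename, de-duplicate, then attach a surrogate key.
agency_dim = (
    df[["Agency", "Agency Name"]]
    .rename(columns={"Agency Name": "Agency_Name"})
    .drop_duplicates()
    .reset_index(drop=True)
)
agency_dim["agency_dim_id"] = agency_dim.index + 1  # sequential surrogate key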
Following similar steps as the Agency dimension lookup/update, we select “Open Data Channel
Type” in the Select Values step and rename it to “Open_Data_Channel_Type”, set our target table
to “channel_dim”, and create a surrogate key called “channel_dim_id”.
We generate “Created_Date” by using the calculation step “StartDate + increment_date”, and also
derive other new fields from “Created_Date”.
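A hedged Python sketch of this date generation (our actual fields were created in Pentaho; the derived field names are assumptions):

import pandas as pd

# One row per day, mirroring "StartDate + increment_date".
date_dim = pd.DataFrame(
    {"Created_Date": pd.date_range(start="2020-01-01", end="2020-09-01", freq="D")}
)
date_dim["Year"] = date_dim["Created_Date"].dt.year
date_dim["Month"] = date_dim["Created_Date"].dt.month
date_dim["Day"] = date_dim["Created_Date"].dt.day
date_dim["Day_Of_Week"] = date_dim["Created_Date"].dt.day_name()
date_dim["date_dim_id"] = date_dim.index + 1  # surrogate key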
We select all the fields about location except “Location”, “X Coordinate (State Plane)”, and
“Y Coordinate (State Plane)”, because “Location” duplicates the latitude and longitude, and the
X and Y coordinates are numbers that are hard to load and use. We replace spaces with
underscores when renaming those fields, set “location_dim” as the target table, and create
“location_dim_id” as the surrogate key.
We set “status_dim” as the target table and create “status_dim_id” as the surrogate key.
For the first step, we import a time data file provided by Professor Holowczak in xls format and
change “time_dim_id”, “hours”, “minutes”, and “second” to the Integer type in the fields section.
The second step is to load the time data into Oracle: we set “new_time_dim” as the target table
and create a surrogate key called “new_time_dim_id”.
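For intuition, here is a hedged sketch of what such a time dimension looks like (one row per second of the day; the professor’s actual file may differ):

import numpy as np
import pandas as pd

secs = np.arange(24 * 60 * 60)  # 86,400 rows: one per second of the day
time_dim = pd.DataFrame({
    "new_time_dim_id": secs + 1,      # surrogate key
    "hours": secs // 3600,
    "minutes": (secs % 3600) // 60,
    "second": secs % 60,
})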
After running all the other dimension lookup/updates, we can start to work on the fact table
lookup. First, we confirm that the format of the data is the same as our raw data. Then we select
all the fields we need in the Select Values step and rename them as we did before. In the
calculation step, we shave the time off the date as shown below. We make sure the conversion
mask matches the format of the “Created_Date” we produced in the date-generation step. Then
we connect all the dimension lookup pieces together. Since they are used only for looking up
data, we uncheck the “Update the dimension?” and “Enable the cache?” options. The final step is
to add a Table Output step that receives all the fields of the complaint fact table; in our case,
that is all the dimension IDs plus the “Unique Key”.
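Conceptually, this fact build is a series of surrogate-key lookups followed by a projection. A rough pandas sketch (showing only two of our dimensions; the merge keys are illustrative):

# Shave the time off the raw timestamp so it matches the generated "Created_Date".
df["Created_Date"] = pd.to_datetime(df["Created Date"]).dt.normalize()

# Look up each dimension's surrogate key, then keep only the IDs and "Unique Key".
complaint_fact = (
    df.merge(agency_dim[["Agency", "agency_dim_id"]], on="Agency", how="left")
      .merge(date_dim[["Created_Date", "date_dim_id"]], on="Created_Date", how="left")
)[["Unique Key", "agency_dim_id", "date_dim_id"]]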
Tips for ETL processing:
Test the connection every time before executing SQL.
After executing the SQL, we can run the transformation to load the data into the Oracle Cloud database.
EXTRACTING JOINED DATASET AS CSV WITH SQL DEVELOPER
PHYSICAL MODEL OF OUR DATA (SQL DEVELOPER):
Data Visualization and Analytics (PYTHON)
KPI:
Which boroughs have the highest and lowest incidence of illegal parking reports? (A pandas sketch follows this list.)
Which zip codes have the highest and lowest frequency of occurrence?
What is the ratio of the current statuses of reports in each borough?
What was the most common channel type used to report incidents in each borough?
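A minimal pandas sketch of the first KPI (using the dataset’s “Borough” column):

# Count reports per borough, sorted from most to fewest.
borough_counts = df["Borough"].value_counts()
print("Most reports:  ", borough_counts.index[0])
print("Fewest reports:", borough_counts.index[-1])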
This is a correlation heat map that shows the correlation coefficient between each pair of
variables. The closer the coefficient is to 1, the more positively correlated the two variables are.
With this chart we can easily tell which two variables are highly correlated.
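A sketch of how such a heat map can be produced (assuming seaborn and matplotlib are installed):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations over the numeric columns of the cleaned DataFrame.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heat Map")
plt.show()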
This is a geographical map of NYC showing each reported incident of illegal parking in each
borough, plotted using the latitude and longitude features of the data. As we can see, the points
trace NYC’s geography almost perfectly, giving us an idea of how the incidents are spread across
the city.
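A sketch of the scatter plot behind this map (using the dataset’s “Latitude”, “Longitude”, and “Borough” columns):

import matplotlib.pyplot as plt

# One small point per incident, colored by borough.
for borough, grp in df.groupby("Borough"):
    plt.scatter(grp["Longitude"], grp["Latitude"], s=1, label=borough)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Illegal Parking Incidents Across NYC")
plt.legend(markerscale=10)
plt.show()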
Lastly, we have these two horizontal bar charts. The bar chart on the left shows the frequency of
incidents reported by each channel type; Mobile is the highest while Phone is the lowest. The bar
chart on the right shows the total count of incidents in each borough: Brooklyn has the highest,
followed by Queens, Manhattan, and the Bronx, with Staten Island having the fewest incidents.
Data Visualization and Analytics (TABLEAU)
This vertical bar chart displays the status of incidents in each borough. We can easily see from
this graph that most cases are closed and only a small portion are in progress.
Count of boroughs with the specified channel type: this horizontal bar chart gives an overview
of how incidents are reported in each borough. As the chart shows, Mobile is the highest and
Phone is the lowest.
This is a geographical map of NYC. The size of each circle represents the number of incidents
that occurred in that specific zip code, while the color of the circle represents the borough it is
located in. This visualization allows people, even those unfamiliar with NYC, to easily see which
zip codes have the largest numbers of incidents and to locate the corresponding borough using
the legend.
CONCLUSION
Software/Tools Used:
Discord:
Our Discord server acted as a communication and collaboration tool for our group project; it is
where we communicated and shared files.
Microsoft Office (Word, Outlook):
We used Microsoft tools such as Word to create documents and reports, and Outlook to
communicate via email.
Google Drive (Docs):
We used Google Docs to work on reports and documents in parallel.
Oracle Cloud:
We utilized Oracle Cloud to create an ATP Database which acted as our Data Warehouse.
Oracle SQL Developer:
This was the database client we used to query data and write DDL and DML SQL statements.
Pentaho Data Integration Tool:
This was used for our ETL process.
Python:
This was used to clean data, dedupe data, organize data, transform data and create visualizations.
Tableau Desktop:
This business intelligence tool was used to create data visualizations and analyses for our KPI
metrics.
References: http://holowczak.com/ (Professor’s Website)
Proposal Benefits:
Throughout our analysis of the illegal parking complaint cases, we found that the data differs
greatly between areas. In our data, there are more illegal parking complaints in Brooklyn and
Queens, which we believe may be because these two boroughs contain more residential areas.
More residents mean more cars and fewer parking spaces for drivers. Our advice is to build more
underground parking facilities in Brooklyn and Queens, which may reduce the number of illegal
parking complaints. In addition, our data also shows which zip code in each borough has the most
incidents of illegal parking. In the zip codes with the most incidents, we propose sending
additional traffic officers to penalize those who break the parking laws, hopefully decreasing
the occurrence of illegal parking.