CIS 4400 Data Warehousing for
Analytics Project Fall 2020
By: Fullwin Liang
Course Number: 36156
Course Section: CIS4400-CMWA
INTRODUCTION
Our organization specializes in big data analysis and data science. Our company gathers
data from various open sources and analyzes historical data to find trends and patterns,
gaining insights that help us forecast the future. The data used for this analysis is the NYC 311
Complaints dataset, and our focus within it is specifically “Illegal Parking”. Throughout NYC,
countless tickets are given to cars that park illegally. Looking at the NYC 311 complaints from
the beginning of 2020 through September 1st, we can see that there were 113,687 complaints
regarding illegal parking.
We designed a data warehouse and used a database management system, accessed through
Oracle SQL Developer, to analyze the NYC illegal parking complaints. We query the data and
aggregate variables such as location, agency, and descriptor to answer questions such as: which
borough has the most complaints about illegal parking, and how were the incidents reported? We
also display data visualizations and KPI metrics using business intelligence tools such as
Tableau to find interesting trends in our dataset.
DATA SOURCE
NYC 311 Complaint: https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
PYTHON DATA CLEANING
The columns ["Address Type","Facility Type","Due Date","Vehicle Type","Taxi Company
Borough","Taxi Pick Up Location","Bridge Highway Name","Bridge Highway Direction","Road
Ramp","Bridge Highway Segment"] have all rows 113,687 null values. As we can see from the
bar chart visualization, they have 0 records, so we have decided to drop these columns from our
dataset.
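A minimal sketch of this cleaning step (assuming the raw export has been read into a DataFrame named df, as in the notebook excerpts below; the file name is illustrative):

import pandas as pd

# Load the raw 311 export (file name is illustrative).
df = pd.read_csv("311_illegal_parking_complaints.csv")

# Identify the columns that are null in every one of the 113,687 rows,
# then drop them from the dataset.
all_null_cols = [col for col in df.columns if df[col].isna().all()]
df = df.drop(columns=all_null_cols)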
DIMENSIONAL MODEL DIAGRAM (Lucid Chart)
Exporting Separate CSV Files with Python for the ETL Process
In [16]: status = df[["Unique Key", "Status"]]
         status

Out[16]:
        Unique Key  Status
0         43671448  Closed
1         45840022  Closed
2         45844614  Closed
3         45846273  Closed
4         45847260  Closed
...            ...     ...
113682    47179474  Closed
113683    47182556  Closed
113684    47185590  Closed
113685    47185636  Closed
113686    47189565  Closed

113687 rows x 2 columns
In [17]: status.to_csv('status.csv', index=False)  # Exporting csv
In [18]: illegal_parking_fact = df[["Unique Key", "Created Date", "Closed Date"]]
         illegal_parking_fact

Out[18]:
        Unique Key            Created Date             Closed Date
0         43671448  04/29/2020 02:45:50 PM  04/29/2020 03:05:04 PM
1         45840022  03/17/2020 12:39:38 AM  03/17/2020 02:03:48 AM
2         45844614  03/17/2020 08:28:06 PM  03/18/2020 08:36:07 AM
3         45846273  03/17/2020 10:47:26 PM  03/18/2020 09:03:29 AM
...            ...                     ...                     ...
113682    47179474  08/09/2020 02:48:36 PM  08/09/2020 03:06:09 PM
113683    47182556  08/09/2020 08:39:06 AM  08/09/2020 12:26:28 PM
113684    47185590  08/09/2020 05:53:02 AM  08/09/2020 06:35:14 AM
113685    47185636  08/09/2020 09:16:30 PM  08/09/2020 09:34:56 PM
113686    47189565  08/10/2020 04:26:15 PM  08/10/2020 05:17:29 PM

113687 rows x 3 columns
In [19]: illegal_parking_fact.to_csv('illegal_parking_fact.csv', index=False)  # Exporting csv
ETL PROCESS
First, we import our dataset as a CSV file. During the Select Values step, we keep only the fields
used in the dimension. For the Agency dimension lookup/update, we select “Agency” and
“Agency Name” and rename “Agency Name” to “Agency_Name” to remove the space, which could
cause problems later on. We load the data into Oracle by connecting to Oracle Cloud, then name
our target table “agency_dim” and create a surrogate key called “agency_dim_id” in the
Dimension Lookup/Update step.
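We performed this in Pentaho’s GUI, but a rough pandas equivalent of the transformation (the sequential surrogate key is an illustrative stand-in for the one Pentaho generates) would be:

# Select, rename, de-duplicate, then attach a surrogate key.
agency_dim = (
    df[["Agency", "Agency Name"]]
    .rename(columns={"Agency Name": "Agency_Name"})
    .drop_duplicates()
    .reset_index(drop=True)
)
agency_dim["agency_dim_id"] = agency_dim.index + 1  # sequential surrogate key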
Following similar steps as the Agency dimension lookup/update, we select “Open Data Channel
Type” in the Select Values step and rename it to “Open_Data_Channel_Type”, set our target table
to “channel_dim”, and create a surrogate key called “channel_dim_id”.
We generate “Created_Date” by using the calculation step “StartDate + increment_date”, and also
derive other new fields from “Created_Date”.
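A hedged Python sketch of this date generation (our actual fields were created in Pentaho; the derived field names are assumptions):

import pandas as pd

# One row per day, mirroring "StartDate + increment_date".
date_dim = pd.DataFrame(
    {"Created_Date": pd.date_range(start="2020-01-01", end="2020-09-01", freq="D")}
)
date_dim["Year"] = date_dim["Created_Date"].dt.year
date_dim["Month"] = date_dim["Created_Date"].dt.month
date_dim["Day"] = date_dim["Created_Date"].dt.day
date_dim["Day_Of_Week"] = date_dim["Created_Date"].dt.day_name()
date_dim["date_dim_id"] = date_dim.index + 1  # surrogate key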
We select all the fields about location except “Location”, “X Coordinate (State Plane)”, and
“Y Coordinate (State Plane)”, because “Location” duplicates the latitude and longitude, and the
X and Y coordinates are numbers that are hard to load and use. We replace spaces with
underscores when renaming those fields, set “location_dim” as the target table, and create
“location_dim_id” as the surrogate key.
We set “status_dim” as the target table and create “status_dim_id” as the surrogate key.
For the first step, we import a time data file provided by Professor Holowczak in xls format and
change “time_dim_id”, “hours”, “minutes”, and “second” to the Integer type in the fields section.
The second step is to load the time data into Oracle: we set “new_time_dim” as the target table
and create a surrogate key called “new_time_dim_id”.
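For intuition, here is a hedged sketch of what such a time dimension looks like (one row per second of the day; the professor’s actual file may differ):

import numpy as np
import pandas as pd

secs = np.arange(24 * 60 * 60)  # 86,400 rows: one per second of the day
time_dim = pd.DataFrame({
    "new_time_dim_id": secs + 1,      # surrogate key
    "hours": secs // 3600,
    "minutes": (secs % 3600) // 60,
    "second": secs % 60,
})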
After running all the other dimension lookup/updates, we can start to work on the fact table
lookup. First, we confirm that the format of the data is the same as our raw data. Then we select
all the fields we need in the Select Values step and rename them as we did before. In the
calculation step, we shave the time off the date as shown below. We make sure the conversion
mask matches the format of the “Created_Date” we produced in the date-generation step. Then
we connect all the dimension lookup pieces together. Since they are used only for looking up
data, we uncheck the “Update the dimension?” and “Enable the cache?” options. The final step is
to add a Table Output step that receives all the fields of the complaint fact table; in our case,
that is all the dimension IDs plus the “Unique Key”.
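Conceptually, this fact build is a series of surrogate-key lookups followed by a projection. A rough pandas sketch (showing only two of our dimensions; the merge keys are illustrative):

# Shave the time off the raw timestamp so it matches the generated "Created_Date".
df["Created_Date"] = pd.to_datetime(df["Created Date"]).dt.normalize()

# Look up each dimension's surrogate key, then keep only the IDs and "Unique Key".
complaint_fact = (
    df.merge(agency_dim[["Agency", "agency_dim_id"]], on="Agency", how="left")
      .merge(date_dim[["Created_Date", "date_dim_id"]], on="Created_Date", how="left")
)[["Unique Key", "agency_dim_id", "date_dim_id"]]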
Tips for ETL processing:
Test the connection every time before executing SQL.
After executing the SQL, we can run the transformation to load the data into the Oracle Cloud database.
EXTRACTING JOINED DATASET AS CSV WITH SQL DEVELOPER
PHYSICAL MODEL OF OUR DATA (SQL DEVELOPER):
Data Visualization and Analytics (PYTHON)
KPI:
Which boroughs have the highest and lowest incidence of illegal parking reports? (A pandas sketch follows this list.)
Which zip codes have the highest and lowest frequency of occurrence?
What is the ratio of the current statuses of reports in each borough?
What was the most common channel type used to report incidents in each borough?
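A minimal pandas sketch of the first KPI (using the dataset’s “Borough” column):

# Count reports per borough, sorted from most to fewest.
borough_counts = df["Borough"].value_counts()
print("Most reports:  ", borough_counts.index[0])
print("Fewest reports:", borough_counts.index[-1])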
This is a correlation heat map that shows the correlation coefficient between each pair of
variables. The closer the coefficient is to 1, the more positively correlated the two variables are.
With this chart we can easily tell which two variables are highly correlated.
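A sketch of how such a heat map can be produced (assuming seaborn and matplotlib are installed):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations over the numeric columns of the cleaned DataFrame.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heat Map")
plt.show()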
This is a geographical map of NYC showing each reported incident of illegal parking in each
borough, plotted using the latitude and longitude features of the data. As we can see, the points
trace NYC’s geography almost perfectly, giving us an idea of how the incidents are spread across
the city.
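A sketch of the scatter plot behind this map (using the dataset’s “Latitude”, “Longitude”, and “Borough” columns):

import matplotlib.pyplot as plt

# One small point per incident, colored by borough.
for borough, grp in df.groupby("Borough"):
    plt.scatter(grp["Longitude"], grp["Latitude"], s=1, label=borough)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Illegal Parking Incidents Across NYC")
plt.legend(markerscale=10)
plt.show()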
Lastly, we have these two horizontal bar charts. The bar chart on the left shows the frequency of
incidents reported by each channel type; Mobile is the highest while Phone is the lowest. The bar
chart on the right shows the total count of incidents in each borough: Brooklyn has the highest,
followed by Queens, Manhattan, and the Bronx, with Staten Island having the fewest incidents.
Data Visualization and Analytics (TABLEAU)
This vertical bar chart displays the status of incidents in each borough. We can easily see from
this graph that most cases are closed and only a small portion are in progress.
Count of boroughs with the specified channel type: this horizontal bar chart gives an overview
of how incidents are reported in each borough. As the chart shows, Mobile is the highest and
Phone is the lowest.
This is a geographical map of NYC. The size of each circle represents the number of incidents
that occurred in that specific zip code, while the color of the circle represents the borough it is
located in. This visualization allows people, even those unfamiliar with NYC, to easily see which
zip codes have the largest numbers of incidents and to locate the corresponding borough using
the legend.
CONCLUSION
Software/Tools Used:
Discord:
Our Discord server acted as a communication and collaboration tool for our group project; it is
where we communicated and shared files.
Microsoft Office (Word, Outlook):
We used Microsoft tools such as Word to create documents and reports, and Outlook to
communicate via email.
Google Drive (Docs):
We used Google Docs to work on reports and documents in parallel.
Oracle Cloud:
We utilized Oracle Cloud to create an ATP Database which acted as our Data Warehouse.
Oracle SQL Developer:
This was the database client we used to query data and write DDL and DML SQL statements.
Pentaho Data Integration Tool:
This was used for our ETL process.
Python:
This was used to clean data, dedupe data, organize data, transform data and create visualizations.
Tableau Desktop:
This business intelligence tool was used to create data visualizations and analyses for our KPI
metrics.
References: http://holowczak.com/ (Professor’s Website)
Proposal Benefits:
Throughout our analysis of the illegal parking complaint cases, we found that the data differs
greatly between areas. In our data, there are more illegal parking complaints in Brooklyn and
Queens, which we believe may be because these two boroughs contain more residential areas.
More residents mean more cars and fewer parking spaces for drivers. Our advice is to build more
underground parking facilities in Brooklyn and Queens, which may reduce the number of illegal
parking complaints. In addition, our data also shows which zip code in each borough has the most
incidents of illegal parking. In the zip codes with the most incidents, we propose sending
additional traffic officers to penalize those who break the parking laws, hopefully decreasing
the occurrence of illegal parking.