Data Table

Out[107]:
      gender   age  hypertension  heart_disease  ever_married      work_type  Residence_type  avg_glucose_level   bmi   smoking_status  stroke
   0    Male  67.0             0              1           Yes        Private           Urban             228.69  36.6  formerly smoked       1
   1  Female  61.0             0              0           Yes  Self-employed           Rural             202.21   NaN     never smoked       1
   2    Male  80.0             0              1           Yes        Private           Rural             105.92  32.5     never smoked       1
   3  Female  49.0             0              0           Yes        Private           Urban             171.23  34.4           smokes       1
   4  Female  79.0             1              0           Yes  Self-employed           Rural             174.12  24.0     never smoked       1
 ...     ...   ...           ...            ...           ...            ...             ...                ...   ...              ...     ...
5105  Female  80.0             1              0           Yes        Private           Urban              83.75   NaN     never smoked       0
5106  Female  81.0             0              0           Yes  Self-employed           Urban             125.20  40.0     never smoked       0
5107  Female  35.0             0              0           Yes  Self-employed           Rural              82.99  30.6     never smoked       0
5108    Male  51.0             0              0           Yes        Private           Rural             166.29  25.6  formerly smoked       0
5109  Female  44.0             0              0           Yes       Govt_job           Urban              85.28  26.2          Unknown       0

5110 rows × 11 columns

Data Profiling

Out[108]:
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64
Out[109]:
Urban    2596
Rural    2514
Name: Residence_type, dtype: int64
Out[110]:
Yes    3353
No     1757
Name: ever_married, dtype: int64
Out[111]:
Female    2994
Male      2115
Other        1
Name: gender, dtype: int64
Out[112]:
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64
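
The code cells that produced these counts are not part of this export. A minimal sketch of how per-category counts like these are commonly obtained with pandas, assuming a frame named `df` (the name and the rows below are made up for illustration, not the stroke data):

```python
import pandas as pd

# Toy stand-in for the stroke table; these rows are invented.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other"],
    "smoking_status": ["never smoked", "Unknown", "never smoked", "smokes"],
})

# value_counts() tallies each distinct value, most frequent first.
gender_counts = df["gender"].value_counts()
smoking_counts = df["smoking_status"].value_counts()

print(gender_counts)
print(smoking_counts)
```

Running the same call on each categorical column gives one output block per column, as in the listing above.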

Checking for null values

Out[114]:
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
Out[115]:
count    4908.00000
mean       28.89456
std         7.85432
min        10.30000
25%        23.50000
50%        28.10000
75%        33.10000
max        97.60000
Name: bmi, dtype: float64
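
A short sketch of the two checks above, assuming the usual pandas calls `isnull().sum()` and `describe()`; the actual cells are not in this export. The four rows reuse the age and bmi values of rows 0 to 3 of the data table, and the frame name `df` is an assumption:

```python
import numpy as np
import pandas as pd

# Rows 0-3 of the data table, restricted to age and bmi.
df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Number of missing values in each column.
null_counts = df.isnull().sum()

# Summary statistics; `count` excludes missing values.
bmi_summary = df["bmi"].describe()

print(null_counts)
print(bmi_summary)
```

Because `describe()` skips missing values, its `count` is the number of non-null bmi entries rather than the number of rows.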
Out[116]:
0       False
1       False
2       False
3       False
4       False
        ...  
5105    False
5106    False
5107    False
5108    False
5109     True
Name: smoking_status, Length: 5109, dtype: bool
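
The boolean output above is consistent with an element-wise comparison against the "Unknown" category: row 5109, whose smoking_status is Unknown in the data table, is the only True shown. The cell itself is not in the export, so the following is only a guess at the kind of expression involved, run on made-up rows:

```python
import pandas as pd

# Invented rows; only the column name comes from the notebook.
df = pd.DataFrame({
    "smoking_status": ["formerly smoked", "never smoked", "Unknown"],
})

# True where the status is the placeholder category "Unknown".
is_unknown = df["smoking_status"] == "Unknown"

# Summing a boolean Series counts the True entries.
n_unknown = int(is_unknown.sum())

print(is_unknown)
print(n_unknown)
```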

No Duplicated Rows Returned

Out[117]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
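
The header-only frame above means the duplicate filter matched no rows. A minimal sketch of such a filter, assuming `DataFrame.duplicated()` was used (the export does not show the cell); the frame below is a toy example:

```python
import pandas as pd

# Toy frame in which every row is unique.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "age": [67.0, 61.0, 80.0],
})

# duplicated() marks rows that repeat an earlier row; selecting on the
# mask keeps only those repeats.
dupes = df[df.duplicated()]

print(len(dupes))
```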

Total Null Values After Dropping Rows

Out[119]:
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
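
A sketch of how rows with a missing bmi could be dropped so that the null count reaches zero, assuming `dropna` was the method used (not confirmed by this export). The three rows reuse age and bmi from rows 0 to 2 of the data table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Drop every row whose bmi is missing; other columns are not inspected.
cleaned = df.dropna(subset=["bmi"])

print(cleaned.isnull().sum())
print(list(cleaned.index))
```

`dropna` keeps the original index labels, which would explain why the encoded table further down jumps from index 0 to index 2.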

EDA

Out[121]:
[Figure: seaborn FacetGrid plot]
Out[122]:
[Figure: single-axes plot]
Out[123]:
[Figure: 3 × 2 grid of histograms titled age, hypertension, heart_disease, avg_glucose_level, bmi, stroke]
Out[124]:
[Figure: density plot, x-axis age, y-axis Density; drawn with seaborn's `distplot`, which seaborn reports as deprecated in favour of `displot` or `histplot`]

Random Forest Classification Modeling

Out[128]:
       age  hypertension  heart_disease  avg_glucose_level   bmi  \
   0  67.0             0              1             228.69  36.6
   2  80.0             0              1             105.92  32.5
   3  49.0             0              0             171.23  34.4
   4  79.0             1              0             174.12  24.0
   5  81.0             0              0             186.21  29.0
 ...   ...           ...            ...                ...   ...
5104  13.0             0              0             103.08  18.6
5106  81.0             0              0             125.20  40.0
5107  35.0             0              0              82.99  30.6
5108  51.0             0              0             166.29  25.6
5109  44.0             0              0              85.28  26.2

      gender_Female  gender_Male  ever_married_No  ever_married_Yes  work_type_Govt_job  \
   0              0            1                0                 1                   0
   2              0            1                0                 1                   0
   3              1            0                0                 1                   0
   4              1            0                0                 1                   0
   5              0            1                0                 1                   0
 ...            ...          ...              ...               ...                 ...
5104              1            0                1                 0                   0
5106              1            0                0                 1                   0
5107              1            0                0                 1                   0
5108              0            1                0                 1                   0
5109              1            0                0                 1                   1

      work_type_Never_worked  work_type_Private  work_type_Self-employed  work_type_children  Residence_type_Rural  \
   0                       0                  1                        0                   0                     0
   2                       0                  1                        0                   0                     1
   3                       0                  1                        0                   0                     0
   4                       0                  0                        1                   0                     1
   5                       0                  1                        0                   0                     0
 ...                     ...                ...                      ...                 ...                   ...
5104                       0                  0                        0                   1                     1
5106                       0                  0                        1                   0                     0
5107                       0                  0                        1                   0                     1
5108                       0                  1                        0                   0                     1
5109                       0                  0                        0                   0                     0

      Residence_type_Urban  smoking_status_Unknown  smoking_status_formerly smoked  smoking_status_never smoked  smoking_status_smokes
   0                     1                       0                               1                            0                      0
   2                     0                       0                               0                            1                      0
   3                     1                       0                               0                            0                      1
   4                     0                       0                               0                            1                      0
   5                     1                       0                               1                            0                      0
 ...                   ...                     ...                             ...                          ...                    ...
5104                     0                       1                               0                            0                      0
5106                     1                       0                               0                            1                      0
5107                     0                       0                               0                            1                      0
5108                     0                       0                               1                            0                      0
5109                     1                       1                               0                            0                      0

4908 rows × 20 columns

Out[129]:
0       1
2       1
3       1
4       1
5       1
       ..
5104    0
5106    0
5107    0
5108    0
5109    0
Name: stroke, Length: 4908, dtype: int64
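
The `column_value` names in the feature table (for example `gender_Male`, `Residence_type_Urban`) match what `pandas.get_dummies` produces, so the encoding step may have looked roughly like the sketch below. This is an assumption, since the cell is not in the export; the frame name `raw` and the choice of `dtype=int` are illustrative. The three rows reuse values from rows 0, 2 and 3 of the data table:

```python
import pandas as pd

raw = pd.DataFrame({
    "age": [67.0, 80.0, 49.0],
    "gender": ["Male", "Male", "Female"],
    "Residence_type": ["Urban", "Rural", "Urban"],
    "stroke": [1, 1, 1],
})

# Features: everything except the target, with string columns expanded
# into one 0/1 indicator column per category.
X = pd.get_dummies(raw.drop(columns="stroke"), dtype=int)

# Target: the stroke flag.
y = raw["stroke"]

print(list(X.columns))
```

Numeric columns pass through unchanged, and each categorical column becomes one indicator column per category.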
Out[133]:
RandomForestClassifier()
Accuracy: 0.9572301425661914
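
A minimal sketch of how an accuracy figure like the one above is typically computed with scikit-learn. The split ratio, the random seeds and the synthetic data are all assumptions made so the example runs on its own; none of them come from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features and labels (assumption), not the stroke data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 30% of the rows for evaluation; the notebook's split is unknown.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))

# Share of the most common class in the test labels, as a reference point.
majority_share = max(y_test.mean(), 1 - y_test.mean())

print("Accuracy:", accuracy)
print("Majority-class share:", majority_share)
```

When one class is rare, accuracy alone can sit close to the majority-class share, so printing both makes the score easier to interpret.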
Out[136]:
RandomForestClassifier(n_jobs=1)
Out[137]:
avg_glucose_level                 0.255469
age                               0.229505
bmi                               0.229411
heart_disease                     0.028599
hypertension                      0.028048
smoking_status_never smoked       0.021126
work_type_Private                 0.021091
Residence_type_Rural              0.020894
work_type_Self-employed           0.020834
Residence_type_Urban              0.020261
smoking_status_formerly smoked    0.019080
smoking_status_smokes             0.018621
gender_Male                       0.018620
gender_Female                     0.018482
smoking_status_Unknown            0.015896
work_type_Govt_job                0.015478
ever_married_Yes                  0.010221
ever_married_No                   0.007825
work_type_children                0.000540
work_type_Never_worked            0.000002
dtype: float64
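
A ranking like the one above can be produced by pairing a fitted forest's `feature_importances_` with the feature names and sorting. The sketch below assumes that approach; the data is synthetic and the three column names are borrowed from the table only as labels:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): the label depends on the "age" column only.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["age", "bmi", "avg_glucose_level"],
)
y = (X["age"] > 0).astype(int)

model = RandomForestClassifier(n_jobs=1, random_state=0).fit(X, y)

# One importance per feature, labelled and sorted from highest to lowest.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

scikit-learn normalises these impurity-based importances so that they sum to 1, which is also true of the values listed above up to rounding.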