BIOS 534 Homework 3

Important Information

Last Update: 04/09/24

  • 30 Points (see individual sections for specific point values)

  • Due by 11:59 PM (Canvas Clock Time) on April 24, 2024

  • No extensions will be granted (Please let us know of medical emergencies ASAP)

  • Read problem descriptions carefully. Make note of any random_state values that need to be set

  • Send questions to TAs danwei.yao@emory.edu and sjahojun.yu@emory.edu

Please use Python scikit-learn and Jupyter / Jupyter Lab to perform your work. Please compile all results into a single PDF file and submit via Emory Canvas. Code must be included in the submission.

In Jupyter or Jupyter Lab you can use File -> Save and Export Notebook As -> PDF to create a PDF file. If you have problems with this procedure please contact the TAs for advice.

If you cannot get the PDF generated, then submit your Jupyter Notebook instead and inform the TAs that you were unable to create a PDF.

Academic Honesty policies apply. You may not share, accept, or co-develop code for any of the problems. This includes the use of AI technology such as ChatGPT or Copilot to arrive at a solution. All code is to be your own. You may not share or accept code from current or former students of this class or any other source.

1. Regression (15 Total Points)

Introduction

Please access the URL below to obtain a data set which represents data related to standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland around the year 1888.

This represents a data frame with 47 observations on 6 variables, each of which represents a percentage. All variables but ‘Fertility’ give proportions of the population.

https://raw.githubusercontent.com/steviep42/bios_534/main/data/swiss.csv

We are interested in predicting Infant_Mortality

Variable Name Description
Fertility Common standardized fertility measure
Agriculture Percentage of males involved in agriculture as occupation
Examination Percentage of draftees receiving highest mark on army examination
Education Percentage of education beyond primary school for draftees
Catholic Percentage of ‘catholic’ (as opposed to ‘protestant’)
Infant_Mortality Percentage of live births who live less than 1 year

The first 5 rows should look like the following:

Fertility Agriculture Examination Education Catholic Infant_Mortality
0 80.2 17.0 15 12 9.96 22.2
1 83.1 45.1 6 9 84.84 22.2
2 92.5 39.7 5 5 93.40 20.2
3 85.8 36.5 12 7 33.77 20.3
4 76.9 43.5 17 15 5.16 20.6

a) (1-point) Training Data

  1. Create a training and testing pair of data sets with a 75/25 Training / Test ratio
  2. Specify a random_state of 0 when creating the train_test_split
  3. We will be predicting Infant_Mortality
  4. All other columns will be considered as predictors until otherwise instructed
X_train shape: (35, 5)
X_test shape:  (12, 5)
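A minimal sketch of this step, assuming the swiss.csv column names shown above. The small stand-in frame below (the first five rows from the table plus three made-up rows, purely for illustration) takes the place of the actual `pd.read_csv` call on the URL:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for: swiss = pd.read_csv("<the swiss.csv URL above>")
# The last three rows here are made up for illustration only.
swiss = pd.DataFrame({
    "Fertility":        [80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4],
    "Agriculture":      [17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8],
    "Examination":      [15, 6, 5, 12, 17, 9, 16, 14],
    "Education":        [12, 9, 5, 7, 15, 7, 7, 8],
    "Catholic":         [9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16],
    "Infant_Mortality": [22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24.9],
})

# Infant_Mortality is the target; every other column is a predictor
X = swiss.drop(columns="Infant_Mortality")
y = swiss["Infant_Mortality"]

# 75/25 split with random_state=0, as specified
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
```

On the real 47-row data set, the same code yields the (35, 5) / (12, 5) shapes shown above.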

b) (1-point) Decision Tree Regressor

Create a Decision Tree Regressor estimator with a random state of zero and then fit the estimator using the training data.

c) (1-point) Tree Depth

Present the depth of the tree. Your answer does not need to be identical to my answer below.

d) (1-point) RMSE

Present the RMSE (Root Mean Squared Error) for both training and testing data. Your answer does not need to be identical to the below output but it shouldn’t be far off.

Observed Tree Depth: 11
Train RMSE: 0.0
Test  RMSE: 4.375
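Parts b) through d) can be sketched as follows. The random arrays stand in for the real training/testing split, so the printed depth and RMSE values will differ from those shown above, but a fully grown tree still gives a training RMSE of 0.0:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Toy stand-in data with the same shapes as the real split
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(35, 5)), rng.normal(size=35)
X_test, y_test = rng.normal(size=(12, 5)), rng.normal(size=12)

# b) fit an unconstrained tree with random_state=0
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# c) depth of the fitted tree
print("Observed Tree Depth:", tree.get_depth())

# d) RMSE on training and testing data (np.sqrt of the MSE works on any
# scikit-learn version; newer versions also offer root_mean_squared_error)
train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
print("Train RMSE:", round(train_rmse, 3))
print("Test  RMSE:", round(test_rmse, 3))
```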

e) (3-points) Tree Depth Experiment

Decision Trees Regressors in scikit-learn have an argument max_depth that allows you to specify the maximum depth of any tree you build.

Using the training and testing data from above, build a series of single Decision Tree estimators with maximum depths varying from 1, 2, 3, …, 20. A for loop will be helpful here.

As you loop through the fitting process of the Decision Tree Regressor with varying maximum depth values, keep track of the training and test RMSE values emerging from each prediction. You could keep this information in lists or a data frame. In my case I have the following RMSEs.


Train RMSE: [2.231, 1.955, 1.731, 1.428, 0.918, 0.655, 0.337, 0.154, 0.065, 0.024, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


Test  RMSE: [3.521, 3.304, 3.258, 4.033, 3.84, 4.436, 3.712, 3.805, 4.443, 3.886, 3.805, 3.756, 3.818, 3.847, 3.817, 4.365, 3.834, 3.886, 3.921]
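A sketch of the loop, again with stand-in arrays in place of the real split:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(35, 5)), rng.normal(size=35)
X_test, y_test = rng.normal(size=(12, 5)), rng.normal(size=12)

train_rmse, test_rmse = [], []
for depth in range(1, 21):                     # max_depth = 1, 2, ..., 20
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_rmse.append(round(np.sqrt(mean_squared_error(y_train, tree.predict(X_train))), 3))
    test_rmse.append(round(np.sqrt(mean_squared_error(y_test, tree.predict(X_test))), 3))

print("Train RMSE:", train_rmse)
print("Test  RMSE:", test_rmse)
```

Because each deeper tree only refines the splits of the shallower one, the training RMSE never increases as max_depth grows.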

f) (2-points) Boxplot

Create boxplots of the Training and Testing RMSE

g) (1-point) RMSE Plot

Create a scatter plot of the training and testing RMSE lists vs. the max_depth values.
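The boxplots from f) and the scatter plot from g) can be sketched roughly as below. The RMSE values here are a short made-up slice for illustration; substitute your full 20-element lists and depth range:

```python
import matplotlib
matplotlib.use("Agg")          # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt

# Made-up example values; replace with your lists from part e)
depths = range(1, 6)
train_rmse = [2.231, 1.955, 1.731, 1.428, 0.918]
test_rmse = [3.521, 3.304, 3.258, 4.033, 3.84]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# f) boxplots of the two RMSE distributions
ax1.boxplot([train_rmse, test_rmse])
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Train", "Test"])
ax1.set_ylabel("RMSE")

# g) scatter plot of RMSE vs max_depth
ax2.scatter(depths, train_rmse, label="Train")
ax2.scatter(depths, test_rmse, label="Test")
ax2.set_xlabel("max_depth")
ax2.set_ylabel("RMSE")
ax2.legend()

fig.tight_layout()
```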

h) (2-points) Overfitting vs Underfitting?

Describe what is happening in the plot in terms of fitting (over or under). Does the model generalize well to the test data? How closely is the model following the training data?

i) (3-points) Preventing Overtraining

There are various arguments for a Decision Tree estimator that might keep it from being overfit to the training data. For example, consider the following Decision Tree Regressor plot. Note that this is just an example.

Focus on the bottom row, which contains leaf nodes, a term used to describe the terminal nodes of the tree where no further splits are made. These nodes represent the final decision or prediction that the model makes.

You’ll notice some leaf nodes have only 2 samples, suggesting the tree may be overly tailored to some of the training data. A more general model would usually have more than 2 samples in a terminal node.

Check the documentation for Decision Tree regression for a parameter that allows you to set a minimum number of samples for leaf nodes. Here are some of the parameters:

min_impurity_decrease
min_samples_leaf
min_samples_split
min_weight_fraction_leaf

You will set this argument to obtain a more realistic (not perfect) training RMSE.

Repeat the previous experiment from question 1e) with the additional parameter you identify, set to a specific value.

As in 1g), plot the RMSE for both training and testing. The new plot should show a higher Training RMSE, though it may not be identical to that below. But it should definitely NOT be 0.0 or close to it.
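A sketch of the constrained tree, assuming min_samples_leaf is the parameter chosen (one of the candidates listed above) and using stand-in arrays. The key point is that the training RMSE is no longer zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(35, 5)), rng.normal(size=35)

# min_samples_leaf=5 is one reasonable choice: every leaf must keep at
# least 5 training samples, so the tree can no longer memorize each row
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
print("Train RMSE:", round(train_rmse, 3))   # strictly greater than 0.0
```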

2. Classification (15 Points)

Next we’ll work with classification as it relates to some Appendicitis data. This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. There are missing values in this dataset which will need to be addressed.

Introduction

Let’s start by reading the file into a data frame called app. Here is the URL. The first 5 rows will look like the following:

https://raw.githubusercontent.com/steviep42/bios_534/main/data/append.csv

Age Height Alvarado_Score Weight WBC_Count CRP Body_Temperature RDW Diagnosis
0 12.68 148.0 4.0 37.0 7.7 0.0 37.0 12.2 1
1 14.10 147.0 5.0 69.5 8.1 3.0 36.9 12.7 0
2 14.14 163.0 5.0 62.0 13.2 3.0 36.6 12.2 0
3 16.37 165.0 7.0 56.0 11.4 0.0 36.0 13.2 0
4 11.08 163.0 5.0 45.0 8.1 0.0 36.9 13.6 1

a) (1-point) Missing Values

Determine the number of missing values per column.

Age                  0
Height              25
Alvarado_Score      50
Weight               2
WBC_Count            4
CRP                  9
Body_Temperature     5
RDW                 24
Diagnosis            0
dtype: int64
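Counting missing values per column is a one-liner with pandas; a tiny made-up frame illustrates it:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for: app = pd.read_csv("<the append.csv URL above>")
app = pd.DataFrame({
    "Age":       [12.68, 14.10, 14.14],
    "Height":    [148.0, np.nan, 163.0],
    "Diagnosis": [1, 0, 0],
})

missing_per_column = app.isna().sum()
print(missing_per_column)
```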

b) (2-points) Imputation

  1. Next, use the SimpleImputer class (from sklearn.impute) to replace missing values in each column with the most frequently occurring value in that column.
  2. Save the result to a new data frame called app_imputed
  3. Verify that all missing values have been replaced.
  4. You can also now check the mean value of app_imputed.Alvarado_Score to make sure you used the right replacement strategy.
Age                 0
Height              0
Alvarado_Score      0
Weight              0
WBC_Count           0
CRP                 0
Body_Temperature    0
RDW                 0
Diagnosis           0
dtype: int64 

app_imputed Alvarado_Score mean value: 5.863
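A sketch of the imputation on a tiny made-up frame; strategy="most_frequent" is what replaces each NaN with the column’s most common value:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with one missing value per column
app = pd.DataFrame({
    "Alvarado_Score": [4.0, 5.0, 5.0, np.nan, 7.0],
    "Weight":         [37.0, 69.5, np.nan, 56.0, 45.0],
})

imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
app_imputed = pd.DataFrame(imputer.fit_transform(app), columns=app.columns)

print("Remaining NaNs:", app_imputed.isna().sum().sum())
print("Alvarado_Score mean:", app_imputed["Alvarado_Score"].mean())
```

Here the NaN in Alvarado_Score is replaced by the mode (5.0), so the column mean becomes 5.2; checking the mean on the real data the same way confirms the replacement strategy.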

c) (1-point) Training Data

Using app_imputed create new X and y objects and create a 75/25 training and testing split with a random_state=0

X_train shape: (585, 8)
X_test shape:  (195, 8)

d) (1-point) Random Forest

Build a RandomForestClassifier using the above created training and testing data. 1) Make sure that the minimum number of samples in a leaf node is 5 and set random_state=0. 2) Print the classification report for both the training and testing data.

Classification Report for Training Data

              precision    recall  f1-score   support

           0       0.86      0.77      0.81       238
           1       0.85      0.92      0.88       347

    accuracy                           0.86       585
   macro avg       0.86      0.84      0.85       585
weighted avg       0.86      0.86      0.85       585

Classification Report for Testing Data

              precision    recall  f1-score   support

           0       0.70      0.49      0.58        79
           1       0.71      0.85      0.78       116

    accuracy                           0.71       195
   macro avg       0.70      0.67      0.68       195
weighted avg       0.71      0.71      0.70       195
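A sketch of the forest and reports, with make_classification standing in for the imputed appendicitis data (780 rows by 8 features, so the split shapes match the ones shown in part c):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for app_imputed: 780 rows, 8 predictors, binary target
X, y = make_classification(n_samples=780, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# min_samples_leaf=5 per the problem statement; random_state=0 for reproducibility
rf = RandomForestClassifier(min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)

print("Classification Report for Training Data")
print(classification_report(y_train, rf.predict(X_train)))
print("Classification Report for Testing Data")
print(classification_report(y_test, rf.predict(X_test)))
```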

e) (2-points) Recall

The classification_report function returns a lot of information, which might be too much if we are interested only in the recall for each class we are predicting (0 or 1).

In the above example, focus on the Training Data report and you will see a recall for each class (0 and 1) of approximately 0.77 and 0.92 respectively. In the Testing Data the recall for each class (0 and 1) is 0.49 and 0.85 respectively. We want just this information for both the training and testing data.

Check the sklearn documentation for a metrics function that can provide this information more directly. We want this info for both training and testing. Do NOT attempt to parse the text returned by the classification_report function.

Relative to the above Random Forest estimator:

  1. Print out only the recall for classes 0 and 1 for Training and Testing data sets
  2. Round the result to 2 decimal places.
  3. Check your result for this question against the classification reports in 2d)
Training:

Class 0: 0.77
Class 1: 0.92


Testing:

Class 0: 0.49
Class 1: 0.85
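recall_score with average=None is one metrics function that returns exactly one recall per class; a toy example with made-up labels:

```python
from sklearn.metrics import recall_score

# Toy labels; in the homework y_true/y_pred come from your fitted forest
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# average=None -> one recall per class, in sorted label order (0, then 1)
per_class = recall_score(y_true, y_pred, average=None)
for label, r in zip([0, 1], per_class):
    print(f"Class {label}: {round(r, 2)}")   # 0.67 for class 0, 0.75 for class 1
```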

f) (1-point) Variable Importance

Create a horizontal bar plot of the feature importances in decreasing order.
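A sketch of the importance plot, again with synthetic stand-in data and the appendicitis column names; rf.feature_importances_ supplies the values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; use your fitted forest and app_imputed columns instead
feature_names = ["Age", "Height", "Alvarado_Score", "Weight",
                 "WBC_Count", "CRP", "Body_Temperature", "RDW"]
X, y = make_classification(n_samples=780, n_features=8, random_state=0)
rf = RandomForestClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

# Sort ascending so the largest importance lands at the top of the barh plot
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values()
ax = importances.plot.barh()
ax.set_xlabel("Feature importance")
plt.tight_layout()
```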

g) (2-points) Dropping Variables

Based on the feature importance plot above:

  1. Create a new version of X (call it Xnew) that drops the 3 variables with the lowest importance.
  2. Using Xnew and the existing y, regenerate a new training and testing split 75/25 (random_state=0)
  3. Repeat the random forest creation in 2d) (make sure min_samples_leaf=5 and random_state=0)
  4. Print out the class recall values for the training and testing data.
Training:

Class 0: 0.76
Class 1: 0.91


Testing:

Class 0: 0.48
Class 1: 0.84

h) (2-points) Effect

What impact did dropping the three least important variables have on the recall for both the training and testing data? In particular, what effect, if any, did dropping them have on the model?

i) (2-points) Scaling

Let’s use a different method to see if we can observe a comparable or even better result. This might not be possible, but we’ll never know unless we try. Support Vector Machines are not “scale invariant”, so we have to scale the training and test data. Look at the notebooks for guidance; basically, you can use the StandardScaler to do this. Using the X_train and X_test data from above:

  1. Create a scaler transformer
  2. Do a fit_transform on X_train and save it into a new object called X_train_scaled
  3. Do a transform (not fit_transform) on X_test and save it into a new object called X_test_scaled
X_train_scaled - First 5 rows 

[[-1.98581692  0.53260711 -1.59371777  0.3733249  -0.3929034 ]
 [ 1.05686742 -0.40870757  0.19414116 -0.73954024 -0.553926  ]
 [ 0.02505154  1.47392179 -0.82168778  1.14667187  2.30869799]
 [-2.43430507  0.06194977 -1.77946935  0.54308399 -0.4107948 ]
 [ 0.68459294  0.53260711  0.60047273 -0.30571145 -0.464469  ]] 

X_test_scaled - First 5 rows 

[[-0.7539286   1.47392179 -0.59530304  0.78829157 -0.285555  ]
 [-0.57175092  1.47392179 -0.1309241  -0.02277964 -0.4286862 ]
 [ 0.52044042 -1.82067959  0.8326622  -0.47547054 -0.4286862 ]
 [ 1.32361528  1.00326445  0.54242537  0.84487793 -0.464469  ]
 [-1.36948052 -0.87936491 -1.1409483  -1.17336902 -0.553926  ]]
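A sketch of the scaling step on random stand-in data; the important detail is fit_transform on the training data and plain transform on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(6, 3))   # stand-in features
X_test = rng.normal(loc=50, scale=10, size=(2, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn means/scales here only
X_test_scaled = scaler.transform(X_test)         # reuse the training means/scales

# Each training column now has mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training data only (and merely transforming the test data) avoids leaking test-set statistics into the model.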

j) (1-point) Support Vector Machine

  1. Fit the SVC classifier with a random_state=0 on the scaled training data
  2. Get the recall info for both classes (0 and 1) for the Training and Testing Data
  3. Print it out
Training Recall:

Class 0: 0.64
Class 1: 0.8

Testing Recall:

Class 0: 0.58
Class 1: 0.84
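Putting the scaling and SVC steps together on synthetic stand-in data (the recall numbers printed will differ from the ones above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced appendicitis data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svc = SVC(random_state=0).fit(X_train_scaled, y_train)

for name, Xs, ys in [("Training", X_train_scaled, y_train),
                     ("Testing", X_test_scaled, y_test)]:
    r0, r1 = recall_score(ys, svc.predict(Xs), average=None)
    print(f"{name} Recall:  Class 0: {round(r0, 2)}  Class 1: {round(r1, 2)}")
```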