My responsibilities in the team are:
- Querying and merging raw data from Google BigQuery.
- Performing exploratory data analysis and developing baseline models as benchmarks (cooperating with Heather Qiu).
- Building the gradient boosting and adaptive boosting models and performing hyperparameter tuning.
- Designing the overall dashboard layout, specifically the home page, the EDA page (cooperating with Mengyue Han), and the calendar page.
- Deploying the dashboard to Google Cloud Platform.
Techniques learned:
- Serverless Cloud Platform (BigQuery, command line tool, dash deployment)
- Preprocessing packages: pandas, missingno
- Machine Learning Techniques: xgboost, optuna, sklearn
- Dashboard Design: dash, plotly
Dashboard: https://bios823-mimic-dashboard.ue.r.appspot.com/
Background
After the outbreak of COVID-19, hospitals around the world are facing shortages of key resources, so accurate prediction of healthcare resource use is more urgently needed than ever. However, care managers cannot make precise predictions about these resources manually, and this is where data mining techniques can help.
The discharge location, especially for patients discharged to skilled nursing facilities and other hospitals, is an important indicator of the cost of care and hospital resource utilization. Thus, we developed a model to classify and predict discharge locations and thereby provide information to help care managers allocate key healthcare resources.
The research question is to predict the discharge location at the individual level. The audience for this project is care managers and clinicians who need information to allocate key hospital resources and manage hospital capacity.
The specific research aims are:
- Predicting the discharge location of each patient accurately based on electronic health records
- Identifying the important variables associated with discharge locations
- Displaying the model development and predictions on a dashboard and deploying the model on Google Cloud Platform
Exploratory Data Analysis
This section covers the querying of the raw dataset and the exploratory data analysis, including imputation of missing values and encoding of categorical variables.
Querying and Merging Raw Dataset
We extracted a total of 46,520 observations from the MIMIC database, comprising patients admitted to intensive care units from June 2001 to October 2012. Only the first ICU stay was kept if a patient had multiple ICU stay records. The predictors comprised demographic data, vital signs, and laboratory test results. The data preprocessing steps were as follows (a sketch of the querying step is shown after the list):
- Extracting the mean, max, and min of the vital signs recorded by Metavision or CareVue on the first day of admission for each observation, such as heart rate, glucose, and respiratory rate, from the tables `INPUTEVENTS_CV` and `INPUTEVENTS_MV`. The definitions of all items are retrieved from the table `d_items`.
- Combining the tables `ADMISSIONS`, `ICUSTAYS`, and `PATIENTS` to get demographic data, such as age, admission type, religion, and ethnicity.
  - Note: this dataset was deidentified, so the ages of patients older than 89 years were transformed to arbitrarily large numbers. Here, we reset the age of all patients older than 89 to 89 years.
  - Note: the admission times were also shifted to prevent data users from identifying patients.
- Extracting the laboratory test results, such as white blood cell count and blood urea nitrogen, from the table `LABEVENTS` on the first day of admission for each observation.
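To illustrate the querying step referenced above, the demographic tables could be pulled and merged from a BigQuery mirror of MIMIC roughly as follows. This is a hedged sketch: the dataset path `physionet-data.mimiciii_clinical`, the selected columns, and the first-stay filter are written from the description above, not copied from the project's actual queries.

```python
# Sketch only: dataset path and columns are assumptions, not the project's exact query.
from google.cloud import bigquery

client = bigquery.Client()  # uses the active gcloud project and credentials

query = """
SELECT a.subject_id, a.hadm_id, a.admission_type, a.religion, a.ethnicity,
       a.discharge_location, i.icustay_id, i.intime, p.gender, p.dob
FROM `physionet-data.mimiciii_clinical.admissions` AS a
JOIN `physionet-data.mimiciii_clinical.icustays` AS i USING (subject_id, hadm_id)
JOIN `physionet-data.mimiciii_clinical.patients` AS p USING (subject_id)
"""
demographics = client.query(query).to_dataframe()

# keep only the first ICU stay per patient, as described above
demographics = (demographics
                .sort_values(['subject_id', 'intime'])
                .drop_duplicates('subject_id', keep='first'))
```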
Target Variables
This figure shows a comparison of the target variable in the raw data and the processed data. We combined patients discharged to home, home health care, and other locations that consume very limited resources into one group, and combined all other facilities, such as short-term and long-term hospitals, into another group that consumes a large amount of resources.
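A minimal sketch of this regrouping with pandas, assuming the raw label is stored in a DISCHARGE_LOCATION column; the category names are illustrative, not the project's exact mapping.

```python
# df is assumed to be the merged dataset built in the previous section.
# Hypothetical sketch: column name and category names are assumptions.
home_like = ['HOME', 'HOME HEALTH CARE']                          # limited resource use
facilities = ['SHORT TERM HOSPITAL', 'LONG TERM CARE HOSPITAL']   # heavy resource use

df['target'] = df['DISCHARGE_LOCATION'].copy()
df.loc[df['target'].isin(home_like), 'target'] = 'HOME'
df.loc[df['target'].isin(facilities), 'target'] = 'OTHERS'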
Numerical Variables
All numerical variables were imputed by single imputation. We attempted iterative imputation, but the algorithm did not converge on this dataset. Variables with more than 50% missingness were removed. This figure demonstrates the correlation between the target variable and selected numerical variables; most numerical variables are not highly linearly correlated with the outcome.
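A sketch of these two steps with pandas and scikit-learn, assuming num_df holds the numerical predictors; the median strategy shown is an assumption, since the write-up only specifies single imputation.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# num_df is assumed to hold the numerical predictors
# drop columns with more than 50% missing values
keep = num_df.columns[num_df.isna().mean() <= 0.5]
num_df = num_df[keep]

# single imputation (median here; the exact strategy is an assumption)
imputer = SimpleImputer(strategy='median')
num_imputed = pd.DataFrame(imputer.fit_transform(num_df),
                           columns=num_df.columns,
                           index=num_df.index)
```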
Categorical Variables
We applied one-hot encoding to the categorical variables and removed the levels with infrequent occurrences. For example, we kept only 9 diagnoses out of more than 10,000 distinct diagnoses, since most of them each accounted for less than 1% of the sample.
We used cross tables to examine the association between these variables and the target variable. For example, patients admitted urgently are more likely to die in the hospital than those admitted electively, which is consistent with clinical knowledge, and newborns are almost never discharged to skilled nursing facilities.
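A sketch of this encoding with pandas; the column names and the 1% frequency threshold for diagnoses are assumptions based on the description above.

```python
import pandas as pd

# collapse rare diagnoses into an 'OTHER' level (names/threshold are assumptions)
freq = df['DIAGNOSIS'].value_counts(normalize=True)
common = freq[freq >= 0.01].index
df['DIAGNOSIS'] = df['DIAGNOSIS'].where(df['DIAGNOSIS'].isin(common), 'OTHER')

# one-hot encode the categorical predictors
cat_cols = ['DIAGNOSIS', 'ADMISSION_TYPE', 'ETHNICITY', 'RELIGION']
df_encoded = pd.get_dummies(df, columns=cat_cols)

# cross table of admission type against the target, as described above
print(pd.crosstab(df['ADMISSION_TYPE'], df['target'], normalize='index'))
```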
Development of Predictive Modeling
A dummy classifier and logistic regression were applied as baseline methods, and four different machine learning algorithms were applied in an attempt to improve predictive ability. The performance of these methods is shown in the table below.
Model | Accuracy | F1 (Weighted) | Recall (SNF) | Recall (OTHERS) |
---|---|---|---|---|
Dummy Classifier | 0.59 | 0.43 | 0 | 0 |
Logistic Regression | 0.65 | 0.59 | 0.26 | 0.16 |
KNN | 0.60 | 0.53 | 0.20 | 0.14 |
SVM | 0.66 | 0.62 | 0.29 | 0.22 |
Random Forest | 0.68 | 0.64 | 0.30 | 0.29 |
Gradient Boosting | 0.68 | 0.66 | 0.34 | 0.32 |
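For reference, the evaluation metrics reported in this table (overall accuracy, weighted F1, and per-class recall) could be computed with scikit-learn roughly as follows; y_test, y_pred, and the class labels are placeholder names.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# y_test are the true labels, y_pred the model's predictions (placeholders)
accuracy = accuracy_score(y_test, y_pred)
f1_weighted = f1_score(y_test, y_pred, average='weighted')

# per-class recall for the two resource-intensive groups (label names assumed)
recall_snf, recall_others = recall_score(y_test, y_pred, average=None,
                                         labels=['SNF', 'OTHERS'])
print(accuracy, f1_weighted, recall_snf, recall_others)
```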
Gradient boosting beats all the other models on every evaluation metric. Patients discharged to skilled nursing facilities and other hospital facilities consume more hospital resources, so we focus on predictive accuracy for these two groups. Below, we walk through how to build a gradient boosting model in Python.
Mechanisms of Gradient Boosting
The two main components of gradient boosting are decision tree ensembles and additive learning. The tree ensemble model sums the predictions of multiple trees, which compensates for the high variance of a single decision tree. In other words, it combines a set of weak learners into a strong learner. The objective of gradient boosting is to minimize the loss of the model by adding weak learners using a gradient descent procedure. Each new weak learner aims to reduce the residual loss, while the previous trees remain unchanged. Since each new weak learner focuses on the residual loss, this is possibly the reason it performs best on our imbalanced dataset.
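To make the additive learning concrete, scikit-learn's GradientBoostingClassifier (used here purely as an illustration, not the project's xgboost model) can report predictions after each boosting stage, showing accuracy improve as trees are added; X_train, y_train, X_test, and y_test are assumed to be an existing train/test split.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# fit an ensemble of 100 shallow trees, added one at a time
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each additional tree,
# so we can watch the weak learners accumulate into a strong learner
for i, stage_pred in enumerate(gbm.staged_predict(X_test), start=1):
    if i % 20 == 0:
        print(f"trees: {i:3d}  test accuracy: {accuracy_score(y_test, stage_pred):.3f}")
```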
Hyperparameter Tuning
The objective of hyperparameter tuning is to improve the predictive ability of our model and control overfitting. The two most important parameters in gradient boosting are the learning rate and the number of rounds. Three additional parameters can control overfitting:
- Maximum depth
- Minimum child weight
- Gamma
First, let's try to control overfitting by limiting the maximum depth of the trees. We need to set a search range for every hyperparameter: the learning rate is searched between 0.01 and 0.5, and the number of rounds between 2 and 1,000. If we use a smaller learning rate, the number of rounds needs to be increased. Here, the objective is set to multi:softmax since the target variable is a multi-level categorical variable.
import xgboost as xgb
from sklearn.model_selection import cross_val_score

class Objective(object):
    """Optuna objective: mean 5-fold CV accuracy of an XGBoost classifier."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __call__(self, trial):
        X, y = self.X, self.y  # load data once only
        # search space for the main hyperparameters
        learning_rate = trial.suggest_float('learning_rate', 0.01, 0.5)
        colsample_bytree = trial.suggest_float('colsample_bytree', 0.2, 1)
        n_estimators = trial.suggest_int('n_estimators', 2, 1000, log=True)
        max_depth = trial.suggest_int('max_depth', 1, 10)
        clf = xgb.XGBClassifier(
            objective='multi:softmax',
            learning_rate=learning_rate,
            n_estimators=n_estimators,
            max_depth=max_depth,
            colsample_bytree=colsample_bytree,
            random_state=123
        )
        # mean accuracy across 5-fold cross validation
        score = cross_val_score(clf, X, y, n_jobs=-1, cv=5).mean()
        return score
The overall accuracy from 5-fold cross validation was the evaluation metric in the hyperparameter tuning process. The best value after 200 trials was 0.69, the highest overall accuracy among all the models. However, this model also suffered from overfitting: the accuracy on the training dataset was 0.85.
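For completeness, the search itself is driven by an Optuna study; below is a minimal sketch of the 200-trial run described above, assuming X_train and y_train hold the training data.

```python
import optuna

# maximize the mean 5-fold cross-validation accuracy returned by Objective
study = optuna.create_study(direction='maximize')
study.optimize(Objective(X_train, y_train), n_trials=200)

print(study.best_value)    # best cross-validated accuracy (0.69 in our run)
print(study.best_params)   # tuned learning_rate, n_estimators, max_depth, colsample_bytree
```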
Thus, let's try another parameter to control overfitting: gamma. Gamma is the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. [1] In other words, the larger gamma is, the less likely the algorithm is to overfit.
Model | Training Accuracy | Testing Accuracy |
---|---|---|
Gradient Boosting-Max Depth | 0.85 | 0.69 |
Gradient Boosting-Gamma | 0.77 | 0.68 |
Adaptive Boosting | 0.67 | 0.66 |
Although the testing accuracy decreased by 1% when we controlled overfitting with gamma, the training accuracy decreased by 8%, which indicates the overfitting problem was alleviated to some degree. We also tried adaptive boosting (AdaBoost), but gradient boosting beat it in terms of testing accuracy.
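For reference, gamma can be folded into the same Optuna objective; the snippet below is a sketch, and the search range of 0 to 5 is an assumption rather than the range we actually used.

```python
# inside Objective.__call__, extend the search space with a gamma term
gamma = trial.suggest_float('gamma', 0.0, 5.0)  # range is an assumption

clf = xgb.XGBClassifier(
    objective='multi:softmax',
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth,
    colsample_bytree=colsample_bytree,
    gamma=gamma,            # larger gamma -> more conservative splits
    random_state=123,
)
```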
Dashboard Design
After building the predictive model, we decided to present the overall model development process and the model's predictions on a dashboard. Here, we use the dash package to build it.
There are five parts of this dashboard:
- Home page: links to the dataset, GitHub repository, and GitHub pages
- EDA page: visualization of the original dataset
- Modelling page: comparison of different models and identification of important variables
- Calendar page: prediction results displayed to care managers
- Online ML deployment: interactive prediction page
Multi-page Design
Based on the design of the content, a multi-page layout fits best. The key step for a multi-page dashboard is the file directory structure.
-- app.py
-- index.py
-- apps
   -- home.py
   -- EDA.py
   -- Models.py
   -- calendar.py
   -- deploy.py
The app.py file creates the server and the style of the dashboard. The other files import the app object from app.py.
import dash
import dash_bootstrap_components as dbc
# bootstrap theme
# https://bootswatch.com/lux/
external_stylesheets = [dbc.themes.LUX]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets, suppress_callback_exceptions=True)
server = app.server
The index.py file creates the navigation bar used to switch between pages and builds the path to each page:
- Importing all pages in the app from the apps directory
- Building the navigation bar
- Creating the dropdown menu
- Combining the navigation bar header and the dropdown menu
- Creating the layout design
- Putting the navigation bar as the header
- Writing a callback function to switch between the different pages
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output, State
import dash_bootstrap_components as dbc
import dash
from app import app, server
# import all pages in the app
from apps import home, EDA, calendar, Models, deploy_app
# building the navigation bar
# https://github.com/facultyai/dash-bootstrap-components/blob/master/examples/advanced-component-usage/Navbars.py
dropdown = dbc.DropdownMenu(
children=[
dbc.DropdownMenuItem("Home",
href="/haha"
),
dbc.DropdownMenuItem("EDA",
href="/EDA"
),
dbc.DropdownMenuItem("Models",
href="/Models"
),
dbc.DropdownMenuItem("Calendar",
href="/calendar"
),
dbc.DropdownMenuItem("Online ML",
href="/deploy_app"
)
],
nav = True,
in_navbar = True,
label = "Explore",
)
navbar = dbc.Navbar(
dbc.Container(
[
html.A(
dbc.Row(
[
dbc.Col(dbc.NavbarBrand("MIMIC Predictive Modelling", className="ml-2")),
],
align="center",
no_gutters=True,
),
href="/home",
),
dbc.NavbarToggler(id="navbar-toggler2"),
dbc.Collapse(
dbc.Nav(
# right align dropdown menu with ml-auto className
[dropdown], className="ml-auto", navbar=True
),
id="navbar-collapse2",
navbar=True,
),
]
),
color="dark",
dark=True,
className="mb-4",
)
# embedding the navigation bar
app.layout = html.Div([
dcc.Location(id='url', refresh=False),
navbar,
html.Div(id='page-content')
])
# controlling which page to display
@app.callback(Output('page-content', 'children'),
[Input('url', 'pathname')])
def display_page(pathname):
if pathname == '/EDA':
return EDA.layout
elif pathname == '/Models':
return Models.layout
elif pathname == '/calendar':
return calendar.layout
elif pathname == '/deploy_app':
return deploy_app.layout
else:
return home.layout
if __name__ == '__main__':
app.run_server(debug=True)
Calendar Page
The calendar page is designed for users with no knowledge of machine learning algorithms. The prediction is based on data collected at the admission stage (i.e., the first day of admission). Care managers can see the predicted discharge locations directly as a time series. For example, in January 2101, three patients are expected to be discharged to skilled nursing facilities.
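Conceptually, the calendar view reduces to counting predicted discharge locations per date. The following is a hypothetical sketch of that aggregation with pandas; the column names and dates are made up for illustration and are not the project's actual code.

```python
import pandas as pd

# hypothetical prediction table: one row per patient, with a predicted
# discharge location and an expected discharge date
preds = pd.DataFrame({
    'expected_discharge_date': pd.to_datetime(['2101-01-03', '2101-01-03', '2101-01-05']),
    'predicted_location': ['SNF', 'SNF', 'HOME'],
})

# counts per date and location, which the calendar page can then render
calendar_counts = (preds
                   .groupby(['expected_discharge_date', 'predicted_location'])
                   .size()
                   .unstack(fill_value=0))
print(calendar_counts)
```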
Deployment to Google Cloud Platform
Our final step is to deploy the dashboard to Google Cloud Platform so that everyone can see the results of our model. We only need three steps to close out the project!
Step 1: Set up your Google Cloud Platform Project
- Install the GCP SDK tools
- Create a new project by selecting option [3] and choosing the correct region
- Check that your project is active:

$ gcloud config get-value project
bios823-mimic-dashboard
Step 2: Set up the Project Folder
- Ensure you import server from app.py in index.py:
from app import app, server
- Create the app.yaml file
```
service: default
runtime: python37
instance_class: F4_HIGHMEM
entrypoint: gunicorn -b :$PORT index:server --timeout 120
```
This is the most important file for deploying the dashboard to GCP.
- runtime: the name of the runtime environment used by your app. Here, I use Python 3.7.
- instance_class: the instance class for this service.
The entrypoint tells gunicorn to look for server in index.py, which is why it is important to import server into index.py. This version of gunicorn will not resolve an attribute such as app.server in the entrypoint, so the Flask server object must be exposed directly as a module-level server variable.
- Create requirements.txt to specify the packages you need.
  - The first method is to use pip freeze:
    $ pip freeze > requirements.txt
  - Or you can write it manually; a sketch of a hand-written file follows this list.
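For example, a hand-written requirements.txt for this kind of dashboard might look like the following; the package list is an assumption based on the libraries used in this write-up, and pinned versions would normally be added.

```
dash
dash-bootstrap-components
dash-core-components
dash-html-components
plotly
pandas
scikit-learn
xgboost
gunicorn
```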
Step 3: Deploy the Model to Google Cloud
- Upload and deploy all your files:
$ gcloud app deploy
Note: if you get an "Access Not Configured" error, just follow the link in the error message and enable the service.
- View your dashboard:
$ gcloud app browse
References
1. https://xgboost.readthedocs.io/en/latest/
2. https://towardsdatascience.com/beginners-guide-to-building-a-multi-page-dashboard-using-dash-5d06dbfc7599
3. https://datasciencecampus.github.io/deploy-dash-with-gcp/
4. https://cloud.google.com/appengine/docs/standard/python/config/appref