Introduction to Basic Statistical Modeling with Healthcare Data

In recent years, the fields of data science and machine learning (ML) have gained significant attention, revolutionizing the way information is processed and decisions are made. Today we will cover some basics of data science and ML and explore their practical applications in healthcare.

Understanding Data Science

Data science is the practice of making data more meaningful and informative to facilitate better decision-making. Historically, data scientists employed statistical analysis, particularly linear regression, to identify patterns and make predictions based on historical data. Visualizing complex, high-dimensional data is difficult with traditional scatter plots, but regression analysis can handle n dimensions, laying the foundation for more informed decision-making.

Machine learning, a more recent addition to the data scientist's toolkit, enhances the capabilities of statistical analysis by applying techniques specifically developed for ML. ML excels at handling high-dimensional data, enabling the identification of patterns in complex data sets that humans cannot recognize on their own. It's crucial to note that ML models are not perfect; their aim is to be as accurate as possible, which emphasizes the need for collaboration between ML and human oversight.

ML empowers computers to find correlations, make predictions, and discover solutions. Algorithms utilize statistical analysis in various ways to achieve these results, and ML is designed to complement rather than replace human roles. Its applications span diverse fields, from medicine, where it aids in diagnosis, to image processing, identifying anomalies like tumors.

An integral part of data science is preprocessing. Beyond basic extract, transform, load (ETL) processes, data scientists prepare data for statistical analysis by standardizing it, removing unnecessary information, and ensuring it is suitable for ML models. This step ensures that the data is refined, reducing noise and contributing to the accuracy of subsequent analyses.

Types of Algorithms:

Within ML there are three main types of algorithms:

Supervised Learning: Focuses on making predictions by training on existing data with known outcomes. Examples include linear regressions, artificial neural networks, and decision trees. In medicine, this can aid in predicting diagnoses based on patient lab data.

Unsupervised Learning: Seeks correlations in data without predefined outcomes. Algorithms such as k-means clustering and principal component analysis (PCA) are used for this purpose. In noise reduction, independent component analysis (ICA) can separate distinct audio streams recorded in a noisy room.

Reinforcement Learning: Employed when a computer needs to perform a series of tasks to achieve a desired result. Q-Learning is a notable reinforcement learning algorithm.
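As a minimal illustration of the first two categories (using scikit-learn and synthetic stand-in data, not the healthcare datasets discussed below), a supervised classifier and an unsupervised decomposition can each be fit in a few lines:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Synthetic stand-in data: 200 samples, 10 features, binary labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Supervised: train on data with known outcomes y, then predict them.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: find structure in X alone; no labels are used.
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
print("reduced shape:", X_2d.shape)  # (200, 2)
```

The key difference is visible in the calls: the classifier is fit on both `X` and `y`, while PCA sees only `X`.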

Practical Examples:

Forest cover type prediction

Source data can be found here 

We have a dataset that contains some geographical data about sections of land along with the dominant species of tree for each area. There are 54 different features along with a classification of the main tree for almost 600,000 recorded areas of land. Our goal is to build an ML model that can accurately predict the dominant species of tree for an area of land given some geographic data.

Models and Results:

Let’s start with Stochastic Gradient Descent. Without modifying any of the hyperparameters, we are able to generate a model that gets 71% accuracy on the training set. This is decent, but let's see if we can do better.
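A baseline like this might be trained with scikit-learn's `SGDClassifier`. The sketch below uses synthetic data shaped like the forest cover dataset (54 features, 7 classes) rather than the real data, so the accuracy it prints will differ from the 71% reported here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in mirroring the covertype shape: 54 features, 7 classes.
X, y = make_classification(n_samples=5000, n_features=54, n_informative=20,
                           n_classes=7, random_state=0)

# SGD is sensitive to feature scale, so pair it with standardization.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```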

Random Forest, Decision Tree, and KNN each perform well, with 100%, 98%, and 95% accuracy respectively. Near-perfect accuracy on the training data can be a cause for concern: it typically implies that the model was overfit and will not perform well on the test data. We lucked out in this case, as further testing of these three algorithms using cross-validation shows them performing at about 92-96%.
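This kind of cross-validation check can be run with `cross_val_score`, which reports accuracy on held-out folds rather than on the data the model was trained on. Again, synthetic stand-in data is used here in place of the forest cover dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 54 features, 7 classes.
X, y = make_classification(n_samples=2000, n_features=54, n_informative=20,
                           n_classes=7, random_state=0)

for name, model in [("Random Forest", RandomForestClassifier(random_state=0)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("KNN", KNeighborsClassifier())]:
    # Each score is accuracy on a fold the model did not train on.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean {scores.mean():.2f}, std {scores.std():.2f}")
```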

In this case, the Random Forest model is the preferred choice: it outperforms the other two models under cross-validation while keeping training and prediction times manageable.

Stochastic Gradient Descent: 71% accuracy in 41 seconds.

Decision Tree: 98% accuracy in 8 seconds.

Random Forest: 100% accuracy in 2 minutes.

Diabetes Readmittance Prediction

Source data can be found here

Taking a look at another data set, we have about 100,000 records available across 49 features. The goal here is to try to predict which diabetic patients are more likely to be readmitted.

Unlike with the previous dataset, we need to do some preprocessing before we can apply a model. We drop unnecessary columns, such as the patient number, because they have no effect on whether a patient will be readmitted. This takes us from 50 columns down to 38.
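With pandas, dropping identifier columns might look like the following sketch. The column names here are illustrative stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Toy frame standing in for the diabetes dataset; real column names may differ.
df = pd.DataFrame({
    "patient_nbr": [101, 102, 103],
    "encounter_id": [1, 2, 3],
    "num_lab_procedures": [41, 59, 11],
    "readmitted": [1, 0, 0],
})

# Identifiers carry no predictive signal, so drop them before modeling.
df = df.drop(columns=["patient_nbr", "encounter_id"])
print(df.columns.tolist())  # ['num_lab_procedures', 'readmitted']
```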

To gain insight into the data, we can look at correlations between all of the columns. This checks for collinearity. The correlation between two columns falls within the range [-1, 1]. If two columns are highly correlated (positively or negatively), it is common practice to either drop one of them or combine them into a single column, because multiple highly correlated columns give that shared signal an unfair weight when training the model, hurting accuracy.

Another reason to look at correlation is that it shows which columns will have a bigger effect on training the model and achieving higher accuracy. Conversely, columns that have almost zero correlation with the target are good candidates for dropping, as they add noise and cost time during training.
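Both checks come out of a single correlation matrix. A small sketch with fabricated data (the column names are invented for illustration) shows a nearly collinear pair and a pure-noise column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
df = pd.DataFrame({
    "feature_a": a,
    "feature_b": a * 2 + rng.normal(scale=0.1, size=n),  # nearly collinear with feature_a
    "noise": rng.normal(size=n),                          # unrelated to everything
    "target": (a > 0).astype(int),
})

corr = df.corr()
# How strongly each column tracks the target:
print(corr["target"].round(2))
# feature_a and feature_b are highly correlated -> drop or combine one of them.
print(round(corr.loc["feature_a", "feature_b"], 2))
```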

Additionally, we drop rows with incomplete data. These steps result in a final table containing just under 100,000 rows and 77 columns.

We want to make sure all possible targets have roughly equal representation in the data set, to help prevent bias when training the model. Since the target for this dataset is binary (readmitted or not), we want roughly a 50/50 split between the outcomes. This dataset is already well balanced at 46.5%/53.5%.
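Checking the class balance is a one-liner with pandas. The labels below are a tiny fabricated stand-in; the real dataset splits roughly 46.5%/53.5%:

```python
import pandas as pd

# Stand-in readmission labels (1 = readmitted, 0 = not).
y = pd.Series([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Fraction of rows in each class; values close to 0.5 mean a balanced target.
print(y.value_counts(normalize=True))
```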

Next we standardize the data. While this is not always necessary, ML models tend to be more accurate with standardized data because standardization puts all features on a comparable scale, so no single feature dominates simply because of its units.
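A sketch of standardization with scikit-learn's `StandardScaler`; the two example features are invented to show very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g., age in years vs. a lab value).
X = np.array([[25.0, 180.0],
              [60.0,  95.0],
              [45.0, 250.0],
              [30.0, 140.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Each column now has mean ~0 and unit variance, so neither feature dominates.
print(X_std.mean(axis=0).round(6))  # ~[0, 0]
print(X_std.std(axis=0).round(6))   # ~[1, 1]
```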

Finally, we split the data. When training a model, we want to use most of the data for training but hold some back to test the model on data it has not seen yet. This helps ensure the model is accurate and not overfitted. For this split, we use 80% of the data for training and 20% for testing.
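The 80/20 split described above maps directly onto scikit-learn's `train_test_split` (shown here on synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% of the rows; stratify keeps the class balance in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(len(X_train), len(X_test))  # 800 200
```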

Models and Results:

Random Forest looks promising, with 97% accuracy. When we test it with data the model hasn't seen yet, however, it scores only 58% accuracy. Given that this is binary classification, the model is little better than a coin flip. Such high accuracy during training paired with such low accuracy during testing is another sign of overfitting. Overfitting occurs when the model learns the training data too well and cannot accurately predict new cases it hasn't seen yet. It misses the forest for the trees.

By tweaking some hyperparameters we are able to get a Random Forest model that achieves 66% accuracy on the training data. When running the model against the test data we still achieve a 63% accuracy.
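One common way to rein in an overfit random forest is to limit tree depth and leaf size. The sketch below illustrates the effect on a noisy synthetic problem; the specific hyperparameter values are illustrative, not the ones used in this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A deliberately noisy problem (30% of labels randomized) to mimic hard data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unconstrained forest: fully grown trees memorize the training data.
deep = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained forest: shallower trees with larger leaves generalize better.
shallow = RandomForestClassifier(max_depth=5, min_samples_leaf=20,
                                 random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The constrained model's training accuracy drops, which is exactly the point: a smaller gap between training and test accuracy is the sign that overfitting has been reduced.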

Trying to tune the hyperparameters for several other algorithms yields roughly the same results with about a 62% accuracy on the test data.

In conclusion, while a 63% accuracy rate is not bad, we could probably do better with richer data. Some columns in the original data set that weren't used for the model include the CPT and ICD-10 codes assigned during the patient's visit. Grouping CPT codes into categories would help clarify what happened to the patient during the visit; the same goes for the provider's specialty. We might also look for other data features relevant to diabetic patients, such as body fat percentage (BFP) and blood pressure (BP).

Logistic Regression: 62% accuracy in 30 seconds.

Random Forest: Initially 97% accuracy, but overfitting revealed with 58% accuracy on test data.

After hyperparameter tuning: 66% accuracy on training, 63% accuracy on test data.

Conclusion: While achieving a 63% accuracy rate is commendable, further improvement could be made with additional relevant features, such as CPT and ICD10 codes.



