A Mini-Project Report On "Heart Disease Prediction"
[COMP 484] Machine Learning

Submitted by: Nirusha Manandhar (31), Sagun Lal Shrestha (53), Ruchi Tandukar (57)

Submitted to: Dr. Bal Krishna Bal, Associate Professor, Department of Computer Science and Engineering
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
Table 3: Comparison between the feature selection models after training and testing through the LogisticRegression model
Table 4: Work Division
Table 5: Major modules and classes used from Sklearn
According to the World Health Organization, 12 million deaths occur worldwide every year due to heart disease. The burden of cardiovascular disease has been rising rapidly across the world over the past few years. Many studies have been conducted in an attempt to pinpoint the most influential risk factors of heart disease and to accurately predict the overall risk. Heart disease is even called a silent killer because it can lead to death without obvious symptoms. Early diagnosis of heart disease plays a vital role in deciding on lifestyle changes for high-risk patients and, in turn, reduces complications. This project aims to predict future heart disease by analyzing patient data and classifying whether patients have heart disease or not using machine-learning algorithms.
The major challenge with heart disease is its detection. Instruments that can predict heart disease exist, but they are either expensive or not efficient at estimating a person's chance of developing it. Early detection of cardiac disease can decrease the mortality rate and overall complications. However, accurately monitoring patients every day is not feasible in all cases, and round-the-clock consultation with a doctor is not available, since it demands considerable wisdom, time, and expertise. Since a large amount of data is available in today's world, we can apply various machine learning algorithms to analyze it for hidden patterns, which can then be used for health diagnosis on medical data.
Machine learning techniques have been compared and used for analysis in many kinds of data science applications. The major motivation behind this research-based project was to explore the feature selection methods and the data preparation and processing that lie behind training models in machine learning. Even with ready-made models and libraries, the challenge we face today is that, despite the abundance of data, the accuracy we see during training, testing, and actual validation has high variance. Hence, this project was carried out with the motivation to look behind the models and to implement a Logistic Regression model to train on the obtained data. Furthermore, since machine learning as a whole is motivated by the goal of developing appropriate computer-based systems and decision support that can aid the early detection of heart disease, in this project we have developed a model that classifies whether a patient will develop heart disease within ten years, based on various features (i.e., potential risk factors that can cause heart disease), using logistic regression. The early prognosis of cardiovascular disease can thus aid decisions on lifestyle changes for high-risk patients and, in turn, reduce complications, which would be a great milestone in the field of medicine.
The main objectives of developing this project are:
- To explore the feature selection, data preparation, and processing methods behind training machine-learning models.
- To predict the ten-year risk of future heart disease from patient data using a Logistic Regression model.
The dataset is publicly available on the Kaggle website at [4] and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. It provides patient information comprising over 4,000 records and 14 attributes. The attributes include: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate, exercise-induced angina, ST depression induced by exercise, slope of the peak exercise, number of major vessels, and a target ranging from 0 to 2, where 0 denotes the absence of heart disease. The dataset is in CSV (Comma-Separated Values) format and is loaded into a data frame as supported by the pandas library in Python.
Figure 1: Original Dataset Snapshot
The education attribute is irrelevant to an individual's heart disease, so it is dropped, as in the sketch below. Pre-processing and experiments are then carried out on this dataset.
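As a minimal sketch of this loading and cleaning step (the file name framingham.csv and the column name education are assumptions based on the public Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle CSV into a pandas data frame
# (the file name is an assumption; adjust the path to your local copy).
df = pd.read_csv("framingham.csv")

# Drop the 'education' column, which is not a clinical risk factor here.
df = df.drop(columns=["education"])

print(df.shape)   # (records, remaining attributes)
print(df.head())  # snapshot of the first few rows
```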
The main purpose of designing this system is to predict the ten-year risk of future heart disease. We have used Logistic Regression as the machine-learning algorithm to train our system, together with feature selection algorithms such as backward elimination and recursive feature elimination. These algorithms are discussed in detail below.
Logistic Regression is a supervised classification algorithm. It is a predictive-analysis algorithm based on the concept of probability. It measures the relationship between the dependent variable (TenYearCHD) and one or more independent variables (risk factors) by estimating probabilities using the underlying logistic function (sigmoid function). The sigmoid function is used to limit the hypothesis of logistic regression to values between 0 and 1 (squashing), i.e. $0 \le h_\theta(x) \le 1$.
In logistic regression the cost function is defined as:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
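As an illustrative sketch (not the report's own code), the sigmoid and this per-example cost can be written directly with NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y):
    """Per-example logistic regression cost for hypothesis h = sigmoid(theta . x)."""
    # -log(h) penalizes confident wrong answers when y = 1;
    # -log(1 - h) does the same when y = 0.
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Example: a hypothesis near 0.9 is cheap when y = 1, expensive when y = 0.
h = sigmoid(2.2)      # roughly 0.9
print(cost(h, 1))     # small cost
print(cost(h, 0))     # large cost
```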
Logistic Regression relies heavily on a proper presentation of the data. So, to make the model more powerful, important features are selected from the available dataset using the backward elimination and recursive feature elimination techniques.
While building a machine learning model, only the features that have a significant influence on the target variable should be selected. In the backward elimination method for feature selection, the first step is choosing a significance level, or P-value threshold. For our model, we have chosen a 5% significance level, i.e. a P-value of 0.05. The feature with the highest P-value is identified, and if its P-value is greater than the significance level, it is removed from the dataset. The model is fit again on the new dataset, and the process is repeated until the P-values of all remaining features are below the significance level. In this model, the factors male, age, cigsPerDay, prevalentStroke, diabetes, and sysBP were chosen as the significant ones after applying the backward elimination algorithm, as sketched below.
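A minimal sketch of this elimination loop, assuming the statsmodels library is used to obtain per-feature P-values (the report does not name the library for this step):

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Iteratively drop the feature with the highest P-value until
    every remaining feature is below the significance level."""
    features = list(X.columns)
    while features:
        # Fit a logistic model on the current feature set;
        # add_constant() supplies the intercept term.
        model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > significance_level:
            features.remove(worst)   # not significant at 5%: drop it
        else:
            break                    # all remaining features are significant
    return features

# Hypothetical usage with the prepared Framingham data frame:
# significant = backward_elimination(X_scaled, y)
```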
Since the dataset consists of 4,240 observations, with 388 missing values and 644 observations at risk of heart disease, two different experiments were performed for data preparation. First, we tried dropping the rows with missing data, which left only 3,751 observations, of which only 572 were at risk of heart disease. This reduced number of observations would provide inadequate training for our model. So we proceeded instead by imputing the missing data with the mean value of the observations and scaling the features, using the SimpleImputer and StandardScaler modules of Sklearn, as sketched below.
Figure 2: Bar Graph of the Target Classes Before Dropping
Figure 3: Bar Graph of the Target Classes After Dropping
Figure 4: Dataset after Scaling and Imputing
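A brief sketch of that imputation and scaling step, assuming the features sit in the pandas data frame df loaded earlier and the target column is named TenYearCHD as in the public dataset:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Separate the features from the ten-year CHD target.
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# Replace missing values with the column mean rather than dropping rows.
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=X.columns)
```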
The correlation matrix was visualized before feature selection, as shown below.
Figure 5: Correlation Matrix Visualization
It shows that no single feature has a very high correlation with our target value; some features correlate negatively with the target and some positively. The data was also visualized through plots and bar graphs.
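One way to produce such a visualization, continuing from the prepared X_scaled and y above (a sketch using the common seaborn idiom; the report does not name its plotting library):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the features and the TenYearCHD target.
corr = pd.concat([X_scaled, y], axis=1).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix Before Feature Selection")
plt.show()
```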
Feature Selection using the Recursive Feature Elimination with Cross-Validation (RFECV) method:
Figure 8: Top 10 important features supported by RFECV
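A sketch of this selection step with Sklearn's RFECV class, continuing from the scaled features prepared earlier (the estimator settings are assumptions, not the report's exact configuration):

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Recursively eliminate the weakest feature; 5-fold cross-validation
# decides how many features are worth keeping.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="accuracy")
selector.fit(X_scaled, y)

# Column names of the surviving features.
selected_features = X_scaled.columns[selector.support_]
print("Selected features:", list(selected_features))
```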
Finally, the resulting data was split into 80% training and 20% test data, which was then passed to the LogisticRegression model to fit, predict, and score.
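A sketch of that split-fit-score sequence (the random_state seed is an assumption for reproducibility; the report does not specify one):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 80% train / 20% test split of the selected features and target.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled[selected_features], y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))
```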
For the evaluation of the output of our training, the accuracy was analyzed using a confusion matrix.
A confusion matrix, also known as an error matrix, is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized and makes it easy to identify confusion between classes, e.g. when one class is commonly mislabeled as the other. The key to the confusion matrix is that the numbers of correct and incorrect predictions are summarized with count values, broken down by each class, not just as a total number of errors.
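Continuing from the fitted model above, these counts and the derived metrics can be computed with Sklearn's metrics module (a sketch, not the report's own evaluation code):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_pred = model.predict(X_test)

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```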
TP = 3569, FP = , FN = 599, TN =
Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)

TP = 3582, FP = , FN = 600, TN =
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
The accuracy is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
The accuracy obtained during training, with features selected by backward elimination, was 86%, and during testing it was 83%. The accuracy obtained during training, with features selected by the RFECV method, was 86%, and during testing it was 85%.
When performing and testing the various methods of feature selection, it was found that RFECV gave the best results. The methods tried were Backward Elimination with and without KFold, and Recursive Feature Elimination with Cross-Validation (RFECV). The accuracies seen among them ranged around 85%, with 85.5% being the maximum. Though both approaches gave similar accuracy, it was seen that Backward Elimination misclassified more true negatives, and its accuracy showed more variance than RFECV's. The precisions of Backward Elimination and RFECV are 84% and 86% respectively, and the recalls are 0.99 and 1 respectively. The precision and recall also show that the number of misclassifications is lower with RFECV than with Backward Elimination.
Evaluation Metrics   Backward Elimination   RFECV
Accuracy             83%                    85%
Recall               0.99                   1
Precision            0.84                   0.86
Table 3: Comparison between the feature selection models after training and testing through the LogisticRegression model
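For reference, the precision and recall reported above follow the standard confusion-matrix definitions:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$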
Table 4: Work Division

Members: Nirusha Manandhar, Sagun Lal Shrestha, Ruchi Tandukar
Tasks: Data Imputation and Scaling; Data Cleaning; Exploratory Analysis; Feature Selection; Building Model; Result Analysis and Accuracy Test; Documentation