Machine Learning Approach for Heart Disease Prediction - Prof. Kumar, Schemes and Mind Maps of Machine Learning

A project that utilizes machine learning techniques such as the backward elimination algorithm, logistic regression, and RFECV to predict heart disease based on patient attributes. The project aims to identify impending heart disease so that high-risk patients can make lifestyle changes, potentially reducing complications. The project uses a publicly available dataset from Kaggle and evaluates the results using a confusion matrix and cross-validation.

Kathmandu University
Department of Computer Science and Engineering
Dhulikhel, Kavre
A Mini-Project Report
On
“Heart Disease Prediction”
[COMP 484]
Machine Learning
Submitted by
Nirusha Manandhar (31)
Sagun Lal Shrestha (53)
Ruchi Tandukar (57)
Submitted to
Dr. Bal Krishna Bal
Associate Professor
Department of Computer Science and Engineering
Submission Date: 11th March 2020



Table of Contents

ABSTRACT
LIST OF FIGURES:
LIST OF TABLES:
LIST OF ABBREVIATIONS:

  • CHAPTER 1: INTRODUCTION
    • 1.1 Problem Definition
    • 1.2 Motivation
    • 1.3 Objectives
  • CHAPTER 2: RELATED WORKS
  • CHAPTER 3: DATASETS
  • CHAPTER 4: METHODS AND ALGORITHMS USED
    • 4.1 Logistic Regression
    • 4.2 Backward Elimination Method:
    • 4.3 Recursive Feature Elimination using Cross-Validation (RFECV)
  • CHAPTER 5: EXPERIMENTS
    • 5.1 Data Preparation
    • 5.2 Exploratory Analysis:
    • 5.3 Feature Selection
    • 5.4 Training and testing
  • CHAPTER 6: EVALUATION METRICS
    • 6.1 Confusion Matrix
    • 6.2 Accuracy
    • 6.3 Recall
    • 6.4 Precision
  • CHAPTER 6: DISCUSSION ON RESULTS
  • CHAPTER 7: CONTRIBUTIONS
  • CHAPTER 9: CODE
    • 9.1 Libraries used:
  • CHAPTER 10: CONCLUSION

LIST OF FIGURES:

  • Figure 1: Original Dataset Snapshot
  • Figure 2: Bar Graph of the Target Classes After Dropping
  • Figure 3: Bar Graph of the Target Classes Before Dropping
  • Figure 4: Dataset after Scaling and Imputing
  • Figure 5: Correlation Matrix Visualization
  • Figure 6: Result from Feature Selection using Backward Elimination Method
  • Figure 7: Dataset After Dropping Columns after Feature Selection
  • Figure 8: Top 10 important features supported by RFECV

LIST OF TABLES:

Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
Table 3: Comparison between the feature selection models after training and testing through LogisticRegression model
Table 4: Work Division
Table 5: Major modules and classes used from Sklearn

CHAPTER 1: INTRODUCTION

According to the World Health Organization, 12 million deaths occur worldwide every year due to heart disease. The burden of cardiovascular disease has been increasing rapidly all over the world over the past few years. Much research has been conducted in an attempt to pinpoint the most influential risk factors for heart disease as well as to accurately predict the overall risk. Heart disease is even described as a silent killer, as it can lead to death without obvious symptoms. Early diagnosis of heart disease plays a vital role in deciding on lifestyle changes for high-risk patients and in turn reduces complications. This project aims to predict future heart disease by analyzing patient data, classifying whether patients have heart disease or not using machine-learning algorithms.

1.1 Problem Definition

The major challenge in heart disease is its detection. Instruments that can predict heart disease exist, but they are either expensive or not efficient enough to estimate the chance of heart disease in humans. Early detection of cardiac disease can decrease the mortality rate and overall complications. However, it is not possible to monitor patients accurately every day in all cases, and round-the-clock consultation with a doctor is not available, since it requires considerable time, wisdom, and expertise. Since a good amount of data is available in today's world, various machine learning algorithms can be used to analyze it for hidden patterns, and those hidden patterns can then be used for health diagnosis from medical data.

1.2 Motivation

Machine learning techniques have long been compared and applied across many kinds of data science applications. The major motivation behind this research-based project was to explore the feature selection methods and the data preparation and processing that lie behind the training of machine learning models. Even with ready-made models and libraries, the challenge we face today is that, despite the abundance of data, the accuracy seen during training, testing, and actual validation can vary widely. Hence this project was carried out with the motivation to look behind the models and to implement a Logistic Regression model to train on the obtained data. Furthermore, since the broader goal of machine learning here is to develop an appropriate computer-based decision-support system that can aid early detection of heart disease, in this project we developed a model that classifies whether a patient will develop heart disease within ten years, based on various features (i.e., potential risk factors that can cause heart disease), using logistic regression. Early prognosis of cardiovascular disease can thus aid decisions on lifestyle changes for high-risk patients and in turn reduce complications, which would be a great milestone in the field of medicine.

1.3 Objectives

The main objectives of developing this project are:

  1. To develop a machine learning model that predicts the future possibility of heart disease by implementing Logistic Regression.
  2. To determine significant risk factors, based on the medical dataset, which may lead to heart disease.
  3. To analyze feature selection methods and understand their working principles.

CHAPTER 3: DATASETS

The dataset is publicly available on the Kaggle website at [4] and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. It provides patient information comprising over 4,000 records and 14 attributes. The attributes include: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate, exercise-induced angina, ST depression induced by exercise, slope of the peak exercise ST segment, number of major vessels, and the target, ranging from 0 to 2, where 0 denotes absence of heart disease. The dataset is in CSV (Comma-Separated Values) format and is loaded into a data frame using the pandas library in Python.

Figure 1: Original Dataset Snapshot

The education attribute is irrelevant to an individual's heart disease, so it is dropped. Pre-processing and experiments are then carried out on this dataset.
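As a sketch, the CSV-to-DataFrame step described above might look like the following; the two inline rows are synthetic stand-ins for the real Kaggle file, which would normally be read from disk with `pd.read_csv("framingham.csv")` (filename assumed):

```python
from io import StringIO

import pandas as pd

# Stand-in for the real Kaggle CSV; column names follow the Framingham
# dataset, the two rows are synthetic examples.
csv_text = StringIO(
    "male,age,education,currentSmoker,sysBP,TenYearCHD\n"
    "1,39,4,0,106.0,0\n"
    "0,46,2,0,121.0,0\n"
)

df = pd.read_csv(csv_text)

# The education column is considered irrelevant to heart disease risk,
# so it is dropped before pre-processing.
df = df.drop(columns=["education"])
```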

CHAPTER 4: METHODS AND ALGORITHMS USED

The main purpose of designing this system is to predict the ten-year risk of future heart disease. We have used Logistic Regression as the machine-learning algorithm to train our system, together with feature selection algorithms such as Backward Elimination and Recursive Feature Elimination. These algorithms are discussed in detail below.

4.1 Logistic Regression

Logistic Regression is a supervised classification algorithm. It is a predictive-analysis algorithm based on the concept of probability. It measures the relationship between the dependent variable (TenYearCHD) and one or more independent variables (risk factors) by estimating probabilities with the underlying logistic (sigmoid) function. The sigmoid function limits the hypothesis of logistic regression to values between 0 and 1 (squashing), i.e. 0 ≤ hθ(x) ≤ 1.

In logistic regression the cost function is defined as:

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

Logistic Regression relies heavily on a proper presentation of the data. So, to make the model more powerful, important features are selected from the available dataset using the backward elimination and recursive feature elimination techniques.
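The sigmoid squashing and the piecewise cost above can be written down directly; this is a minimal illustration, not the training loop itself:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real z into (0, 1), so 0 <= h_theta(x) <= 1.
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y):
    # Piecewise logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0.
    return -np.log(h) if y == 1 else -np.log(1.0 - h)
```

A confident prediction that matches the label is cheap (cost(0.9, 1) ≈ 0.105) while a confident mistake is expensive (cost(0.1, 1) ≈ 2.30), which is what pushes the fitted probabilities toward the true labels.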

4.2 Backward Elimination Method:

While building a machine learning model, only the features that have a significant influence on the target variable should be selected. In the backward elimination method for feature selection, the first step is choosing a significance level; for our model, we chose a 5% significance level, i.e. a P-value threshold of 0.05. The feature with the highest P-value is identified, and if its P-value is greater than the significance level it is removed from the dataset. The model is then fit again on the reduced dataset, and the process is repeated until the P-values of all remaining features are below the significance level. In this model, the factors male, age, cigsPerDay, prevalentStroke, diabetes, and sysBP were chosen as significant after applying the backward elimination algorithm.

CHAPTER 5: EXPERIMENTS

5.1 Data Preparation

Since the dataset consists of 4,240 observations with 388 missing values and 644 observations at risk of heart disease, two different experiments were performed for data preparation. First, we tried dropping the rows with missing data, which left only 3,751 observations, of which only 572 were at risk of heart disease.

This reduced number of observations would provide inadequate training for our model. So, we instead imputed the missing data with the mean value of the observations and scaled the features, using the SimpleImputer and StandardScaler modules of Sklearn.

Figure 4: Dataset after Scaling and Imputing

Figure 2: Bar Graph of the Target Classes After Dropping

Figure 3: Bar Graph of the Target Classes Before Dropping
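A minimal sketch of that imputation-and-scaling step, on a tiny made-up frame with one missing value per column (the real pipeline would run on the full Framingham frame):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny stand-in frame with one missing value per column.
df = pd.DataFrame({
    "age": [39.0, 46.0, np.nan, 61.0],
    "sysBP": [106.0, 121.0, 127.5, np.nan],
})

imputed = SimpleImputer(strategy="mean").fit_transform(df)  # fill NaNs with column means
scaled = StandardScaler().fit_transform(imputed)            # zero mean, unit variance
```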

5.2 Exploratory Analysis:

The correlation matrix, visualized before feature selection:

Figure 5: Correlation Matrix Visualization

It shows that there is no single feature that has a very high correlation with our target value. Also, some of the features have a negative correlation with the target value and some have positive. The data was also visualized through plots and bar graphs.

5.3 Feature Selection

Feature selection using the Recursive Feature Elimination with Cross-Validation (RFECV) method:

Figure 8: Top 10 important features supported by RFECV
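A sketch of how RFECV might be run with scikit-learn; synthetic data stands in here for the prepared dataset so the example is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Recursively eliminate features, keeping the subset with the best
# cross-validated score of the logistic regression estimator.
selector = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5)
selector.fit(X, y)

chosen = selector.support_   # boolean mask of the kept features
n_kept = selector.n_features_
```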

5.4 Training and testing

Finally, the resulting data was split into 80% training and 20% test sets, which were then passed to the LogisticRegression model to fit, predict, and score.
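That split-fit-score step might look like this sketch (again on synthetic data rather than the report's actual frame):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 80% train / 20% test, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)  # mean accuracy on held-out data
```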

CHAPTER 6: EVALUATION METRICS

To evaluate the output of training, accuracy was analyzed using the confusion matrix.

6.1 Confusion Matrix

A confusion matrix, also known as an error matrix, is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm and easy identification of confusion between classes, e.g. one class commonly being mislabeled as the other. The key to the confusion matrix is that the numbers of correct and incorrect predictions are summarized with count values and broken down by class, not just as a total number of errors.

TP = 3569   FP =
FN = 599    TN =
Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)

TP = 3582   FP =
FN = 600    TN =
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
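As a small self-contained illustration of how such a matrix is obtained with scikit-learn (toy labels, not the report's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1] the matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

For these toy labels the counts come out as TP = 3, TN = 3, FP = 1, FN = 1.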

6.2 Accuracy

The accuracy is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where,

  • True Positive (TP) = observation is positive, and is predicted to be positive.
  • False Negative (FN) = observation is positive, but is predicted to be negative.
  • True Negative (TN) = observation is negative, and is predicted to be negative.
  • False Positive (FP) = observation is negative, but is predicted to be positive.
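The three metrics defined in this chapter reduce to a few lines; the counts below are hypothetical, since the report's FP and TN counts are not shown in this preview:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    # Of all truly positive observations, the fraction predicted positive.
    return tp / (tp + fn)

def precision(tp, fp):
    # Of all positive predictions, the fraction that were truly positive.
    return tp / (tp + fp)

# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 80, 10, 5, 5
acc = accuracy(tp, tn, fp, fn)   # (80 + 10) / 100 = 0.9
```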

The accuracy obtained during training after feature selection by backward elimination was 86%, and during testing it was 83%.

The accuracy obtained during training after feature selection by the RFECV method was 86%, and during testing it was 85%.

CHAPTER 6: DISCUSSION ON RESULTS

When testing the various feature selection methods, it was found that backward elimination gave the best results among them. The methods tried were Backward Elimination with and without KFold, and Recursive Feature Elimination with Cross-Validation (RFECV). The accuracies seen ranged around 85%, with 85.5% being the maximum. Though both methods gave similar accuracy, it was seen that Backward Elimination produced more True-Negative misclassifications, and its accuracy showed more variance than RFECV. The precision of Backward Elimination and RFECV are 84% and 86% respectively, and the recalls are 0.99 and 1 respectively. The precision and recall also show that the number of misclassifications is lower with RFECV than with Backward Elimination.

Evaluation Metrics   Backward Elimination   RFECV
Accuracy             83%                    85%
Recall               0.99                   1.00
Precision            0.84                   0.86
Table 3: Comparison between the feature selection models after training and testing through the LogisticRegression model

CHAPTER 7: CONTRIBUTIONS

Table 4: Work Division

Members: Nirusha Manandhar, Sagun Lal Shrestha, Ruchi Tandukar

Tasks: Data Imputation and Scaling, Data Cleaning, Exploratory Analysis, Feature Selection, Building Model, Result Analysis and Accuracy Test, Documentation