A Mini-Project Report On "Heart Disease Prediction"
[COMP 484] Machine Learning

Submitted by: Nirusha Manandhar (31), Sagun Lal Shrestha (53), Ruchi Tandukar (57)

Submitted to: Dr. Bal Krishna Bal, Associate Professor, Department of Computer Science and Engineering
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
Table 3: Comparison between the feature selection models after training and testing through the LogisticRegression model
Table 4: Work Division
Table 5: Major modules and classes used from Sklearn
According to the World Health Organization, 12 million deaths occur worldwide every year due to heart disease. The burden of cardiovascular disease has been rising rapidly across the world over the past few years. Many studies have been conducted in an attempt to pinpoint the most influential risk factors of heart disease and to accurately predict the overall risk. Heart disease is even called a silent killer because it can lead to death without obvious symptoms. Early diagnosis of heart disease plays a vital role in deciding on lifestyle changes for high-risk patients and, in turn, reduces complications. This project aims to predict future heart disease by analyzing patient data and classifying whether patients have heart disease or not using machine-learning algorithms.
The major challenge with heart disease is its detection. Instruments that can predict heart disease exist, but they are either expensive or not efficient at estimating a person's chance of developing it. Early detection of cardiac disease can decrease the mortality rate and overall complications. However, accurately monitoring patients every day is not feasible in all cases, and round-the-clock consultation with a doctor is not available, since it demands considerable wisdom, time, and expertise. Since a large amount of data is available in today's world, we can apply various machine learning algorithms to analyze it for hidden patterns, which can then be used for health diagnosis on medical data.
Machine learning techniques have been compared and used for analysis in many kinds of data science applications. The major motivation behind this research-based project was to explore the feature selection methods and the data preparation and processing that lie behind training models in machine learning. Even with ready-made models and libraries, the challenge we face today is that, despite the abundance of data, the accuracy we see during training, testing, and actual validation has high variance. Hence, this project was carried out with the motivation to look behind the models and to implement a Logistic Regression model to train on the obtained data. Furthermore, since machine learning as a whole is motivated by the goal of developing appropriate computer-based systems and decision support that can aid the early detection of heart disease, in this project we have developed a model that classifies whether a patient will develop heart disease within ten years, based on various features (i.e., potential risk factors that can cause heart disease), using logistic regression. The early prognosis of cardiovascular disease can thus aid decisions on lifestyle changes for high-risk patients and, in turn, reduce complications, which would be a great milestone in the field of medicine.
The main objectives of developing this project are:
- To explore the feature selection, data preparation, and processing methods behind training machine-learning models.
- To predict the ten-year risk of future heart disease from patient data using a Logistic Regression model.
The dataset is publicly available on the Kaggle website at [4] and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. It provides patient information comprising over 4,000 records and 14 attributes. The attributes include: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate, exercise-induced angina, ST depression induced by exercise, slope of the peak exercise, number of major vessels, and a target ranging from 0 to 2, where 0 denotes the absence of heart disease. The dataset is in CSV (Comma-Separated Values) format and is loaded into a data frame as supported by the pandas library in Python.
Figure 1: Original Dataset Snapshot
The education attribute is irrelevant to an individual's heart disease, so it is dropped, as in the sketch below. Pre-processing and experiments are then carried out on this dataset.
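As a minimal sketch of this loading and cleaning step (the file name framingham.csv and the column name education are assumptions based on the public Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle CSV into a pandas data frame
# (the file name is an assumption; adjust the path to your local copy).
df = pd.read_csv("framingham.csv")

# Drop the 'education' column, which is not a clinical risk factor here.
df = df.drop(columns=["education"])

print(df.shape)   # (records, remaining attributes)
print(df.head())  # snapshot of the first few rows
```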
The main purpose of designing this system is to predict the ten-year risk of future heart disease. We have used Logistic Regression as the machine-learning algorithm to train our system, together with feature selection algorithms such as backward elimination and recursive feature elimination. These algorithms are discussed in detail below.
Logistic Regression is a supervised classification algorithm. It is a predictive-analysis algorithm based on the concept of probability. It measures the relationship between the dependent variable (TenYearCHD) and one or more independent variables (risk factors) by estimating probabilities using the underlying logistic function (sigmoid function). The sigmoid function is used to limit the hypothesis of logistic regression to values between 0 and 1 (squashing), i.e. $0 \le h_\theta(x) \le 1$.
In logistic regression the cost function is defined as:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
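As an illustrative sketch (not the report's own code), the sigmoid and this per-example cost can be written directly with NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y):
    """Per-example logistic regression cost for hypothesis h = sigmoid(theta . x)."""
    # -log(h) penalizes confident wrong answers when y = 1;
    # -log(1 - h) does the same when y = 0.
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Example: a hypothesis near 0.9 is cheap when y = 1, expensive when y = 0.
h = sigmoid(2.2)      # roughly 0.9
print(cost(h, 1))     # small cost
print(cost(h, 0))     # large cost
```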
Logistic Regression relies heavily on a proper presentation of the data. So, to make the model more powerful, important features are selected from the available dataset using the backward elimination and recursive feature elimination techniques.
While building a machine learning model, only the features that have a significant influence on the target variable should be selected. In the backward elimination method for feature selection, the first step is choosing a significance level, or P-value threshold. For our model, we have chosen a 5% significance level, i.e. a P-value of 0.05. The feature with the highest P-value is identified, and if its P-value is greater than the significance level, it is removed from the dataset. The model is fit again on the new dataset, and the process is repeated until the P-values of all remaining features are below the significance level. In this model, the factors male, age, cigsPerDay, prevalentStroke, diabetes, and sysBP were chosen as the significant ones after applying the backward elimination algorithm, as sketched below.
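A minimal sketch of this elimination loop, assuming the statsmodels library is used to obtain per-feature P-values (the report does not name the library for this step):

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Iteratively drop the feature with the highest P-value until
    every remaining feature is below the significance level."""
    features = list(X.columns)
    while features:
        # Fit a logistic model on the current feature set;
        # add_constant() supplies the intercept term.
        model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > significance_level:
            features.remove(worst)   # not significant at 5%: drop it
        else:
            break                    # all remaining features are significant
    return features

# Hypothetical usage with the prepared Framingham data frame:
# significant = backward_elimination(X_scaled, y)
```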
Since the dataset consists of 4,240 observations, with 388 missing values and 644 observations at risk of heart disease, two different experiments were performed for data preparation. First, we tried dropping the rows with missing data, which left only 3,751 observations, of which only 572 were at risk of heart disease. This reduced number of observations would provide inadequate training for our model. So we proceeded instead by imputing the missing data with the mean value of the observations and scaling the features, using the SimpleImputer and StandardScaler modules of Sklearn, as sketched below.
Figure 2: Bar Graph of the Target Classes Before Dropping
Figure 3: Bar Graph of the Target Classes After Dropping
Figure 4: Dataset after Scaling and Imputing
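A brief sketch of that imputation and scaling step, assuming the features sit in the pandas data frame df loaded earlier and the target column is named TenYearCHD as in the public dataset:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Separate the features from the ten-year CHD target.
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# Replace missing values with the column mean rather than dropping rows.
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=X.columns)
```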
The correlation matrix was visualized before feature selection, as shown below.
Figure 5: Correlation Matrix Visualization
It shows that no single feature has a very high correlation with our target value; some features correlate negatively with the target and some positively. The data was also visualized through plots and bar graphs.
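One way to produce such a visualization, continuing from the prepared X_scaled and y above (a sketch using the common seaborn idiom; the report does not name its plotting library):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the features and the TenYearCHD target.
corr = pd.concat([X_scaled, y], axis=1).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix Before Feature Selection")
plt.show()
```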
Feature Selection using the Recursive Feature Elimination with Cross-Validation (RFECV) method:
Figure 8: Top 10 important features supported by RFECV
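A sketch of this selection step with Sklearn's RFECV class, continuing from the scaled features prepared earlier (the estimator settings are assumptions, not the report's exact configuration):

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Recursively eliminate the weakest feature; 5-fold cross-validation
# decides how many features are worth keeping.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="accuracy")
selector.fit(X_scaled, y)

# Column names of the surviving features.
selected_features = X_scaled.columns[selector.support_]
print("Selected features:", list(selected_features))
```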
Finally, the resulting data was split into 80% training and 20% test data, which was then passed to the LogisticRegression model to fit, predict, and score.
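A sketch of that split-fit-score sequence (the random_state seed is an assumption for reproducibility; the report does not specify one):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 80% train / 20% test split of the selected features and target.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled[selected_features], y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))
```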
For the evaluation of the output of our training, the accuracy was analyzed using a confusion matrix.
A confusion matrix, also known as an error matrix, is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized and makes it easy to identify confusion between classes, e.g. when one class is commonly mislabeled as the other. The key to the confusion matrix is that the numbers of correct and incorrect predictions are summarized with count values, broken down by each class, not just as a total number of errors.
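Continuing from the fitted model above, these counts and the derived metrics can be computed with Sklearn's metrics module (a sketch, not the report's own evaluation code):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_pred = model.predict(X_test)

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```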
TP = 3569, FP = , FN = 599, TN =
Table 1: Confusion Matrix Obtained after training the data (feature selection by backward elimination)

TP = 3582, FP = , FN = 600, TN =
Table 2: Confusion Matrix Obtained after training the data (feature selection by RFECV method)
The accuracy is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
The accuracy obtained during training, with features selected by backward elimination, was 86%, and during testing it was 83%. The accuracy obtained during training, with features selected by the RFECV method, was 86%, and during testing it was 85%.
When performing and testing the various methods of feature selection, it was found that RFECV gave the best results. The methods tried were Backward Elimination with and without KFold, and Recursive Feature Elimination with Cross-Validation (RFECV). The accuracies seen among them ranged around 85%, with 85.5% being the maximum. Though both approaches gave similar accuracy, it was seen that Backward Elimination misclassified more true negatives, and its accuracy showed more variance than RFECV's. The precisions of Backward Elimination and RFECV are 84% and 86% respectively, and the recalls are 0.99 and 1 respectively. The precision and recall also show that the number of misclassifications is lower with RFECV than with Backward Elimination.
Evaluation Metrics   Backward Elimination   RFECV
Accuracy             83%                    85%
Recall               0.99                   1
Precision            0.84                   0.86
Table 3: Comparison between the feature selection models after training and testing through the LogisticRegression model
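For reference, the precision and recall reported above follow the standard confusion-matrix definitions:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$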
Table 4: Work Division

Members: Nirusha Manandhar, Sagun Lal Shrestha, Ruchi Tandukar
Tasks: Data Imputation and Scaling; Data Cleaning; Exploratory Analysis; Feature Selection; Building Model; Result Analysis and Accuracy Test; Documentation