Machine Learning on Prediction of Customer Churn in Telecom Industry, Lab Reports of Machine Learning

Project on machine learning: prediction of customer churn in the telecom industry

Typology: Lab Reports

2019/2020
Uploaded on 12/23/2020

saikat-dey
Project Title: Prediction Of Customer Churn in Telecom
Company using Machine Learning.
Group Members:
Abhijan Bhattacharyya, UEM Kolkata (304201700900005)
Prajwal Bhimrao Walde, UEM Kolkata (304201700900406)
Saikat Dey, UEM Kolkata (304201700900516)
Soumya Ghosh, UEM Kolkata (304201700900678)
Sreetama Chanda, UEM Kolkata (304201700900722)
Document sign date: 19 Sep, 2019


CONTENTS

  1. Acknowledgement
  2. Project Objective
  3. Project Scope
  4. Hardware and Software Requirements
  5. Data Description
  6. Exploratory Data Analysis
  7. Data Pre-processing
  8. Model Building
  9. Future Scope of Improvements
  10. Conclusion

PROJECT OBJECTIVE

Customer churn is the loss of a client or customer, that is, the point at which a customer stops doing business with a company. The churn rate is the rate at which customers leave a company within a specific period of time. A churn rate above a certain threshold can have both tangible and intangible effects on a company’s business success, so companies try to retain as many customers as they can. With advances in data science and machine learning, companies can now identify customers who are likely to stop doing business with them in the near future. This report shows how a telecom company can predict customer churn from customer attributes such as age, gender and more. The features used for churn prediction are described in a later section.
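As a small illustration of the churn rate described above, the sketch below computes it as the share of customers present at the start of a period who left during that period. The function name `churn_rate` and the subscriber figures are hypothetical and are not taken from the project data.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the customers present at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical figures: 2,000 subscribers at the start of the month, 150 of whom left.
monthly_rate = churn_rate(2000, 150)
print(f"Monthly churn rate: {monthly_rate:.1%}")  # prints "Monthly churn rate: 7.5%"
```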

PROJECT SCOPE

The broad scope of the Telecom Company Customer Churn Prediction project includes:

  1. The system will be available on an online platform so that the higher authorities and members of the telecom company can observe how likely customers are to churn given particular aspects of their services.
  2. The system will provide basic information about every customer of the telecom company and the reason behind their churning.
  3. The system will aim for high prediction accuracy so that the telecom company can fix its customers’ issues, prevent churn and bring new customers to the company.

DATA DESCRIPTION

Source of data: Kaggle.com

Taking a closer look, we see that the dataset contains 21 columns (also known as features or variables). The first 20 columns are the independent variables, while the last column is the dependent variable, which contains a binary value of 1 or 0. Here, 1 means the customer left the telecom operator and 0 means the customer did not leave after a period of time. This is known as a binary classification problem, where the dependent variable has only two possible values: a customer either leaves the telecom operator after a period of time or doesn’t.

To load the data and print its first five rows, the code is:

    import pandas as pd

    data = pd.read_csv("Telecom.csv")
    print(data.head())

Now, to see the data type of each column and the number of non-null entries it holds, we write:

    data = pd.read_csv("Telecom.csv")
    data.info()
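Because this is a binary classification problem, it also helps to check how the two classes are balanced before modelling. The sketch below does this on a small hypothetical stand-in for Telecom.csv; the rows are invented for illustration, and the "Yes"/"No" labels assume the target has not yet been encoded as 1/0.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of Telecom.csv (values are invented)
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Share of each class in the target column
class_share = sample["Churn"].value_counts(normalize=True)
print(class_share)
```

On the real file, the same call would be `data["Churn"].value_counts(normalize=True)`.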

  13. StreamingTV – Denotes whether the customer streams media on TV.
  14. StreamingMovies – Denotes whether the customer streams movies.
  15. Contract – Denotes the subscription period of the telecom plan chosen by the customer.
  16. Paperless Billing – Denotes whether the customer is billed online or offline.
  17. Payment Method – Denotes the method used for online payment.
  18. Monthly Charges – Denotes the amount charged to the customer each month.
  19. Total Charges – Denotes the customer’s total bill, including the other services that customer has chosen.
  20. Churn – Denotes whether the customer has churned from the telecom company.

These twenty columns make up the dataset on which our project is based; most of them are categorical. The dataset contains no null values. To confirm this and to calculate the percentage of null values per column, we can use the following code:

    data = pd.read_csv("Telecom.csv")
    print(data.isnull().sum())          # null count per column
    print(data.isnull().mean() * 100)   # null percentage per column
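One caveat is that `isnull()` only counts real missing values. A blank string in a text column, such as the whitespace entries that the model-building code later removes from TotalCharges, is not counted. The sketch below shows the effect on three hypothetical rows. It uses `pd.to_numeric(..., errors="coerce")` as one possible way to expose such entries; this is an illustrative choice, not the regex replacement used in the report.

```python
import pandas as pd

# Hypothetical rows; the second TotalCharges value is a blank string, not NaN
sample = pd.DataFrame({
    "tenure": [1, 0, 24],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

nulls_before = sample.isnull().sum()

# Convert to numbers; anything that cannot be parsed (here the blank) becomes NaN
cleaned = sample.copy()
cleaned["TotalCharges"] = pd.to_numeric(cleaned["TotalCharges"], errors="coerce")

nulls_after = cleaned.isnull().sum()
missing_pct = cleaned.isnull().mean() * 100

print(nulls_before["TotalCharges"], nulls_after["TotalCharges"])
print(missing_pct.round(2))
```

Note that `errors="coerce"` turns every unparseable value into NaN, so it is worth inspecting the affected rows before dropping or imputing them.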

Our dataset doesn’t contain any outliers. Outliers are extreme values that lie far from the rest of the data and can reduce the accuracy of the prediction model. To be on the safe side, we checked whether any outliers are present and whether they matter for our project. The code is:

    sns.boxplot(x="Churn", y="tenure", data=data)

We plotted boxplots of the other data columns against churn in the same way.
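A boxplot draws its whiskers at 1.5 times the interquartile range (IQR) by default, so the same check can be done numerically. The sketch below applies that rule to a hypothetical tenure series with one injected extreme value. The helper `iqr_outliers` is our own illustrative name and is not part of pandas or seaborn.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Hypothetical tenure values in months; 500 is an injected extreme value
tenure = pd.Series([1, 5, 12, 24, 30, 36, 48, 60, 72, 500], name="tenure")
print(iqr_outliers(tenure))
```

An empty result for a column matches a boxplot with no points drawn beyond the whiskers.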

DATA PRE-PROCESSING

First, we checked the data type of each column. For the total charge column we explicitly changed the data type from object to float, and for the other object-type columns we replaced every “yes” with 1 and every “no” with 0. We then counted the number of yes and no values in each column; from these observations we can see that in each column “no” occurs more often than “yes”. From the count plot we observed that the “No Phone Service” and “No Internet Service” values do not affect the final outcome, so we replaced them with zero (0). For the string-type columns (Gender, Tenants, etc.) we used Label Encoder to encode the strings as integer values.

Guided by the correlations, we plotted a countplot of each column against churn to get a clear understanding of the more strongly related columns. We then plotted a distplot to measure the frequency distribution of the highly correlated columns, and used boxplots to identify outliers. For the columns whose values are not evenly distributed, we scaled the values to get a clearer view of the distribution curve.
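The pre-processing steps described above can be sketched on a few hypothetical rows as follows. The values are invented, and the choice of `StandardScaler` for the scaling step is an assumption based on the imports in the model-building code, not a confirmed detail of the project.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical rows using column names from the dataset (values are invented)
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    "StreamingTV": ["No", "Yes", "No internet service", "No"],
    "MonthlyCharges": [29.85, 56.95, 20.05, 70.70],
    "TotalCharges": ["29.85", "1889.5", "108.15", "151.65"],
})

# 1. Convert the total charge column from object to float
df["TotalCharges"] = df["TotalCharges"].astype(float)

# 2. Map yes/no to 1/0, and the "no service" values to 0 as well
yes_no = {"Yes": 1, "No": 0, "No internet service": 0, "No phone service": 0}
for col in ["PaperlessBilling", "StreamingTV"]:
    df[col] = df[col].map(yes_no)

# 3. Label-encode the remaining string column (classes are sorted alphabetically)
df["gender"] = LabelEncoder().fit_transform(df["gender"])

# 4. Standardise the numeric columns to zero mean and unit variance
num_cols = ["MonthlyCharges", "TotalCharges"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df)
```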

MODEL BUILDING

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.metrics import f1_score, confusion_matrix
    from sklearn.feature_selection import SelectKBest, f_classif

    data = pd.read_csv("Telecom.csv")
    data.drop("customerID", axis=1, inplace=True)

    # Blank TotalCharges entries become NaN so the column can be converted to numbers
    data["TotalCharges"] = data["TotalCharges"].replace(r'\s+', np.nan, regex=True)
    data["TotalCharges"] = pd.to_numeric(data["TotalCharges"])

    # Label-encode InternetService into a new integer column
    le_is = LabelEncoder()
    data["internet_n"] = le_is.fit_transform(data["InternetService"])
    data.drop("InternetService", axis=1, inplace=True)

    # Encode yes/no answers and the "no service" values
    data.replace(["Yes", "No"], [1, 0], inplace=True)
    data.replace(["No internet service", "No phone service"], ["9", "9"], inplace=True)
    le_gender = LabelEncoder()
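The imports above point to feature selection with `SelectKBest`, a train/test split and the usual classification metrics. A minimal sketch of how those pieces could fit together is given below. It rests on several assumptions: the classifier (`LogisticRegression`), the value `k=10`, the 80/20 split and the synthetic data from `make_classification` are illustrative choices, not the model, settings or data used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded churn table: 19 features, imbalanced binary target
X, y = make_classification(n_samples=1000, n_features=19, n_informative=6,
                           weights=[0.73, 0.27], random_state=0)

# Hold out 20% of the rows, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scale, keep the 10 features with the highest ANOVA F-score, then classify
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=10),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
cm = confusion_matrix(y_test, y_pred)
print(scores)
print(cm)
```

Placing the scaler and the selector inside the pipeline means they are fitted on the training rows only, which keeps information from the test set out of the model.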