



























Project on Machine Learning: Prediction of Customer Churn in the Telecom Industry
Typology: Lab Reports
Abhijan Bhattacharyya, UEM Kolkata (304201700900005)
Prajwal Bhimrao Walde, UEM Kolkata (304201700900406)
Saikat Dey, UEM Kolkata (304201700900516)
Soumya Ghosh, UEM Kolkata (304201700900678)
Sreetama Chanda, UEM Kolkata (304201700900722)
1. Acknowledgement
2. Project Objective
3. Project Scope
4. Hardware and Software Requirements
5. Data Description
6. Exploratory Data Analysis
7. Data Pre-processing
8. Model Building
9. Future Scope of Improvements
10. Conclusion
Customer churn is a financial term that refers to the loss of a client or customer, that is, when a customer ceases to interact with a company or business. Similarly, the churn rate is the rate at which customers or clients leave a company within a specific period of time. A churn rate above a certain threshold can have both tangible and intangible effects on a company's business success. Ideally, companies want to retain as many customers as they can. With the advent of advanced data science and machine learning techniques, companies can now identify customers who are likely to stop doing business with them in the near future. This report shows how a telecom company can predict customer churn from customer attributes such as age, gender and more. The features used for churn prediction are detailed in a later section.
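As a small illustration of the churn-rate definition above, the sketch below computes the rate as the number of customers lost during a period divided by the number of customers at the start of that period. The helper name churn_rate and the figures are hypothetical, chosen only to show the arithmetic.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost during a period (hypothetical helper)."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Made-up figures: 2,000 customers at the start of a month, 50 of whom leave.
rate = churn_rate(2000, 50)
print(f"Monthly churn rate: {rate:.1%}")
```

A company would compare this figure against its own threshold to decide whether retention measures are needed.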
The broad scope of the Telecom Company Customer Churning Prediction Machine project includes:
Source of data: Kaggle.com

Taking a closer look, we see that the dataset contains 21 columns (also known as features or variables). The first 20 columns are the independent variables, while the last column is the dependent variable, which contains a binary value of 1 or 0. Here, 1 means the customer left the telecom operator, and 0 means the customer did not leave after a period of time. This is known as a binary classification problem: the dependent variable has only two possible values, because a customer either leaves the telecom operator after a period of time or does not.

To load the data and print its first five rows, the code is:

    import pandas as pd
    data = pd.read_csv("Telecom.csv")
    print(data.head())
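The split between independent variables and the binary dependent variable described above can be sketched on a toy table. The rows below are made-up illustrative data, not the Kaggle dataset. Only the column names tenure, TotalCharges and Churn and the 1/0 coding follow the report.

```python
import pandas as pd

# Made-up toy data standing in for the real file (an assumption, not Telecom.csv).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50, 1949.40],
    "Churn": [0, 0, 1, 0, 1, 0],  # 1 = customer left, 0 = customer stayed
})

# Independent variables: every column except the last.
X = toy.iloc[:, :-1]
# Dependent variable: the last column.
y = toy.iloc[:, -1]

# For a binary classification problem the target takes exactly two values.
print(sorted(y.unique()))
# Share of each class, useful for spotting class imbalance.
print(y.value_counts(normalize=True))
```

Checking the class shares early is a cheap way to see whether accuracy alone would be a misleading metric later on.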
To see the data type of each column and the number of non-null entries it contains, we write:

    data = pd.read_csv("Telecom.csv")
    data.info()
Outliers are observations that lie far from the bulk of the data and can reduce the accuracy of a predictive model. Our dataset does not appear to contain any, but to be on the safe side we checked each column for outliers with boxplots grouped by churn. For example, for the tenure column the code is:

    sns.boxplot(x="Churn", y="tenure", data=data)

We plotted boxplots of the other data columns against churn in the same way.
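A boxplot flags points that fall beyond its whiskers, which by default sit 1.5 times the interquartile range (IQR) from the quartiles. The same rule can be applied numerically, as in the minimal sketch below. The helper iqr_outliers and the sample values are hypothetical, and this is not necessarily the check the project itself ran.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Made-up tenure values (in months); 400 is deliberately extreme.
tenure = pd.Series([1, 5, 12, 24, 30, 36, 48, 60, 72, 400])
print(iqr_outliers(tenure))

# A series with no extreme values yields an empty result.
clean = pd.Series([1, 5, 12, 24, 30, 36, 48, 60, 72])
print(iqr_outliers(clean).empty)
```

An empty result for every numeric column would support the claim that the data contain no outliers under this rule.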
First, we checked the data type of each column. For the TotalCharges column, we explicitly changed the data type from object to float. For the other object-type columns, we replaced every "Yes" with 1 and every "No" with 0, and then counted the number of each value per column. From these counts, we can see that in each column "No" occurs more often than "Yes".

From the count plots, we observed that the "No phone service" and "No internet service" values do not affect the final outcome, so we replaced them with zero (0). For the string-type columns (Gender, Tenants, etc.), we used LabelEncoder to encode the strings as integer values.

Guided by the correlations, we plotted a count plot of each column against churn to get a clear picture of which columns are most closely related to it. We then plotted distribution plots (distplot) to examine the frequency distribution of the highly correlated columns, and used boxplots to identify outliers. For columns whose values are not evenly distributed, we scaled the values to get a clearer view of the distribution curve.
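The encoding steps described above can be sketched on a toy table. The rows below are made up, the column name OnlineSecurity is an assumption used only to carry a "No internet service" value, and the order of operations is illustrative rather than a transcript of the project code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; the blank TotalCharges entry is what makes the column dtype object.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],  # assumed column
    "TotalCharges": ["29.85", " ", "108.15", "1889.5"],
    "Churn": ["No", "No", "Yes", "No"],
})

# 1. Convert TotalCharges from object to float; blank strings become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# 2. Treat the "No ... service" values as a plain "No", then map Yes/No to 1/0.
for col in ["OnlineSecurity", "Churn"]:
    toy[col] = (
        toy[col]
        .replace({"No internet service": "No", "No phone service": "No"})
        .map({"Yes": 1, "No": 0})
    )

# 3. Encode the remaining string column as integers.
gender_encoder = LabelEncoder()
toy["gender"] = gender_encoder.fit_transform(toy["gender"])

print(toy.dtypes)
print(toy)
```

LabelEncoder assigns integers in sorted order of the class labels, so the mapping can be read back from gender_encoder.classes_ when interpreting the model later.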
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.feature_selection import SelectKBest, f_classif

# Load the data and drop the customer identifier column
data = pd.read_csv("Telecom.csv")
data.drop("customerID", axis=1, inplace=True)

# Blank TotalCharges entries become NaN so the column can be converted to numeric
data["TotalCharges"] = data["TotalCharges"].replace(r'\s+', np.nan, regex=True)
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"])

# Label-encode InternetService into a new integer column
le_is = LabelEncoder()
data["internet_n"] = le_is.fit_transform(data["InternetService"])
data.drop("InternetService", axis=1, inplace=True)

# Map Yes/No to 1/0 and mark the "No ... service" categories with "9"
data.replace(["Yes", "No"], [1, 0], inplace=True)
data.replace(["No internet service", "No phone service"], ["9", "9"], inplace=True)

le_gender = LabelEncoder()