





























This thesis analyzes the Walmart dataset. It covers confidence intervals and the central limit theorem, grounded in probability and statistics, and reports insights and observations.
Typology: Thesis
[174]: import numpy as np
       import pandas as pd
       import matplotlib.pyplot as plt
       import seaborn as sns

[175]: df = pd.read_csv('walmart_data.csv')
       df
[175]:         User_ID Product_ID Gender    Age  Occupation City_Category
       0       1000001  P00069042      F   0-17        10.0             A
       1       1000001  P00248942      F   0-17        10.0             A
       2       1000001  P00087842      F   0-17        10.0             A
       3       1000001  P00085442      F   0-17        10.0             A
       4       1000002  P00285442      M    55+        16.0             C
       …             …          …      …      …           …             …
       375654  1003822  P00116742      M  46-50        20.0             C
       375655  1003822  P00182142      M  46-50        20.0             C
       375656  1003823  P00112442      M    55+         7.0             B
       375657  1003823  P00182342      M    55+         7.0             B
       375658  1003823        P00    NaN    NaN         NaN           NaN

              Stay_In_Current_City_Years  Marital_Status  Product_Category  Purchase
       0                               2             0.0               3.0     8370.
       1                               2             0.0               1.0    15200.
       2                               2             0.0              12.0     1422.
       3                               2             0.0              12.0     1057.
       4                              4+             0.0               8.0     7969.
       …                               …               …                 …         …
       375654                          1             0.0              11.0     6092.
       375655                          1             0.0               1.0    11581.
       375656                          1             0.0               6.0    20414.
       375657                          1             0.0               1.0    19696.
       375658                        NaN             NaN               NaN       NaN

       [375659 rows x 10 columns]
1 Defining Problem Statement and Analyzing basic metrics
The management team at Walmart Inc. wants to analyze customer purchase behavior (specifically, purchase amount) against the customer's gender and various other factors to help the business make better decisions. They want to understand whether spending habits differ between male and female customers: do women spend more on Black Friday than men? (Assume 50 million customers are male and 50 million are female.)
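The gender question above can be approached by estimating a confidence interval for the mean purchase of each group. The following is a minimal sketch, not the thesis's own code: the helper `bootstrap_mean_ci`, its default sample size, and the synthetic `demo` frame are all assumptions made for illustration, and the numbers it produces say nothing about the real Walmart data.

```python
import numpy as np
import pandas as pd


def bootstrap_mean_ci(values, sample_size=300, n_resamples=1000,
                      confidence=0.95, seed=0):
    """Percentile interval of sample means drawn with replacement.

    Hypothetical helper: by the central limit theorem the resampled
    means are approximately normal even if the purchases are skewed.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row is one resample; its mean is one draw from the
    # sampling distribution of the mean.
    samples = rng.choice(values, size=(n_resamples, sample_size), replace=True)
    means = samples.mean(axis=1)
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return float(lower), float(upper)


# Synthetic stand-in for the purchase data (made-up values, not real results).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Gender": ["F"] * 2000 + ["M"] * 2000,
    "Purchase": np.concatenate([
        rng.gamma(shape=3.0, scale=2900.0, size=2000),
        rng.gamma(shape=3.0, scale=3100.0, size=2000),
    ]),
})

# One interval per gender.
intervals = {
    gender: bootstrap_mean_ci(group["Purchase"])
    for gender, group in demo.groupby("Gender")
}
for gender, (lower, upper) in intervals.items():
    print(f"{gender}: 95% CI for mean purchase = ({lower:.0f}, {upper:.0f})")
```

On the real data the same call would be `bootstrap_mean_ci(df.loc[df['Gender'] == 'F', 'Purchase'])`. If the two intervals do not overlap, that is evidence of a difference in average spending; overlapping intervals are inconclusive at that sample size.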
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375659 entries, 0 to 375658
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype
 0   User_ID                     375659 non-null  int64
 1   Product_ID                  375659 non-null  object
 2   Gender                      375658 non-null  object
 3   Age                         375658 non-null  object
 4   Occupation                  375658 non-null  float64
 5   City_Category               375658 non-null  object
 6   Stay_In_Current_City_Years  375658 non-null  object
 7   Marital_Status              375658 non-null  float64
 8   Product_Category            375658 non-null  float64
 9   Purchase                    375658 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 28.7+ MB
[177]: User_ID                       0
       Product_ID                    0
       Gender                        1
       Age                           1
       Occupation                    1
       City_Category                 1
       Stay_In_Current_City_Years    1
       Marital_Status                1
       Product_Category              1
       Purchase                      1
       dtype: int64
[178]: # There is only one row with NaN values, so it is safe to drop it
       df = df.dropna(axis=0)
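Before dropping rows it helps to confirm how many rows are actually affected, since the per-column counts alone do not show whether the missing values sit in one row or in several. A small sketch on a toy frame follows; the three rows are illustrative values shaped like the data shown earlier, and `demo`, `rows_with_nan`, and `cleaned` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy frame: two complete rows and one truncated row (illustrative values).
demo = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1003823],
    "Product_ID": ["P00069042", "P00285442", "P00"],
    "Gender": ["F", "M", np.nan],
    "Purchase": [8370.0, 7969.0, np.nan],
})

# Count rows that contain at least one missing value.
rows_with_nan = int(demo.isna().any(axis=1).sum())
print(f"rows with missing values: {rows_with_nan}")

# Drop those rows and rebuild a contiguous index.
cleaned = demo.dropna(axis=0).reset_index(drop=True)
print(cleaned)
```

If `rows_with_nan` is 1 out of several hundred thousand rows, dropping the row has a negligible effect on the analysis.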
       City_Category Stay_In_Current_City_Years
count         375658                     375658
unique             3                          5
top                B                          1
freq          158354                     132090
↪ User_ID Occupation Marital_Status Product_Category Purchase

[182]: df['User_ID'].value_counts()

[182]: 1001680    741
       1004277    631
       1001941    629
       1001181    605
       1000889    603
       …
       1004178      4
       1005110      3
       1004527      3
       1002111      3
       1005391      3
       Name: User_ID, Length: 5891, dtype: int64
[183]: array([1000001, 1000002, 1000003, …, 1004113, 1005391, 1001529])

[184]: 4.0     49774
       0.0     47660
       7.0     40414
       1.0     32075
       17.0    27327
       20.0    23071
       12.0    21119
       14.0    18642
       2.0     18116
       16.0    17207
       6.0     13900
       3.0     12104
       10.0     8870
       15.0     8256
       5.0      8240

[185]: array([10.0, 16.0, 15.0, 7.0, 20.0, 9.0, 1.0, 12.0, 17.0, 0.0, 3.0, 4.0,
              11.0, 8.0, 19.0, 2.0, 18.0, 5.0, 14.0, 13.0, 6.0], dtype=object)

[187]: df['Marital_Status'].unique()

[187]: array([0.0, 1.0], dtype=object)
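The unique values above are category codes stored as floats such as `10.0`, which is what pandas produces when an integer column contains a missing value. Once the incomplete row is gone, one possible cleanup is to cast these columns to integer-labelled categories. This is a sketch on a toy frame, under the assumption that the codes carry no numeric meaning; it is not a step taken from the notebook itself.

```python
import pandas as pd

# Toy frame with float-coded categorical columns (illustrative values).
demo = pd.DataFrame({
    "Occupation": [10.0, 16.0, 10.0, 7.0],
    "Marital_Status": [0.0, 1.0, 0.0, 0.0],
})

# Cast float codes to int first so labels read 10 rather than 10.0,
# then to the memory-efficient 'category' dtype.
for col in ["Occupation", "Marital_Status"]:
    demo[col] = demo[col].astype(int).astype("category")

print(demo.dtypes)
print(list(demo["Occupation"].cat.categories))
```

The cast to `int` fails if any NaN remains, so it has to run after the missing row has been dropped.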
[195]: plt.figure(figsize=(10, 6))
       sns.countplot(x='Gender', data=df)
       plt.show()

[196]: plt.figure(figsize=(10, 6))
       sns.countplot(x='Occupation', data=df)
       plt.show()

[197]: plt.figure(figsize=(10, 6))
       sns.countplot(x='City_Category', data=df)
       plt.show()

[199]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Product_Category')
       plt.show()

[200]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Stay_In_Current_City_Years')
       plt.show()

[201]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Age')
       plt.show()
[203]: sns.boxplot(data=df, x='Occupation', orient='h')
       plt.show()

[204]: sns.boxplot(data=df, x='Product_Category', orient='h')
       plt.show()

[205]: plt.figure(figsize=(10, 6))
       sns.boxplot(data=df, x='Gender', y='Purchase')
       plt.show()

[207]: plt.figure(figsize=(10, 6))
       sns.boxplot(data=df, x='Occupation', y='Purchase')
       plt.show()

[208]: plt.figure(figsize=(10, 6))
       sns.boxplot(data=df, x='City_Category', y='Purchase')
       plt.show()

[209]: plt.figure(figsize=(10, 6))
       sns.boxplot(data=df, x='Stay_In_Current_City_Years', y='Purchase')
       plt.show()
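The boxplots above show quartiles and outliers visually. By default seaborn draws the whiskers using the 1.5 × IQR rule, so the same summary can be computed as numbers per group. The sketch below assumes that default; `iqr_summary` is a hypothetical helper and the toy values are invented for the example.

```python
import pandas as pd


def iqr_summary(df, group_col, value_col):
    """Quartiles, 1.5*IQR fences, and outlier counts per group."""
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    iqr = q3 - q1
    summary = pd.DataFrame({
        "q1": q1,
        "median": grouped.median(),
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    })
    # Attach each row's group fences, then flag values outside them.
    merged = df.join(summary[["lower_fence", "upper_fence"]], on=group_col)
    is_outlier = ((merged[value_col] < merged["lower_fence"])
                  | (merged[value_col] > merged["upper_fence"]))
    summary["n_outliers"] = is_outlier.groupby(df[group_col]).sum()
    return summary


# Toy data: group F contains one obvious outlier (100).
demo = pd.DataFrame({
    "Gender": ["F"] * 5 + ["M"] * 5,
    "Purchase": [1, 2, 3, 4, 100, 10, 20, 30, 40, 50],
})
summary = iqr_summary(demo, "Gender", "Purchase")
print(summary)
```

On the notebook's data the call would be `iqr_summary(df, 'Gender', 'Purchase')`, and likewise for `City_Category` or `Stay_In_Current_City_Years`.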
For correlation: Heatmaps, Pairplots
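[212]: plt.figure(figsize=(10, 6))
       # Restrict the correlation to numeric columns; string columns such as
       # Gender cannot be correlated and raise an error in recent pandas versions
       sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, cmap='coolwarm')
       plt.show()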
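A correlation matrix is only defined for numeric columns, and the frame here also holds string columns such as Gender and Age. One way to make the input explicit is to select the numeric columns before calling `corr()`. The sketch below uses a small invented frame; `numeric_corr` is a name chosen for this example.

```python
import pandas as pd

# Toy frame mixing string and numeric columns (illustrative values).
demo = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Occupation": [10.0, 16.0, 7.0, 20.0, 4.0],
    "Purchase": [8370.0, 7969.0, 1422.0, 15200.0, 1057.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_corr = demo.select_dtypes(include="number").corr()
print(numeric_corr)
```

The resulting frame can be passed directly to `sns.heatmap`. Note that coded columns such as Occupation are labels rather than quantities, so their correlation with Purchase should be read with caution.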
[213]: sns.pairplot(df, hue='Gender')