Analytics on the case study, Thesis of Integrated Case Studies

This thesis analyzes the Walmart dataset. The analysis, including confidence intervals and the central limit theorem, draws on probability and statistics and is presented with insights and observations.

Typology: Thesis

2022/2023

Uploaded on 03/24/2023

18h51a04h4-peshimaam-mohammed-muzam

ohd5udrk
March 24, 2023
[174]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[175]: df = pd.read_csv('walmart_data.csv')
df
[175]:
        User_ID Product_ID Gender    Age  Occupation City_Category  \
0       1000001  P00069042      F   0-17        10.0             A
1       1000001  P00248942      F   0-17        10.0             A
2       1000001  P00087842      F   0-17        10.0             A
3       1000001  P00085442      F   0-17        10.0             A
4       1000002  P00285442      M    55+        16.0             C
...         ...        ...    ...    ...         ...           ...
375654  1003822  P00116742      M  46-50        20.0             C
375655  1003822  P00182142      M  46-50        20.0             C
375656  1003823  P00112442      M    55+         7.0             B
375657  1003823  P00182342      M    55+         7.0             B
375658  1003823        P00    NaN    NaN         NaN           NaN

       Stay_In_Current_City_Years  Marital_Status  Product_Category  Purchase
0                               2             0.0               3.0    8370.0
1                               2             0.0               1.0   15200.0
2                               2             0.0              12.0    1422.0
3                               2             0.0              12.0    1057.0
4                              4+             0.0               8.0    7969.0
...                           ...             ...               ...       ...
375654                          1             0.0              11.0    6092.0
375655                          1             0.0               1.0   11581.0
375656                          1             0.0               6.0   20414.0
375657                          1             0.0               1.0   19696.0
375658                        NaN             NaN               NaN       NaN

[375659 rows x 10 columns]


1 Defining Problem Statement and Analyzing basic metrics

The management team at Walmart Inc. wants to analyze customer purchase behavior (specifically, purchase amount) against the customer's gender and various other factors to help the business make better decisions. They want to understand whether spending habits differ between male and female customers: Do women spend more on Black Friday than men? (Assume 50 million customers are male and 50 million are female.)
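One way to approach the question above is to compare the mean purchase amount per gender together with a confidence interval based on the central limit theorem. The sketch below is illustrative only: the helper name `mean_ci_by_group`, the default `z=1.96` (roughly a 95% interval), and the synthetic data are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def mean_ci_by_group(data, group_col, value_col, z=1.96):
    """Mean of value_col per group with a CLT-based confidence interval.

    Hypothetical helper for illustration; z=1.96 gives roughly 95% coverage.
    """
    stats = data.groupby(group_col)[value_col].agg(["mean", "std", "count"])
    # Standard error of the mean: sample std divided by sqrt(sample size)
    stats["sem"] = stats["std"] / np.sqrt(stats["count"])
    stats["ci_low"] = stats["mean"] - z * stats["sem"]
    stats["ci_high"] = stats["mean"] + z * stats["sem"]
    return stats[["mean", "ci_low", "ci_high"]]


# Synthetic stand-in data; the numbers are made up and are NOT Walmart results.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Gender": ["F"] * 500 + ["M"] * 500,
    "Purchase": np.concatenate([
        rng.normal(100.0, 20.0, 500),
        rng.normal(110.0, 20.0, 500),
    ]),
})

ci = mean_ci_by_group(demo, "Gender", "Purchase")
print(ci)
```

If the two intervals do not overlap, that suggests (but does not by itself prove) a difference in average spending between the groups. On the real data the call would presumably be `mean_ci_by_group(df, 'Gender', 'Purchase')`.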

1.1 Observations on shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (if required), statistical summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375659 entries, 0 to 375658
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   User_ID                     375659 non-null  int64
 1   Product_ID                  375659 non-null  object
 2   Gender                      375658 non-null  object
 3   Age                         375658 non-null  object
 4   Occupation                  375658 non-null  float64
 5   City_Category               375658 non-null  object
 6   Stay_In_Current_City_Years  375658 non-null  object
 7   Marital_Status              375658 non-null  float64
 8   Product_Category            375658 non-null  float64
 9   Purchase                    375658 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 28.7+ MB

[177]: User_ID                       0
Product_ID                    0
Gender                        1
Age                           1
Occupation                    1
City_Category                 1
Stay_In_Current_City_Years    1
Marital_Status                1
Product_Category              1
Purchase                      1
dtype: int64

[178]: # There is only one row with NaN values, so it is simplest to drop it
df = df.dropna(axis=0)
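The missing-value check and the row drop above can be reproduced on a small frame. This is a minimal sketch with made-up rows that only borrow the dataset's column names; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: the last row is incomplete, like the final row of the dataset.
demo = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1003823],
    "Gender": ["F", "M", np.nan],
    "Purchase": [8370.0, 7969.0, np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()

# Number of rows that contain at least one missing value
rows_with_missing = int(demo.isna().any(axis=1).sum())

# Drop every row that has a missing value in any column
cleaned = demo.dropna(axis=0)

print(missing_per_column)
print(rows_with_missing, len(cleaned))
```

Counting affected rows before dropping is a cheap way to confirm that only a negligible share of the data is lost.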

       City_Category Stay_In_Current_City_Years
count         375658                     375658
unique             3                          5
top                B                          1
freq          158354                     132090
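The section heading mentions converting categorical attributes to 'category'. A minimal sketch of that conversion follows, assuming the four string-valued columns are the ones to convert; the small frame and the `categorical_cols` list are illustrative and not taken from the notebook.

```python
import pandas as pd

# Small illustrative frame reusing the dataset's column names.
demo = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "55+", "26-35", "26-35"],
    "City_Category": ["A", "C", "B", "B"],
    "Stay_In_Current_City_Years": ["2", "4+", "1", "1"],
    "Purchase": [8370.0, 7969.0, 6092.0, 11581.0],
})

# Assumption: these are the columns worth storing as 'category'.
categorical_cols = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
demo = demo.astype({col: "category" for col in categorical_cols})

# count / unique / top / freq for the categorical columns only
summary = demo.describe(include="category")

print(demo.dtypes)
print(summary)
```

Storing low-cardinality strings as `category` typically reduces memory use and makes the intent of each column explicit.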

1.2 Non-Graphical Analysis: Value counts and unique attributes

[182]: # User_ID Occupation Marital_Status Product_Category Purchase
df['User_ID'].value_counts()

[182]: 1001680    741
1004277    631
1001941    629
1001181    605
1000889    603
…
1004178      4
1005110      3
1004527      3
1002111      3
1005391      3
Name: User_ID, Length: 5891, dtype: int64

[183]: array([1000001, 1000002, 1000003, …, 1004113, 1005391, 1001529])

[184]: 4.0     49774
0.0     47660
7.0     40414
1.0     32075
17.0    27327
20.0    23071
12.0    21119
14.0    18642
2.0     18116
16.0    17207
6.0     13900
3.0     12104
10.0     8870
15.0     8256
5.0      8240
11.0
19.0
13.0
18.0
9.0
8.0
Name: Occupation, dtype: int64

[185]: df['Occupation'].unique()

[185]: array([10.0, 16.0, 15.0, 7.0, 20.0, 9.0, 1.0, 12.0, 17.0, 0.0, 3.0, 4.0, 11.0, 8.0, 19.0, 2.0, 18.0, 5.0, 14.0, 13.0, 6.0], dtype=object)

[186]: df['Marital_Status'].value_counts()

[186]: 0.0
1.0
Name: Marital_Status, dtype: int64

[187]: df['Marital_Status'].unique()

[187]: array([0.0, 1.0], dtype=object)

[188]: df['Product_Category'].value_counts()

[188]: 5.0
1.0
8.0
11.0
2.0
3.0
6.0
4.0
16.0
15.0
13.0
10.0
12.0
7.0
18.0
14.0
17.0
9.0
Name: Product_Category, dtype: int64
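The value counts above give absolute frequencies. As a small sketch on made-up data, `value_counts(normalize=True)` returns relative shares and `nunique()` returns the number of distinct values, which can be easier to compare across columns; the sample series here is an assumption for illustration.

```python
import pandas as pd

# Made-up occupation codes, for illustration only.
demo = pd.Series([4.0, 0.0, 4.0, 7.0, 4.0, 0.0], name="Occupation")

counts = demo.value_counts()                # absolute frequencies, most frequent first
shares = demo.value_counts(normalize=True)  # same ordering, as fractions of the total
n_unique = demo.nunique()                   # number of distinct values

print(counts)
print(shares)
print(n_unique)
```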

[195]: plt.figure(figsize=(10, 6))
sns.countplot(x='Gender', data=df)
plt.show()

[196]: plt.figure(figsize=(10, 6))
sns.countplot(x='Occupation', data=df)
plt.show()

[197]: plt.figure(figsize=(10, 6))
sns.countplot(x='City_Category', data=df)
plt.show()

[199]: plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Product_Category')
plt.show()

[200]: plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Stay_In_Current_City_Years')
plt.show()

[201]: plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Age')
plt.show()

[203]: sns.boxplot(data=df, x='Occupation', orient='h')
plt.show()

[204]: sns.boxplot(data=df, x='Product_Category', orient='h')
plt.show()

[205]: plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Gender', y='Purchase')
plt.show()

[207]: plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Occupation', y='Purchase')
plt.show()

[208]: plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='City_Category', y='Purchase')
plt.show()

[209]: plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Stay_In_Current_City_Years', y='Purchase')
plt.show()
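The points a box plot draws beyond its whiskers can also be identified numerically. The sketch below applies the usual 1.5 × IQR rule, which matches the default whisker length in matplotlib and seaborn; the helper name `iqr_outlier_mask` and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; k=1.5 mirrors the default box plot whiskers.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up purchase amounts with one obviously extreme value.
demo = pd.Series([10, 12, 11, 13, 12, 11, 100], name="Purchase")
mask = iqr_outlier_mask(demo)
outliers = demo[mask]

print(outliers)
```

On the real data the same mask could presumably be computed per group, for example with `df.groupby('Gender')['Purchase'].transform(iqr_outlier_mask)`, though that usage is a suggestion rather than something the notebook does.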

For correlation: Heatmaps, Pairplots

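[212]: plt.figure(figsize=(10, 6))
# numeric_only=True restricts the correlation to numeric columns; recent pandas versions raise an error on string columns otherwise
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, cmap='coolwarm')
plt.show()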

[213]: sns.pairplot(df, hue='Gender')
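Correlation is only defined for numeric columns, so it can help to select those explicitly before computing the matrix that a heatmap would display. A minimal sketch on made-up data, where the frame and its values are assumptions for illustration:

```python
import pandas as pd

# Made-up rows reusing three of the dataset's column names.
demo = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M"],
    "Occupation": [10.0, 16.0, 20.0, 7.0, 7.0],
    "Purchase": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Keep only numeric columns; 'Gender' is a string column and is excluded.
numeric = demo.select_dtypes(include="number")

# Pearson correlation matrix of the numeric columns
corr = numeric.corr()

print(corr)
```

Note that codes such as Occupation or Product_Category are stored as numbers but are really labels, so their correlation coefficients should be read with caution.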