Analytics on the case study, Thesis of Integrated Case Studies

Indian Institute of Science Integrated Case Studies

This thesis analyzes the Walmart dataset. The analysis covers confidence intervals and the central limit theorem, grounded in probability and statistics, and includes insights and observations.

Typology: Thesis

2022/2023

Uploaded on 03/24/2023

18h51a04h4-peshimaam-mohammed-muzam 🇮🇳


ohd5udrk2

March 24, 2023

[174]: import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

[175]: df = pd.read_csv('walmart_data.csv')

df

[175]: User_ID Product_ID Gender Age Occupation City_Category \

0 1000001 P00069042 F 0-17 10.0 A

1 1000001 P00248942 F 0-17 10.0 A

2 1000001 P00087842 F 0-17 10.0 A

3 1000001 P00085442 F 0-17 10.0 A

4 1000002 P00285442 M 55+ 16.0 C

… … … … … … …

375654 1003822 P00116742 M 46-50 20.0 C

375655 1003822 P00182142 M 46-50 20.0 C

375656 1003823 P00112442 M 55+ 7.0 B

375657 1003823 P00182342 M 55+ 7.0 B

375658 1003823 P00 NaN NaN NaN NaN

Stay_In_Current_City_Years Marital_Status Product_Category Purchase

0 2 0.0 3.0 8370.0

1 2 0.0 1.0 15200.0

2 2 0.0 12.0 1422.0

3 2 0.0 12.0 1057.0

4 4+ 0.0 8.0 7969.0

… … … … …

375654 1 0.0 11.0 6092.0

375655 1 0.0 1.0 11581.0

375656 1 0.0 6.0 20414.0

375657 1 0.0 1.0 19696.0

375658 NaN NaN NaN NaN

[375659 rows x 10 columns]


1 Defining Problem Statement and Analyzing basic metrics

The Management team at Walmart Inc. wants to analyze the customer purchase behavior (specifically, purchase amount) against the customer's gender and the various other factors to help the business make better decisions. They want to understand if the spending habits differ between male and female customers: Do women spend more on Black Friday than men? (Assume 50 million customers are male and 50 million are female).
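One way to approach the question above is to compare confidence intervals for the mean purchase amount of each gender, relying on the central limit theorem. The following is a minimal sketch under stated assumptions: the helper `mean_ci` and the frame `demo` are hypothetical, and the purchase values are synthetic, generated only so the example runs without `walmart_data.csv`.

```python
import numpy as np
import pandas as pd


def mean_ci(values, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% CLT-based interval."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return mean, mean - z * sem, mean + z * sem


# Synthetic stand-in for the real data; these numbers are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Gender": ["F"] * 500 + ["M"] * 500,
    "Purchase": np.concatenate([
        rng.gamma(shape=2.0, scale=4000.0, size=500),
        rng.gamma(shape=2.0, scale=4500.0, size=500),
    ]),
})

# One interval per gender; non-overlapping intervals suggest a real difference in means.
ci_by_gender = {g: mean_ci(grp["Purchase"]) for g, grp in demo.groupby("Gender")}
for gender, (mean, low, high) in ci_by_gender.items():
    print(f"{gender}: mean={mean:.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

With the real data, `demo` would be replaced by `df`. The normal approximation is reasonable here because each group contains many rows, even though individual purchase amounts are not normally distributed.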

1.1 Observations on shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), statistical summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375659 entries, 0 to 375658
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype
 0   User_ID                     375659 non-null  int64
 1   Product_ID                  375659 non-null  object
 2   Gender                      375658 non-null  object
 3   Age                         375658 non-null  object
 4   Occupation                  375658 non-null  float64
 5   City_Category               375658 non-null  object
 6   Stay_In_Current_City_Years  375658 non-null  object
 7   Marital_Status              375658 non-null  float64
 8   Product_Category            375658 non-null  float64
 9   Purchase                    375658 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 28.7+ MB

[177]: User_ID                       0
       Product_ID                    0
       Gender                        1
       Age                           1
       Occupation                    1
       City_Category                 1
       Stay_In_Current_City_Years    1
       Marital_Status                1
       Product_Category              1
       Purchase                      1
       dtype: int64

[178]: # There is only one row with NaN values, so it is simplest to drop it
       df = df.dropna(axis = 0)

       City_Category Stay_In_Current_City_Years
count         375658                     375658
unique             3                          5
top                B                          1
freq          158354                     132090
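The section heading mentions converting categorical attributes to 'category', but the preview does not show that step. A minimal sketch of how it might look, assuming the four object columns below are the ones treated as categorical; `demo` and `categorical_cols` are hypothetical names, and the rows are a tiny stand-in built from values visible in the preview.

```python
import pandas as pd

# Tiny stand-in frame using the dataset's column names.
demo = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "55+", "46-50", "0-17"],
    "City_Category": ["A", "C", "C", "A"],
    "Stay_In_Current_City_Years": ["2", "4+", "1", "2"],
    "Purchase": [8370.0, 7969.0, 6092.0, 15200.0],
})

# Assumed list of categorical attributes; adjust to the columns actually needed.
categorical_cols = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
demo[categorical_cols] = demo[categorical_cols].astype("category")

print(demo.dtypes)
# Statistical summary restricted to the categorical columns (count, unique, top, freq).
print(demo.describe(include="category"))
```

Storing low-cardinality string columns as 'category' typically reduces memory usage compared with the 'object' dtype.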

1.2 Non-Graphical Analysis: Value counts and unique attributes

[182]: # User_ID Occupation Marital_Status Product_Category Purchase
       df['User_ID'].value_counts()

[182]: 1001680    741
       1004277    631
       1001941    629
       1001181    605
       1000889    603
       …
       1004178      4
       1005110      3
       1004527      3
       1002111      3
       1005391      3
       Name: User_ID, Length: 5891, dtype: int64

[183]: array([1000001, 1000002, 1000003, …, 1004113, 1005391, 1001529])

[184]: 4.0     49774
       0.0     47660
       7.0     40414
       1.0     32075
       17.0    27327
       20.0    23071
       12.0    21119
       14.0    18642
       2.0     18116
       16.0    17207
       6.0     13900
       3.0     12104
       10.0     8870
       15.0     8256
       5.0      8240
       11.0
       19.0
       13.0
       18.0
       9.0
       8.0
       Name: Occupation, dtype: int64

[185]: df['Occupation'].unique()

[185]: array([10.0, 16.0, 15.0, 7.0, 20.0, 9.0, 1.0, 12.0, 17.0, 0.0, 3.0, 4.0, 11.0, 8.0, 19.0, 2.0, 18.0, 5.0, 14.0, 13.0, 6.0], dtype=object)

[186]: df['Marital_Status'].value_counts()

[186]: 0.0
       1.0
       Name: Marital_Status, dtype: int64

[187]: df['Marital_Status'].unique()

[187]: array([0.0, 1.0], dtype=object)

[188]: df['Product_Category'].value_counts()

[188]: 5.0
       1.0
       8.0
       11.0
       2.0
       3.0
       6.0
       4.0
       16.0
       15.0
       13.0
       10.0
       12.0
       7.0
       18.0
       14.0
       17.0
       9.0
       Name: Product_Category, dtype: int64
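The per-column value counts above can also be gathered in a single loop, with relative shares next to the raw counts. This is a sketch rather than the notebook's own code: `demo` and `summaries` are hypothetical names, and the rows are a small made-up sample using the dataset's column names.

```python
import pandas as pd

# Small made-up sample; the real notebook would use df instead.
demo = pd.DataFrame({
    "Occupation": [10.0, 10.0, 16.0, 7.0, 7.0, 7.0],
    "Marital_Status": [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    "Product_Category": [3.0, 1.0, 12.0, 12.0, 8.0, 1.0],
})

summaries = {}
for col in demo.columns:
    counts = demo[col].value_counts()
    # normalize=True returns the fraction of rows per value instead of the count.
    shares = demo[col].value_counts(normalize=True).round(3)
    summaries[col] = pd.DataFrame({"count": counts, "share": shares})
    print(f"\n{col} ({demo[col].nunique()} unique values)")
    print(summaries[col])
```

Shares make it easier to compare columns with very different numbers of distinct values, such as `Marital_Status` and `Product_Category`.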

[195]: plt.figure(figsize = (10, 6))
       sns.countplot(x = 'Gender', data = df)
       plt.show()

[196]: plt.figure(figsize = (10, 6))
       sns.countplot(x = 'Occupation', data = df)
       plt.show()

[197]: plt.figure(figsize = (10, 6))
       sns.countplot(x = 'City_Category', data = df)
       plt.show()

[199]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Product_Category')
       plt.show()

[200]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Stay_In_Current_City_Years')
       plt.show()

[201]: plt.figure(figsize=(10, 8))
       sns.countplot(data=df, x='Age')
       plt.show()

[203]: sns.boxplot(data=df, x='Occupation', orient='h')
       plt.show()

[204]: sns.boxplot(data=df, x='Product_Category', orient='h')
       plt.show()

[205]: plt.figure(figsize = (10, 6))
       sns.boxplot(data = df, x = 'Gender', y = 'Purchase')
       plt.show()

[207]: plt.figure(figsize = (10, 6))
       sns.boxplot(data = df, x = 'Occupation', y = 'Purchase')
       plt.show()

[208]: plt.figure(figsize = (10, 6))
       sns.boxplot(data = df, x = 'City_Category', y = 'Purchase')
       plt.show()

[209]: plt.figure(figsize = (10, 6))
       sns.boxplot(data = df, x = 'Stay_In_Current_City_Years', y = 'Purchase')
       plt.show()
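The boxplots above flag outliers visually. With seaborn's default whisker setting, points lying more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers, and the same rule can be applied numerically. A minimal sketch, assuming that default: the helper `iqr_outliers` is hypothetical, and the final value in `purchases` (90000.0) is invented to give the rule something to catch.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Purchase amounts from the preview rows, plus one invented extreme value.
purchases = pd.Series([8370.0, 15200.0, 1422.0, 1057.0, 7969.0,
                       6092.0, 11581.0, 20414.0, 19696.0, 90000.0])

outliers = iqr_outliers(purchases)
print(outliers)
```

On the real data this could be applied per group, for example with `df.groupby('Gender')['Purchase'].apply(iqr_outliers)`, to count the outliers behind each box.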

For correlation: Heatmaps, Pairplots

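[212]: plt.figure(figsize = (10, 6))
       # numeric_only=True (pandas 1.5+) skips the object columns such as Gender and Age
       sns.heatmap(df.corr(numeric_only = True), annot = True, linewidths = 0.5, cmap = 'coolwarm')
       plt.show()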

[213]: sns.pairplot(df, hue = 'Gender')
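Correlation coefficients are only defined for numeric columns, so it can help to select those explicitly before building the matrix that feeds the heatmap. A small sketch under that assumption: `demo` is a hypothetical stand-in built from a few preview rows, not the full dataset.

```python
import pandas as pd

# Stand-in frame mixing one object column with numeric ones.
demo = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "Occupation": [10.0, 10.0, 16.0, 7.0, 20.0],
    "Product_Category": [3.0, 1.0, 8.0, 6.0, 11.0],
    "Purchase": [8370.0, 15200.0, 7969.0, 20414.0, 6092.0],
})

# Keep only numeric columns; Gender is dropped because it has dtype object.
numeric = demo.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default
print(corr.round(2))
```

The resulting `corr` frame can be passed directly to `sns.heatmap`. Note that columns such as `Occupation` and `Product_Category` are numeric codes for categories, so their correlations with `Purchase` should be read with caution.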
