Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

data aggregation in python, Summaries of Programming Languages

the file briefly explains the use of data aggregation in python programming

Typology: Summaries

2020/2021

Uploaded on 06/18/2023

AvaniPrithiani
AvaniPrithiani 🇮🇳

2 documents

1 / 12

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Python | Pandas dataframe.aggregate()
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.
Dataframe.aggregate() function is used to apply some aggregation across one or more
column. Aggregate using callable, string, dict, or list of string/callables. Most frequently used
aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis
Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)
Parameters:
func : callable, string, dictionary, or list of string/callables. Function to use for aggregating
the data. If a function, must either work when passed a DataFrame or when passed to
DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column
names.
axis : (default 0) {0 or ‘index’, 1 or ‘columns’} 0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
Returns: Aggregated DataFrame
For link to CSV file Used in Code, click here
Example #1: Aggregate ‘sum’ and ‘min’ function across all the columns in data frame.
# importing pandas package
import pandas as pd
# making data frame from csv file
df = pd.read_csv("nba.csv")
# printing the first 10 rows of the dataframe
df[:10]
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download data aggregation in python and more Summaries Programming Languages in PDF only on Docsity!

Python | Pandas dataframe.aggregate()

Python is a great language for doing data analysis, primarily because of the fantastic

ecosystem of data-centric Python packages. Pandas is one of those packages and makes

importing and analyzing data much easier.

Dataframe.aggregate() function is used to apply some aggregation across one or more

column. Aggregate using callable, string, dict, or list of string/callables. Most frequently used

aggregations are:

sum: Return the sum of the values for the requested axis

min: Return the minimum of the values for the requested axis

max: Return the maximum of the values for the requested axis

Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)

Parameters:

func : callable, string, dictionary, or list of string/callables. Function to use for aggregating

the data. If a function, must either work when passed a DataFrame or when passed to

DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column

names.

axis : (default 0) {0 or ‘index’, 1 or ‘columns’} 0 or ‘index’: apply function to each column.

1 or ‘columns’: apply function to each row.

Returns: Aggregated DataFrame

For link to CSV file Used in Code, click here

Example #1: Aggregate ‘sum’ and ‘min’ function across all the columns in data frame.

importing pandas package

import pandas as pd

making data frame from csv file

df = pd.read_csv("nba.csv")

printing the first 10 rows of the dataframe

df[:10]

Aggregation works with only numeric type columns.

Applying aggregation across all the columns

sum and min will be found for each

numeric type column in df dataframe

df.aggregate(['sum', 'min'])

Output:

For each column which are having numeric values, minimum and sum of all values has been

found. For dataframe df , we have four such columns Number, Age, Weight, Salary.

Example #2:

In Pandas, we can also apply different aggregation functions across different columns. For

that, we need to pass a dictionary with key containing the column names and values

containing the list of aggregation functions for any specific column.

importing pandas package

import pandas as pd

making data frame from csv file

df = pd.read_csv("nba.csv")

We are going to find aggregation for these columns

df.aggregate({"Number":['sum', 'min'], "Age":['max', 'min'], "Weight":['min', 'sum'], "Salary":['sum']})

df = pd.read_csv("nba.csv")

Print the dataframe

df

Now apply the groupby() function.

applying groupby() function to

group the data on team value.

gk = df.groupby('Team')

Let's print the first entries

in all the groups formed.

gk.first()

Output :

Let’s print the value contained any one of group. For that use the name of the team. We use

the function get_group() to find the entries contained in any of the groups.

Finding the values contained in the "Boston Celtics" group

gk.get_group('Boston Celtics')

Output :

Example #2: Use groupby() function to form groups based on more than one category (i.e.

Use more than one column to perform the splitting).

importing pandas as pd

import pandas as pd

Creating the dataframe

df = pd.read_csv("nba.csv")

First grouping based on "Team"

Within each team we are grouping based on "Position"

gkk = df.groupby(['Team', 'Position'])

Print the first value in each group

gkk.first()

import pandas as pd import numpy as np

Create a sample dataframe

df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"

Calculate mean data of gender groups

df.groupby('gender').mean().add_prefix('mean_')

Output:

Example 2: Performing multiple aggregate operations using the aggregate function

( DataFrameGroupBy.agg ) which accepts a string, function or a list of functions.

Import required libraries

import pandas as pd import numpy as np

Create a sample dataframe

df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)})

df.index.name = "emp_id"

Calculate min, max, mean and count of salaries

in different departments for males and females

df.groupby(['dept', 'gender'])['salary'].agg(["min", "max", "mean", "count"])

Output:

Example 3: Specifying multiple columns and their corresponding aggregate operations as

follows.

Import required libraries

import pandas as pd import numpy as np

Create a sample dataframe

df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"

Calculate mean salaries and min-max age of employees

in different departments for gender groups

Create bins or groups and apply operations

The cut method of Pandas sorts values into bin intervals creating groups or categories.

Aggregation or other functions can then be performed on these groups. Implementation of

this is shown below:

Example : Age is divided into age ranges and the count of observations in the sample data is

calculated.

Import required libraries

import pandas as pd import numpy as np

Create a sample dataframe

df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"

Create bin intervals

bins = [20, 30, 45, 60]

Segregate ages into bins of age groups

df['categories'] = pd.cut(df['age'], bins, labels=['Young', 'Middle', 'Old'])

Calculate number of observations in each age category

df['age'].groupby(df['categories']).count()

Output:

Transformation

Transformation is performing a group-specific operation where the individual values are

changed while the shape of the data remains same. We use the transform() function to do so.

Example :

Import required libraries

import pandas as pd import numpy as np

Create a sample dataframe

df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"

Calculate mean difference by transforming each salary value

df['mean_sal_diff'] = df['salary'].groupby( df['dept']).transform(lambda x: x - x.mean()) df.head()

Output: