Codied To Clipboard !
Home > Notes > Python Pandas
Any groupby operation involves one of the following operations on the original object. They are − Splitting the Object Applying a function Combining the results In many situations, we split the data into sets and we apply some functionality on each subset. In the apply functionality, we can perform the following operations − Aggregation − computing a summary statistic Transformation − perform some group-specific operation Filtration − discarding the data with some condition Let us now create a DataFrame object and perform all the operations on it −
#import the pandas library import pandas as pd print df ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data)
Pandas object can be split into any of their objects. There are multiple ways to split an object like − obj.groupby(‘key’) obj.groupby([‘key1′,’key2’]) obj.groupby(key,axis=1) Let us now see how the grouping objects can be applied to the DataFrame object
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team')
View Groups Example is shown below.
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').groups
Group by with multiple Column Example is shown below.
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby(['Team','Year']).groups
With the groupby object in hand, we can iterate through the object similar to itertools.obj.
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped: print name print group
Using the get_group() method, we can select a single group.
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped.get_group(2014)
An aggregated function returns a single aggregated value for each group. Once the group by object is created, several aggregation operations can be performed on the grouped data. An obvious one is aggregation via the aggregate or equivalent agg method −
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data)
Any groupby operation involves one of the following operations on the original object. They are − Splitting the Object Applying a function Combining the results In many situations, we split the data into sets and we apply some functionality on each subset. In the apply functionality, we can perform the following operations − Aggregation − computing a summary statistic Transformation − perform some group-specific operation Filtration − discarding the data with some condition Let us now create a DataFrame object and perform all the operations on it −
#import the pandas library import pandas as pd print df ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped['Points'].agg(np.mean)
Another way to see the size of each group is by applying the size() function −
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) Attribute Access in Python Pandas grouped = df.groupby('Team') print grouped.agg(np.size)
With grouped Series, you can also pass a list or dict of functions to do aggregation with, and generate DataFrame as output
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') print grouped['Points'].agg([np.sum, np.mean, np.std])
Transformation on a group or a column returns an object that is indexed the same size of that is being grouped. Thus, the transform should return a result that is the same size as that of a group chunk.
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print grouped.transform(score)
Filtration filters the data on a defined criteria and returns the subset of data. The filter() function is used to filter the data.
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').filter(lambda x: len(x) >= 3)
groupby Any groupby operation involves one of the following operations on the original object. They are − Splitting the Object Applying a function Combining the results In many situations, we split the data into sets and we apply some functionality on each subset. In the apply functionality, we can perform the following operations − Aggregation − computing a summary statistic Transformation − perform some group-specific operation Filtration − discarding the data with some condition Let us now create a DataFrame object and perform all the operations on it −
#import the pandas library import pandas as pd print df ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data)
Pandas object can be split into any of their objects. There are multiple ways to split an object like − obj.groupby(‘key’) obj.groupby([‘key1′,’key2’]) obj.groupby(key,axis=1) Let us now see how the grouping objects can be applied to the DataFrame object
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team')
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').groups
Group by with multiple Column Example is shown below.
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby(['Team','Year']).groups
With the groupby object in hand, we can iterate through the object similar to itertools.obj.
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped: print name print group
Using the get_group() method, we can select a single group
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped.get_group(2014)
An aggregated function returns a single aggregated value for each group. Once the group by object is created, several aggregation operations can be performed on the grouped data. An obvious one is aggregation via the aggregate or equivalent agg method −
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped['Points'].agg(np.mean)
With grouped Series, you can also pass a list or dict of functions to do aggregation with, and generate DataFrame as output
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') print grouped['Points'].agg([np.sum, np.mean, np.std])
Transformation on a group or a column returns an object that is indexed the same size of that is being grouped. Thus, the transform should return a result that is the same size as that of a group chunk.
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print grouped.transform(score)
Filtration filters the data on a defined criteria and returns the subset of data. The filter() function is used to filter the data.
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015, 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').filter(lambda x: len(x) >= 3)