Last updated: Apr 12, 2024
Reading time·5 min
To get the first row of each group in a Pandas DataFrame
:
DataFrame.groupby()
method to group the DataFrame
.DataFrameGroupBy.first
method to get the first non-null entry of
each column.import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Max Speed # Animal # Cat 25 # Dog 45 print(df.groupby('Animal').first())
The
DataFrame.groupby()
method groups a DataFrame
using one or more columns.
The method returns a DataFrameGroupBy
object that contains information about
the groups.
The last step is to call the DataFrameGroupBy.first() method.
The method computes the first non-null entry of each column.
first()
method returns a Series
or a DataFrame
containing the first non-null value within each group.If you need to reset the index, call the DataFrame.reset_index method.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 0 Cat 25 # 1 Dog 45 print(df.groupby('Animal').first().reset_index())
The DataFrame.reset_index()
method resets the index of the DataFrame
, so the
default index is used.
If you need to get the first N rows of each group in a Pandas DataFrame:
DataFrame.head()
method to return the first N rows of each group.import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 0 Cat 25 # 1 Cat 35 # 2 Dog 45 # 3 Dog 55 print(df.groupby('Animal').head(2).reset_index(drop=True))
The
DataFrame.head()
method returns the first N rows of the DataFrame
based on position.
The only argument the method takes is the number of rows to be selected.
We also called the reset_index
method with drop=True
in the example.
When the drop
argument is set to True
, the index is reset to the default
integer index.
The argument defaults to False
.
However, calling the reset_index()
method on the result is optional.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 0 Cat 25 # 1 Cat 35 # 3 Dog 45 # 4 Dog 55 print(df.groupby('Animal').head(2))
nth()
You can also use the
DataFrameGroupBy.nth
method to get the first row of each group in a DataFrame
.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 0 Cat 25 # 3 Dog 45 print(df.groupby('Animal').nth(0))
The DataFrameGroupBy.nth()
method returns the Nth row from each group.
The only parameter we passed to the method is the row index to be returned.
Notice that indices are zero-based, so the first row has an index of 0
, the
second an index of 1
, etc.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 1 Cat 35 # 4 Dog 55 print(df.groupby('Animal').nth(1))
We used an index of 1
, so the second row of each group in the DataFrame
is
returned.
first()
vs using nth(0)
to get the first row of each groupNote that there are some minor differences between using first()
and nth(0)
to get the first row of each group.
The nth(0)
approach returns the first row of each group regardless of what the
values in the row are.
On the other hand, the first()
method returns the first non-null (or
non-NaN) value in each column.
Here is an example that better illustrates how this works.
import pandas as pd import numpy as np df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [np.nan, 35, 40, 45, 55, 65] }) # Max Speed # Animal # Cat 35.0 # Dog 45.0 print(df.groupby('Animal').first()) print('-' * 50) # Animal Max Speed # 0 Cat NaN # 3 Dog 45.0 print(df.groupby('Animal').nth(0))
The first row in the Max Speed
column is a NaN
value.
Calling the first()
method returned the first not-NaN row of each group.
On the other hand, calling nth(0)
simply returned the first row of each group,
without checking for NaN
.
drop_duplicates()
You can also use the
DataFrame.drop_duplicates()
method to get the first row of each column in a Pandas DataFrame
.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) # Animal Max Speed # 0 Cat 25 # 3 Dog 45 print(df.drop_duplicates('Animal'))
The drop_duplicates()
method returns a DataFrame
object with the duplicate
rows removed.
By default, all of the columns are considered when identifying duplicates.
If you want to modify the original DataFrame
to only contain the first row of
each group, set the inplace
argument to True
.
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 35, 40, 45, 55, 65] }) df.drop_duplicates('Animal', inplace=True) # Animal Max Speed # 0 Cat 25 # 3 Dog 45 print(df)
When the inplace
argument is set to True
, the method returns None
and
modifies the original DataFrame
rather than creating a new one.
You can learn more about the related topics by checking out the following tutorials: