Last updated: Apr 12, 2024
Reading time · 5 min
To find the length of the longest string in a DataFrame column:

1. Use the str.len() method to get the length of each value.
2. Call the max() method on the result.

    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    print(df)

    print('-' * 50)

    print(df['A'].str.len().max())  # 3

    print('-' * 50)

    print(df['B'].str.len().max())  # 4
Running the code sample produces the following output.
         A     B
    0    A    BC
    1   AB   BCD
    2  ABC  BCDE
    --------------------------------------------------
    3
    --------------------------------------------------
    4
We used bracket notation to get the column for which we want to find the max length.
The str.len() method computes the length of each value in the column.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    # 0    1
    # 1    2
    # 2    3
    # Name: A, dtype: int64
    print(df['A'].str.len())
The last step is to use the Series.max() method to get the max length.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    #      A     B
    # 0    A    BC
    # 1   AB   BCD
    # 2  ABC  BCDE
    print(df)

    print('-' * 50)

    print(df['A'].str.len().max())  # 3

    print('-' * 50)

    print(df['B'].str.len().max())  # 4
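Real columns often contain missing values. As a minimal sketch (the sample data below is made up for illustration), str.len() returns NaN for a missing entry and max() skips it by default, so the result comes back as a float and can be converted with int().

```python
import pandas as pd

# Hypothetical sample data with one missing value
df = pd.DataFrame({'A': ['A', None, 'ABC']})

# The missing entry has no length, so it shows up as NaN
lengths = df['A'].str.len()

# max() skips missing values by default; int() converts the float result
longest = int(lengths.max())
print(longest)  # 3
```

If every value in the column is missing, max() returns NaN and the int() call raises an error, so that case needs separate handling.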
Use the max() function if you need to find the longest string in a DataFrame column.
The function can be passed a key argument that specifies a one-argument ordering function like the one used for list.sort().
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    #      A     B
    # 0    A    BC
    # 1   AB   BCD
    # 2  ABC  BCDE
    print(df)

    print('-' * 50)

    print(max(df['A'], key=len))  # ABC

    print('-' * 50)

    print(max(df['B'], key=len))  # BCDE
The max() function returns the largest item in an iterable or the largest of two or more arguments.
We set the key argument to the len() function to compare the lengths of the strings in the given column.
The len() function returns the length (the number of items) of an object.
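Two details are worth keeping in mind with this approach: max() returns the first of several equally long strings, and len() raises a TypeError for missing values. The sketch below uses made-up sample data to show one way to handle both, by dropping missing values first and by selecting every string that shares the maximum length.

```python
import pandas as pd

# Hypothetical sample data: two strings tie for the longest, one value is missing
df = pd.DataFrame({'A': ['AB', 'ABC', 'XYZ', None]})

# dropna() removes missing entries so len() is only called on strings
longest = max(df['A'].dropna(), key=len)
print(longest)  # ABC (the first of the tied strings)

# A boolean mask selects every string whose length equals the maximum
lengths = df['A'].str.len()
all_longest = df.loc[lengths == lengths.max(), 'A'].tolist()
print(all_longest)  # ['ABC', 'XYZ']
```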
You can also use the map() method to find the length of the longest string in a DataFrame column.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    print(df)

    print('-' * 50)

    print(df['A'].map(len).max())  # 3

    print('-' * 50)

    print(df['B'].map(len).max())  # 4

    print('-' * 50)
Running the code sample produces the following output.
         A     B
    0    A    BC
    1   AB   BCD
    2  ABC  BCDE
    --------------------------------------------------
    3
    --------------------------------------------------
    4
    --------------------------------------------------
The Series.map() method calls the supplied function with each value in the Series.
We passed the len function to map(), so the length of each string in the column is returned.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    # 0    1
    # 1    2
    # 2    3
    # Name: A, dtype: int64
    print(df['A'].map(len))
The last step is to call the max() method on the result.
    print(df['A'].map(len).max())  # 3
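Because map(len) calls len() on every value, it only works when every value supports len(). As a small sketch with a made-up integer column, converting the values to strings first with astype(str) lets the same pattern count digits.

```python
import pandas as pd

# Hypothetical integer column; len() would raise a TypeError on these values
df = pd.DataFrame({'C': [1, 12, 12345]})

# Convert each value to a string before measuring its length
digits = df['C'].astype(str).map(len)

print(digits.tolist())  # [1, 2, 5]
print(digits.max())     # 5
```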
In some cases, you might want to get the length of the longest string in bytes.
This is useful if the strings in your DataFrame contain non-ASCII characters that are represented by multiple bytes.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABCä'],
        'B': ['BC', 'BCD', 'BCDEö'],
    })

    # 5
    print(int(df['A'].str.encode('utf-8').str.len().max()))

    # 6
    print(int(df['B'].str.encode('utf-8').str.len().max()))
The str.encode() method returns an encoded version of the string as a bytes object.
We used the utf-8 encoding in the example because some of the strings in the DataFrame contain umlauts and they cannot be handled using the ASCII encoding.
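To make the difference concrete, the following sketch (with made-up sample strings) compares the character length and the UTF-8 byte length of the same values. The two only differ for the string that contains a non-ASCII character.

```python
import pandas as pd

# Hypothetical sample data: the second string contains an umlaut
s = pd.Series(['ABC', 'ABCä'])

char_lengths = s.str.len()
byte_lengths = s.str.encode('utf-8').str.len()

print(char_lengths.tolist())  # [3, 4]
print(byte_lengths.tolist())  # [3, 5]
```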
If you need to get the index of the longest string in a DataFrame column, use the Series.idxmax() method.
    import pandas as pd

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
    })

    longest_index = df['A'].str.len().idxmax()
    print(longest_index)  # 2

    print(df['A'][longest_index])  # ABC
The idxmax() method returns the index label of the first occurrence of the maximum value.
By default, the axis argument is set to 0. For a Series, the argument is unused and exists only for compatibility with DataFrame.idxmax().
Notice that the default index is zero-based, so the index of the first value in the column is 0 and the index of the third is 2.
Once you have the index of the longest string in the column, use bracket notation to get the corresponding value.
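Since idxmax() returns an index label rather than a position, the result is not necessarily an integer. The sketch below assumes a DataFrame with made-up string labels and uses DataFrame.loc for a label-based lookup.

```python
import pandas as pd

# Hypothetical data with string labels instead of the default 0, 1, 2 index
df = pd.DataFrame({'A': ['A', 'ABC', 'AB']}, index=['x', 'y', 'z'])

label = df['A'].str.len().idxmax()
print(label)  # y

# .loc selects by label, so it works with any kind of index
print(df.loc[label, 'A'])  # ABC
```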
If you need to get the maximum length of each column in a DataFrame, use the numpy.vectorize() method.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
        'C': [1, 12, 12345],
    })

    vfunc = np.vectorize(len)

    result = vfunc(df.values.astype(str)).max(axis=0)

    print(result)  # [3 4 5]
The numpy.vectorize() method returns an object that acts as a function but takes arrays as input.
Make sure you have the numpy module installed to be able to run the code sample.
    pip install numpy

    # or with pip3
    pip3 install numpy
The code sample shows how to find the maximum length of each column.
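If you would rather not call NumPy directly, a similar result can be sketched with pandas alone. This is an alternative to the numpy.vectorize() approach, not a replacement for it: every value is converted to a string, and str.len().max() is applied to each column.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'AB', 'ABC'],
    'B': ['BC', 'BCD', 'BCDE'],
    'C': [1, 12, 12345],
})

# Convert every value to a string, then take the longest length per column
max_lengths = df.astype(str).apply(lambda col: col.str.len().max())

print(max_lengths.index.tolist())  # ['A', 'B', 'C']
print(max_lengths.tolist())        # [3, 4, 5]
```

The result is a Series indexed by column name, so each length stays paired with its column.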
If you only need the maximum length of the object columns, use the DataFrame.select_dtypes() method.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
        'C': [1, 12, 12345],
    })

    vfunc = np.vectorize(len)

    result = vfunc(
        df.select_dtypes(include=[object]).values.astype(str)
    ).max(axis=0)

    print(result)  # [3 4]
Notice that the max length of the int column ("C") is not included in the result.
If you need to construct a mapping that contains the column names and the maximum lengths, use the dict class.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
        'C': [1, 12, 12345],
    })

    vfunc = np.vectorize(len)

    max_lengths = vfunc(df.values.astype(str)).max(axis=0)

    # tolist() converts the NumPy integers to plain Python ints
    a_dict = dict(zip(df, max_lengths.tolist()))

    # {'A': 3, 'B': 4, 'C': 5}
    print(a_dict)
The zip function iterates over several iterables in parallel and produces tuples with an item from each iterable.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': ['A', 'AB', 'ABC'],
        'B': ['BC', 'BCD', 'BCDE'],
        'C': [1, 12, 12345],
    })

    vfunc = np.vectorize(len)

    max_lengths = vfunc(df.values.astype(str)).max(axis=0)

    # tolist() converts the NumPy integers to plain Python ints
    # [('A', 3), ('B', 4), ('C', 5)]
    print(list(zip(df, max_lengths.tolist())))
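The same mapping can also be sketched without NumPy by using a dict comprehension. This is one possible alternative, relying on the fact that iterating over a DataFrame yields its column names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'AB', 'ABC'],
    'B': ['BC', 'BCD', 'BCDE'],
    'C': [1, 12, 12345],
})

# Iterating over a DataFrame yields its column names;
# int() converts each maximum to a plain Python integer
max_by_column = {
    col: int(df[col].astype(str).str.len().max())
    for col in df
}

print(max_by_column)  # {'A': 3, 'B': 4, 'C': 5}
```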