Last updated: Apr 12, 2024
Use the numpy.array_split() method to split a DataFrame into chunks.
The method takes the DataFrame and the number of chunks as parameters and
splits the DataFrame into that many chunks.
First, make sure that you've installed the numpy and pandas modules.
pip install numpy pandas

# or with pip3
pip3 install numpy pandas
Now, import and use the module as follows.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

list_of_dataframes = np.array_split(df, 2)

print(list_of_dataframes)
Running the code sample produces the following output.
[  name  experience  salary
0    A           1   175.1
1    B           1   180.2
2    C           5   190.3,   name  experience  salary
3    D           7   205.4
4    E           7   210.5
5    F          10   225.3]
The numpy.array_split() method splits an array (or a DataFrame) into
multiple sub-arrays.
import numpy as np

list_of_dataframes = np.array_split(df, 2)
The first argument we passed to the method is the DataFrame
and the second is
the number of chunks we want to get in the resulting list.
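The number of rows does not have to be divisible by the number of chunks. As a minimal sketch of how numpy.array_split() distributes a remainder, the example below splits the row positions of a hypothetical 7-row DataFrame into 3 groups and selects each group with DataFrame.iloc. The 7-row data and the positions and chunk_sizes names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 7-row DataFrame, so the rows cannot be divided evenly by 3
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'experience': [1, 1, 5, 7, 7, 10, 12],
})

# Split the row positions (a plain NumPy array) into 3 groups
positions = np.array_split(np.arange(len(df)), 3)

# Select the rows of each group by integer position
chunks = [df.iloc[group] for group in positions]

chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [3, 2, 2]
```

The leftover row goes to the first chunk, so the chunk sizes never differ by more than one.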
The method returns a list of DataFrames, so you can access a specific
DataFrame
at an index.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

list_of_dataframes = np.array_split(df, 2)

#   name  experience  salary
# 0    A           1   175.1
# 1    B           1   180.2
# 2    C           5   190.3
print(list_of_dataframes[0])
The code sample accesses the first DataFrame (index 0).
You can also use a for loop to iterate over
the list of DataFrame
chunks.
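Looping is most useful when each chunk needs its own processing step. The sketch below computes a per-chunk average as one possible example. To stay self-contained it builds the two chunks with plain DataFrame.iloc slices, and the mean_salaries name is an assumption for illustration. A list returned by numpy.array_split() could be looped over in the same way.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

# Two chunks of 3 rows each, built by slicing on integer position
chunks = [df.iloc[0:3], df.iloc[3:6]]

mean_salaries = []

# enumerate() numbers the chunks, starting from 1
for number, chunk in enumerate(chunks, start=1):
    mean_salary = chunk['salary'].mean()
    mean_salaries.append(mean_salary)
    print(f'chunk {number}: {len(chunk)} rows, mean salary {mean_salary:.2f}')
```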
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

list_of_dataframes = np.array_split(df, 2)

for chunk in list_of_dataframes:
    print(chunk)
    print('-' * 50)
Running the code sample produces the following output.
  name  experience  salary
0    A           1   175.1
1    B           1   180.2
2    C           5   190.3
--------------------------------------------------
  name  experience  salary
3    D           7   205.4
4    E           7   210.5
5    F          10   225.3
--------------------------------------------------
If you need to split the DataFrame into N chunks, make sure to pass N as the
second argument to numpy.array_split().
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

list_of_dataframes = np.array_split(df, 3)

print(list_of_dataframes)
The code sample splits the DataFrame
into 3 chunks.
Running the code sample produces the following output.
[  name  experience  salary
0    A           1   175.1
1    B           1   180.2,   name  experience  salary
2    C           5   190.3
3    D           7   205.4,   name  experience  salary
4    E           7   210.5
5    F          10   225.3]
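Splitting into N chunks can also be done with pandas alone, which may be preferable because some pandas 2.x releases emit a FutureWarning about DataFrame.swapaxes when a DataFrame is passed directly to numpy.array_split(). The following is a minimal sketch under that assumption. The split_into_n_chunks name is hypothetical and the function is not part of pandas or NumPy.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})


def split_into_n_chunks(dataframe, n):
    # Hypothetical helper: every chunk gets base_size rows and the
    # first `remainder` chunks get one extra row each.
    base_size, remainder = divmod(len(dataframe), n)

    chunks = []
    start = 0
    for position in range(n):
        stop = start + base_size + (1 if position < remainder else 0)
        chunks.append(dataframe.iloc[start:stop])
        start = stop

    return chunks


list_of_dataframes = split_into_n_chunks(df, 4)

print([len(chunk) for chunk in list_of_dataframes])  # [2, 2, 1, 1]
```

Giving the extra rows to the first chunks mirrors how numpy.array_split() handles a row count that is not divisible by N.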
If you need to split a DataFrame
every N rows, use the following reusable
function.
import math

import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})


def split_every_n_rows(dataframe, chunk_size=2):
    chunks = []

    # Round up so that leftover rows get their own chunk, without
    # producing an empty trailing chunk when the rows divide evenly
    num_chunks = math.ceil(len(dataframe) / chunk_size)

    for index in range(num_chunks):
        chunks.append(dataframe[index * chunk_size:(index + 1) * chunk_size])

    return chunks


list_of_dataframes = split_every_n_rows(df, 2)

print(list_of_dataframes)
Running the code sample produces the following output.
[  name  experience  salary
0    A           1   175.1
1    B           1   180.2,   name  experience  salary
2    C           5   190.3
3    D           7   205.4,   name  experience  salary
4    E           7   210.5
5    F          10   225.3]
The function splits the DataFrame
every chunk_size
rows (by default 2 rows).
The function returns a list of DataFrames.
You can access the list at a specific index to get a specific DataFrame
chunk
or you can iterate over the list to access each chunk.
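When the number of rows is not a multiple of the chunk size, the last chunk simply holds the leftover rows. A minimal sketch, assuming a hypothetical 7-row DataFrame and a chunk size of 3, with the chunk count rounded up via math.ceil():

```python
import math

import pandas as pd

# Hypothetical 7-row DataFrame, so the last chunk is shorter than the rest
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'experience': [1, 1, 5, 7, 7, 10, 12],
})

chunk_size = 3

# 7 / 3 rounds up to 3 chunks
num_chunks = math.ceil(len(df) / chunk_size)

chunks = [
    df.iloc[index * chunk_size:(index + 1) * chunk_size]
    for index in range(num_chunks)
]

print([len(chunk) for chunk in chunks])  # [3, 3, 1]
```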
Split a Pandas DataFrame into chunks using DataFrame.iloc
You can also use the DataFrame.iloc
integer-based indexer to split a Pandas DataFrame
into chunks.
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})


def split_every_n_rows(dataframe, chunk_size=2):
    chunks = []

    # Step through the row positions chunk_size rows at a time
    for index in range(0, dataframe.shape[0], chunk_size):
        chunks.append(dataframe.iloc[index:index + chunk_size])

    return chunks


list_of_dataframes = split_every_n_rows(df, 2)

print(list_of_dataframes)
Running the code sample produces the following output.
[  name  experience  salary
0    A           1   175.1
1    B           1   180.2,   name  experience  salary
2    C           5   190.3
3    D           7   205.4,   name  experience  salary
4    E           7   210.5
5    F          10   225.3]
The function takes a DataFrame and the chunk size (the number of rows per
chunk) as parameters and returns a list containing the DataFrame chunks.
By default, the function splits the DataFrame into chunks of 2 rows, but you
can set the chunk_size argument to any other value.
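A quick way to confirm that a chunking routine neither drops nor duplicates rows is to stitch the chunks back together and compare the result with the original. The sketch below assumes a chunk size of 4 and uses pandas.concat() and DataFrame.equals(); the reassembled and rows_match names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})

chunk_size = 4

# Slice by integer position, advancing chunk_size rows each time
chunks = [
    df.iloc[start:start + chunk_size]
    for start in range(0, len(df), chunk_size)
]

# Stitch the chunks back together and compare against the original
reassembled = pd.concat(chunks)
rows_match = reassembled.equals(df)

print([len(chunk) for chunk in chunks])
print(rows_match)
```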
Here is a reusable function that splits a Pandas DataFrame
every N rows using
a list comprehension.
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})


def split_every_n_rows(dataframe, chunk_size=2):
    # Use the dataframe parameter (not the global df) to get the row count
    return [
        dataframe[index:index + chunk_size]
        for index in range(0, dataframe.shape[0], chunk_size)
    ]


list_of_dataframes = split_every_n_rows(df, 2)

print(list_of_dataframes)
Running the code sample produces the following output.
[  name  experience  salary
0    A           1   175.1
1    B           1   180.2,   name  experience  salary
2    C           5   190.3
3    D           7   205.4,   name  experience  salary
4    E           7   210.5
5    F          10   225.3]
The function splits the DataFrame
every N rows using a list comprehension.
It iterates over a range() object with a step
of chunk_size
and returns a list containing the DataFrame
chunks.
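If the chunks are only needed one at a time, the same range() stepping can drive a generator instead of building the whole list up front. This is a sketch of that variation rather than something the approach above requires, and the iter_chunks name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'experience': [1, 1, 5, 7, 7, 10],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5, 225.3],
})


def iter_chunks(dataframe, chunk_size=2):
    # Hypothetical helper: yield one chunk at a time instead of
    # collecting all of them in a list
    for start in range(0, len(dataframe), chunk_size):
        yield dataframe.iloc[start:start + chunk_size]


# Collect the first name of every chunk while iterating lazily
first_names = [chunk['name'].iloc[0] for chunk in iter_chunks(df, 2)]

print(first_names)  # ['A', 'C', 'E']
```

Wrapping the call in list(), as in list(iter_chunks(df, 2)), gives back an ordinary list of DataFrame chunks when one is needed.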