I want to call the ntile function in Spark on my DataFrame, once for each DataFrame partition:
.sql/api/pyspark.sql.functions.ntile.html
I want to avoid any shuffling for performance reasons, and that should be possible here: the ntile value can be calculated for each partition of data independently.
So my "window" isn't a column, it's the partition number.
This article suggests adding a partition_id column and using that: /@aalopatin/row-number-in-spark-is-simple-but-there-are-nuances-a7c9099e55dc
This works functionally, because I can then create a window over the partition_id column, but in testing the Spark UI shows that shuffling is still happening (presumably because Spark can't determine that the data is already partitioned the way the window requires).
I'm aware I could write custom code using mapPartitions, but I'd rather use Spark's built-in functions if I can.
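For reference, here is a rough sketch of the mapPartitions route I'd like to avoid. The function name and structure are my own, not from any library; it assumes each partition fits in memory, and it uses sortWithinPartitions, which sorts locally without an exchange:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical sketch: compute ntile buckets per partition without a shuffle.
// Caveat: buffers each whole partition in memory to count its rows.
def ntilePerPartition(spark: SparkSession, df: DataFrame, n: Int): DataFrame = {
  val schema = StructType(df.schema.fields :+
    StructField("ntile_num", IntegerType, nullable = false))
  val rows = df.sortWithinPartitions("ts").rdd.mapPartitions { iter =>
    val buffered = iter.toArray
    val count = buffered.length
    val q = count / n                // base bucket size
    val r = count % n                // the first r buckets get one extra row
    buffered.iterator.zipWithIndex.map { case (row, i) =>
      val bucket =
        if (i < r * (q + 1)) i / (q + 1) + 1
        else r + (i - r * (q + 1)) / q + 1
      Row.fromSeq(row.toSeq :+ bucket)
    }
  }
  spark.createDataFrame(rows, schema)
}
```

The bucketing rule mirrors ntile's semantics: bucket sizes differ by at most one, with the larger buckets first (and when a partition has fewer rows than n, each row lands in its own bucket).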
Is there any way I can create a Window over the partition id and have Spark recognise that no shuffle is needed?
This is my code currently, which works but appears (at least from the Spark UI) to be shuffling:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ntile, spark_partition_id}

val dfWithPartition = df.withColumn("partition_id", spark_partition_id())
val window = Window.partitionBy("partition_id").orderBy("ts")
val dfWithNTile = dfWithPartition.withColumn("ntile_num", ntile(10).over(window))