.reduce(...) action
The reduce(f) action aggregates the elements of an RDD using the function f. The f function should be commutative and associative so that the result can be computed correctly in parallel. Look at the following code:
# Calculate the total delays of flights
# between SEA (origin) and SFO (dest),
# convert delays column to int
# and sum them up
flights\
.filter(lambda c: c[3] == 'SEA' and c[4] == 'SFO')\
.map(lambda c: int(c[1]))\
.reduce(lambda x, y: x + y)
This will produce the following result:
# Output
22293
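Note that the flights RDD is assumed to exist already; a minimal sketch of how it might be built follows, where the file path and the column layout (delay at index 1, origin at index 3, destination at index 4) are assumptions consistent with the lambdas above:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical CSV of flight records; assumed column layout:
# date, delay, distance, origin, destination
flights = sc.textFile('departuredelays.csv')\
    .map(lambda c: c.split(','))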
We need to make an important note here, however: when using reduce(), the reducer function needs to be associative and commutative; that is, changing the grouping of the operations or the order of the operands must not change the result.
Associativity rule: (6 + 3) + 4 = 6 + (3 + 4)
Commutativity rule: 6 + 3 + 4 = 4 + 3 + 6
Errors can occur if you ignore these rules, as the following checks show.
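Division violates both rules, which is why it makes a good counterexample; a quick plain-Python check:
# Addition satisfies both rules:
assert (6 + 3) + 4 == 6 + (3 + 4)  # associative
assert 6 + 3 + 4 == 4 + 3 + 6      # commutative

# Division satisfies neither:
assert (1 / 2) / 4 != 1 / (2 / 4)  # 0.125 vs. 2.0: not associative
assert 1 / 2 != 2 / 1              # 0.5 vs. 2.0: not commutative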
As an example, see the following RDD (with one partition only!):
data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 1)
Reducing the data to divide the current result by the subsequent one, we would expect a value of 10:
works = data_reduce.reduce(lambda x, y: x / y)
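With a single partition, reduce folds the elements from left to right, so the chain of divisions evaluates in order:
print(works)
# ((((1 / 2) / .5) / .1) / 5) / .2 = 10.0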
Partitioning the data into three partitions will produce an incorrect result:
data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 3)
data_reduce.reduce(lambda x, y: x / y)
It will produce 0.004.
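The reason is that Spark first reduces each partition locally and then reduces the partial results, and division is neither associative nor commutative. A sketch using glom(), which returns the contents of each partition as a list, makes the split visible (parallelize slices the list into contiguous chunks, so with this input the split is deterministic):
data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 3)
print(data_reduce.glom().collect())
# [[1, 2], [0.5, 0.1], [5, 0.2]]

# Each partition is reduced locally first:
#   1 / 2     = 0.5
#   0.5 / 0.1 = 5.0
#   5 / 0.2   = 25.0
# and the partial results are then reduced in turn:
#   (0.5 / 5.0) / 25.0 = 0.004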