- PySpark Cookbook
- Denny Lee Tomasz Drabas
- 2025-04-04 16:35:18
.zipWithIndex() transformation
The zipWithIndex() transformation pairs (zips) each element of the RDD with its index, producing (element, index) tuples. This is handy when you want to remove the header row (first row) of a file.
Look at the following code snippet:
# View each row within RDD + the index
# i.e. output is in form ([row], idx)
ac = airports.map(lambda c: (c[0], c[3]))
ac.zipWithIndex().take(5)
This generates the following result:
# Output
[((u'City', u'IATA'), 0),
((u'Abbotsford', u'YXX'), 1),
((u'Aberdeen', u'ABR'), 2),
((u'Abilene', u'ABI'), 3),
((u'Akron', u'CAK'), 4)]
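To make the shape of these pairs concrete without a Spark session, here is a minimal plain-Python sketch of the same pairing. It only mimics the result of zipWithIndex() on a small list and says nothing about how Spark computes indices across partitions. The helper name zip_with_index is hypothetical, and the sample rows are copied from the output above.

```python
# Plain-Python stand-in for the result of RDD.zipWithIndex(); no Spark required.
# Sample rows are taken from the output shown above.
rows = [
    ("City", "IATA"),
    ("Abbotsford", "YXX"),
    ("Aberdeen", "ABR"),
]


def zip_with_index(items):
    """Pair each item with its position, item first (hypothetical helper)."""
    # enumerate() yields (index, item); zipWithIndex() yields (item, index),
    # so the order is swapped here.
    return [(item, idx) for idx, item in enumerate(items)]


indexed = zip_with_index(rows)
print(indexed)
```

Note that the element comes first and the index second, which is the reverse of what Python's built-in enumerate() returns.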
To remove the header from your data, you can use the following code:
# Using zipWithIndex to skip header row
# - filter out row 0
# - extract only row info
(
ac
.zipWithIndex()
.filter(lambda row_idx: row_idx[1] > 0)
.map(lambda row_idx: row_idx[0])
.take(5)
)
The preceding code skips the header, as shown in the following output:
# Output
[(u'Abbotsford', u'YXX'),
(u'Aberdeen', u'ABR'),
(u'Abilene', u'ABI'),
(u'Akron', u'CAK'),
(u'Alamosa', u'ALS')]
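The filter-then-map logic can also be tried in plain Python, as in the rough sketch below. It uses the built-in filter() and map() on a small hand-written list instead of an RDD, so it illustrates only the logic, not Spark's execution. One assumption is that you are running Python 3, where a lambda cannot unpack a tuple in its parameter list (lambda (row, idx): ... is a syntax error), so the pair is indexed instead.

```python
# Plain-Python sketch of the header-skipping steps; no Spark required.
# The (row, index) pairs mirror the zipWithIndex() output shown earlier.
indexed = [
    (("City", "IATA"), 0),
    (("Abbotsford", "YXX"), 1),
    (("Aberdeen", "ABR"), 2),
]

# Keep pairs whose index is greater than 0, then keep only the row part.
# pair[0] is the row and pair[1] is the index.
without_header = list(
    map(
        lambda pair: pair[0],
        filter(lambda pair: pair[1] > 0, indexed),
    )
)
print(without_header)
```

Under Python 3 the strings print without the u prefix seen in the output above, which comes from Python 2.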