- PySpark Cookbook
- Denny Lee Tomasz Drabas
Getting ready
As in the previous sections, we will use the flights dataset and create both an RDD and a DataFrame from it:
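# Create flights RDD
# zipWithIndex() pairs each row with its position so the header (index 0) can be dropped.
# Python 3 lambdas cannot unpack tuple arguments, so the (row, idx) pair is indexed instead.
flights = sc.textFile('/databricks-datasets/flights/departuredelays.csv')\
  .map(lambda line: line.split(","))\
  .zipWithIndex()\
  .filter(lambda row_idx: row_idx[1] > 0)\
  .map(lambda row_idx: row_idx[0])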
# Create flightsDF DataFrame
# Reads the same file as the RDD; Spark does not expand '~' in paths.
flightsDF = spark.read\
  .options(header='true', inferSchema='true')\
  .csv('/databricks-datasets/flights/departuredelays.csv')
flightsDF.createOrReplaceTempView("flightsDF")
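
To see what the split, zipWithIndex, filter, and map chain in the RDD code does, here is a minimal plain-Python sketch that mimics the same steps on a handful of lines, with no Spark required. The sample lines and the helper name skip_header are assumptions made for illustration; they are not part of the PySpark API, and the real file is far larger.

```python
# Plain-Python imitation of the RDD pipeline; no Spark needed.
# NOTE: sample_lines is a tiny illustrative stand-in for the CSV file.
sample_lines = [
    "date,delay,distance,origin,destination",
    "01011245,6,602,ABE,ATL",
    "01020600,-8,369,ABE,DTW",
]


def skip_header(lines):
    """Hypothetical helper mirroring split -> zipWithIndex -> filter -> map."""
    rows = [line.split(",") for line in lines]               # like .map(split)
    indexed = [(row, idx) for idx, row in enumerate(rows)]   # like .zipWithIndex()
    kept = [pair for pair in indexed if pair[1] > 0]         # like .filter(idx > 0)
    return [pair[0] for pair in kept]                        # like .map(row)


rows = skip_header(sample_lines)
print(rows)
```

Each remaining element is a list of strings, which is also what the flights RDD holds: unlike the DataFrame reader with inferSchema, the RDD route leaves every field as text.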