6. Transformations and Actions
"""
Transformations (Lazy): build a new RDD/DataFrame without running anything
- map()
- filter()
- flatMap()
- union()
Actions (Eager): trigger execution and return a result to the driver
- collect()
- count()
- take()
- first()
"""
7. Reading a CSV File
# Basic read (no header option: columns default to _c0, _c1, ...)
df = spark.read.csv("path/to/file.csv")
# With options
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("path/to/file.csv")
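# inferSchema costs an extra pass over the file; an explicit schema is
# the usual alternative (a sketch, column names assumed):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.option("header", "true").schema(schema).csv("path/to/file.csv")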
8. Narrow vs Wide Transformations
"""
Narrow Transformations:
- Each output partition depends on a single input partition
- Examples: map(), filter()
- No shuffle required
Wide Transformations:
- Each output partition may depend on many input partitions
- Examples: groupByKey(), reduceByKey(), join()
- Requires a shuffle of data across the network
"""
9. Lazy Evaluation
"""
PySpark uses lazy evaluation:
- Transformations are not executed immediately
- Actions trigger execution
- Spark builds a DAG of the requested operations
- The whole plan is optimized before it runs
"""
10. select() vs withColumn() vs selectExpr()
# select()
df.select("name", "age")
# withColumn() adds or replaces a column
from pyspark.sql.functions import col
df.withColumn("age_doubled", col("age") * 2)
# selectExpr()
df.selectExpr("name", "age * 2 as age_doubled")
11. Filtering Rows
# Using filter
df.filter(df.age > 25)
# Using where (where() is an alias for filter())
df.where(df.age > 25)
# Multiple conditions (wrap each in parentheses; use & / |, not and / or)
df.filter((df.age > 25) & (df.salary >= 50000))
12. Handling Missing Values
# Drop rows containing any null
df.dropna()
# Fill nulls with a single value
df.fillna(0)
# Fill per-column defaults
df.fillna({"age": 0, "name": "Unknown"})