PYSPARK BASIC CONCEPTS PART-2

6. Transformations and Actions

"""
Transformations (Lazy):
- map()
- filter()
- flatMap()
- union()

Actions (Eager):
- collect()
- count()
- take()
- first()
"""

7. Reading CSV File

# Basic read (no header row assumed; columns come back as _c0, _c1, ... and all strings)
df = spark.read.csv("path/to/file.csv")

# With options: treat the first row as a header and infer column types
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("path/to/file.csv")

8. Narrow vs Wide Transformations

"""
Narrow Transformations:
- One-to-one partition dependency
- Examples: map(), filter()
- No shuffle required

Wide Transformations:
- Many-to-many partition dependency
- Examples: groupByKey(), reduceByKey()
- Requires shuffle
"""

9. Lazy Evaluation

"""
PySpark uses lazy evaluation:
- Transformations are not executed immediately
- Actions trigger execution
- Creates DAG of operations
- Optimizes execution plan
"""

10. select() vs withColumn() vs selectExpr()

from pyspark.sql.functions import col

# select() - project a subset of columns
df.select("name", "age")

# withColumn() - add or replace a single column
df.withColumn("age_doubled", col("age") * 2)

# selectExpr() - project columns using SQL expression strings
df.selectExpr("name", "age * 2 as age_doubled")
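Note the practical difference: withColumn() returns all existing columns plus the new (or replaced) one, while select() and selectExpr() return only the columns listed; selectExpr() additionally accepts arbitrary SQL expressions.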

11. Filtering Rows

# Using filter
df.filter(df.age > 25)

# where() is simply an alias for filter()
df.where(df.age > 25)

# Multiple conditions: combine with &, with parentheses around each condition
df.filter((df.age > 25) & (df.salary >= 50000))
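The parentheses are required because & and | bind tighter than comparison operators in Python; the same pattern covers OR and NOT:

# OR and NOT via | and ~, again with parentheses around each condition
df.filter((df.age > 25) | (df.salary >= 50000))
df.filter(~(df.age > 25))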

12. Handling Missing Values

# Drop rows containing any null values
df.dropna()

# Fill nulls: a single value for all compatible columns, or per column via a dict
df.fillna(0)
df.fillna({"age": 0, "name": "Unknown"})
