I use the following code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

random = [("abc", "xx", 1),
          ("def", "yy", 1),
          ("ghi", "zz", 0)]
randomColumns = ["name", "id", "male"]
randomDF = spark.createDataFrame(data=random, schema=randomColumns)
test_df = randomDF.select("name", "id")
test_df.filter(f.col("male") == 1).show()
I expected this code to fail with an error, because test_df does not include the male column from the original DataFrame. Surprisingly, the query runs without any error and outputs the following:
+----+---+
|name| id|
+----+---+
| abc| xx|
| def| yy|
+----+---+
I want to understand the logic behind what Spark is doing here. According to the Spark documentation, select returns a new DataFrame. Why is that new DataFrame still able to filter on the male column of its parent DataFrame?