Spark select column returns reference of old dataframe

Question 1

I use the following code:

random = [("ABC",xx, 1), 
          ("DEF",yy,1), 
          ("GHI",zz, 0) 
         ]
randomColumns = ["name","id", "male"]
randomDF = spark.createDataFrame(data=random, schema = randomColumns)
test_df = randomDF.select("name", "id")
test_df.filter(f.col("male") == '1').show()

From the above code I expect it to result in an error because for the test_df i dont select the male column from the original dataframe. Surprisingly the above query runs just fine without any error and outputs the following:

+---------+-------+
|name     |     id|
+---------+-------+
|      abc|     xx|
|      def|     yy|
+---------+-------+

I want to understand the logic behind what spark is doing. As per the spark documentation Select returns a new dataframe. Then why is it still able to use the male column from the parent dataframe.

Question 2

This is caused by the DAG generated by Spark. Some operators (or transformers) is are lazy-executed, so they pave the way for Spark to optimize the DAG.

In this example, there are two major steps: select (or project in SQL's jargon） first, and filter later. But in fact, when executing, filter first, and then select, because it is more efficient.

You can verify this conclusion by explain() method:

test_df.filter(f.col("flag") == '1').explain()

It will output:

== Physical Plan ==
*(1) Project [dept_name#0, dept_id#1L]
+- *(1) Filter (isnotnull(flag#2L) AND (flag#2L = 1))
   +- *(1) Scan ExistingRDD[dept_name#0,dept_id#1L,flag#2L]

Question 3

Adding to @chenzhongpu 's answer, please note that if you define a temp view on top of your test_df, the query will fail:

test_df.createOrReplaceTempView("test_df")
spark.sql("select * from test_df where flag = 1").show()
_Traceback (most recent call last): ...
:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`flag`' given input columns: [test_df.dept, test_df.id]; line 1 pos 24;
'Project [*]
 +- 'Filter ('flag = 1)
   +- SubqueryAlias `test_df`
      +- Project [dept#0, id#2L]
         +- LogicalRDD [dept#0, flag#1L, id#2L], false
 _

...because a select (=Project node in the execution plan) is going to precede filter (attempted via where clause).

chenzhongpu · Answer 1 · 2021-11-24T01:29:03