I have a data frame with a timestamp field, RECEIPTDATEREQUESTED:timestamp. For some reason, it contains dates earlier than 1900-01-01, and I don't want these. For every value in that column where RECEIPTDATEREQUESTED < '1900-01-01 00:00:00', I want to set the timestamp to either 1900-01-01 or null. I've tried a few ways to do this, but it seems something simpler must exist. I thought something like this might work, but
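    import datetime
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import TimestampType

    def testdate(date_value):
        # Oldest allowed timestamp; the format string has to cover the time part too
        oldest = datetime.datetime.strptime('1900-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
        try:
            if (date_value < oldest):
                return oldest
            else:
                return date_value
        except (TypeError, ValueError):
            # A null arrives as None, and comparing it to a datetime raises TypeError
            return oldest

    udf_testdate = udf(lambda x: testdate(x), TimestampType())
    bdf = olddf.withColumn("RECEIPTDATEREQUESTED", udf_testdate(col("RECEIPTDATEREQUESTED")))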