I have a dataframe log_df:
I generate a new dataframe based on the following code:
from pyspark.sql.functions import split, regexp_extract
split_log_df = log_df.select(
    regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'),
)
split_log_df.show(10, truncate=False)
The new dataframe looks like:
I need another column showing the day of the week. What would be the most elegant way to create it? Ideally by just adding a udf-like field in the select.
Thank you very much.
Update: my question is different from the one in the comment. I need to calculate the day of the week from a string in log_df, not from a timestamp column as in that question, so this is not a duplicate. Thanks.
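Since the 'timestamp' column here is a plain string, one udf-style option is an ordinary Python function that parses the string and returns the weekday number. The following is only a minimal sketch: the name log_day_of_week, the format constant, and the sample value are hypothetical, and it assumes every extracted string has the layout matched by the regex above (for example "01/Aug/1995:00:00:01 -0400").

```python
from datetime import datetime
from typing import Optional

# Assumed layout of the extracted string, e.g. "01/Aug/1995:00:00:01 -0400".
LOG_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def log_day_of_week(ts: Optional[str]) -> Optional[int]:
    """Return the ISO weekday (Monday = 1 ... Sunday = 7), or None if unparsable."""
    if not ts:
        # regexp_extract yields an empty string when the pattern does not match.
        return None
    try:
        return datetime.strptime(ts, LOG_TS_FORMAT).isoweekday()
    except ValueError:
        return None


# Hypothetical sample value; 1 August 1995 was a Tuesday.
print(log_day_of_week("01/Aug/1995:00:00:01 -0400"))
```

To use it inside the select, the function could be wrapped with pyspark.sql.functions.udf and an IntegerType return type, then applied to the same regexp_extract expression that produces 'timestamp'. Note that this sketch numbers the week from Monday = 1, which differs from Spark's built-in dayofweek.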
last_day: Returns the last day of the month to which the given date belongs. New in version 1.5.
In PySpark, withColumn() is a widely used DataFrame transformation. It can change the values of a column, convert the data type of an existing column, or create a new column.
Solution: Using the Spark SQL date_format() function along with date formatting patterns, we can extract the day of the year and the week of the year from Date and Timestamp columns.
This example uses date formatting patterns to extract the day of the week from Spark date and timestamp DataFrame columns. The numeric value ranges from 1 to 7, where 1 is Monday and 7 is Sunday. The E pattern returns the day of the week as a three-character abbreviation, for example 'Mon' for Monday.
Since Spark 2.3 you can use the dayofweek function https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.dayofweek.html
from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', dayofweek('my_timestamp'))
However, this defines the start of the week as Sunday = 1.
If you don't want that and instead need Monday = 1, you could use an inelegant fudge: either subtract 1 day before calling dayofweek, or adjust the result like this:
from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', ((dayofweek('my_timestamp')+5)%7)+1)
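The arithmetic in that fudge can be checked without Spark. The sketch below is plain Python, not PySpark: sunday_first and monday_first are hypothetical helper names, and sunday_first only mimics the Sunday = 1 numbering described above. It walks one week and confirms that both fudges (remapping the result, or subtracting one day first) give Monday = 1 through Sunday = 7.

```python
from datetime import date, timedelta


def sunday_first(d: date) -> int:
    """Mimic the Sunday = 1 ... Saturday = 7 numbering of Spark's dayofweek."""
    return d.isoweekday() % 7 + 1


def monday_first(n: int) -> int:
    """Apply the fudge from the answer: remap Sunday = 1 to Monday = 1."""
    return ((n + 5) % 7) + 1


# Walk one full week, starting on Sunday 30 July 1995.
start = date(1995, 7, 30)
for offset in range(7):
    day = start + timedelta(days=offset)
    remapped = monday_first(sunday_first(day))
    shifted = sunday_first(day - timedelta(days=1))
    # Both fudges should agree with the ISO weekday (Monday = 1 ... Sunday = 7).
    assert remapped == shifted == day.isoweekday()
    print(day.strftime("%a"), sunday_first(day), remapped)
```

Each printed row shows the weekday name, the Sunday-first number, and the remapped Monday-first number.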