How to get the weekday from day of month using pyspark

I have a dataframe log_df whose value column holds raw log lines.

I generate a new dataframe based on the following code:

from pyspark.sql.functions import regexp_extract
split_log_df = log_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)

The new dataframe has the columns host, timestamp, path, status and content_size.

I need another column showing the day of the week. What would be the most elegant way to create it? Ideally I would just add a UDF-like field to the select.

Thank you very much.

Update: my question differs from the one linked in the comment. I need to compute the day of the week from a string in log_df, not from a timestamp column as in that question, so this is not a duplicate. Thanks.

asked Aug 13 '16 03:08

mdivk

1 Answer

Since Spark 2.3 you can use the dayofweek function on a date or timestamp column: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.dayofweek.html

from pyspark.sql.functions import dayofweek
# withColumn returns a new DataFrame, so keep the result
df = df.withColumn('day_of_week', dayofweek('my_timestamp'))

However, dayofweek numbers the days starting from Sunday = 1 (through Saturday = 7).

If you need Monday = 1 instead, an inelegant workaround is to either subtract one day before calling dayofweek or remap the result, like this:

from pyspark.sql.functions import dayofweek
# shift Sunday = 1 ... Saturday = 7 to Monday = 1 ... Sunday = 7
df = df.withColumn('day_of_week', ((dayofweek('my_timestamp')+5)%7)+1)
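
To see that the remapping behaves as intended, the same arithmetic can be checked in plain Python without Spark. This is just an illustrative sketch; monday_first and SPARK_DAY_NAMES are hypothetical names, not part of PySpark.

```python
# Day names in the order dayofweek numbers them (Sunday = 1 ... Saturday = 7).
SPARK_DAY_NAMES = [
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
]


def monday_first(spark_dow: int) -> int:
    """Apply the same (x + 5) % 7 + 1 shift used in the Spark expression."""
    return (spark_dow + 5) % 7 + 1


# Print each day with its Sunday-first and Monday-first number.
for spark_dow, name in enumerate(SPARK_DAY_NAMES, start=1):
    print(f"{name}: {spark_dow} -> {monday_first(spark_dow)}")
```

Monday maps to 1 and Sunday wraps around to 7, with the days in between keeping their order.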

answered Oct 19 '22 15:10

Graeme Tate

Tags: apache-spark, pyspark, dayofweek
