 

Sparklyr/Hive: how to use regex (regexp_replace) correctly?

Consider the following example:

dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it fails (the string is returned unchanged).

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

asked Dec 19 '22 by ℕʘʘḆḽḘ


2 Answers

Spark SQL and Hive provide two different functions:

  • regexp_extract - takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - takes a string, a pattern, and a replacement string.

The former can be used to extract a single group, with the same index semantics as java.util.regex.Matcher.
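
For example, in a hedged sketch not taken from the original answer (the column names whole and month are made up), group 0 is the entire match while higher indices pick the corresponding capture group:

sdf %>% mutate(
  whole = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 0),  # "2011-03"
  month = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 2))  # "03"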

For regexp_replace, the pattern has to match the whole string; if there is no match, the input string is returned unchanged:

sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z

while with regexp_extract a full match is not required:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011

Also, due to the indirect execution (R -> Java), backslashes have to be escaped twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
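
To double-check the escaping, one option (a suggestion not from the original answer) is to render the SQL that sparklyr generates before running it; show_query() from dplyr/dbplyr prints the literal pattern that will reach Spark:

# Sketch: inspect how many backslashes survive the R -> SQL translation
sdf %>%
  mutate(regex = regexp_replace(mydate, '^(\\\\d{4}).*$', '$1')) %>%
  show_query()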

Normally one would use Spark datetime functions:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

but sadly sparklyr seems to be extremely limited in this area, and treats timestamps as strings.
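
Depending on the Spark and sparklyr versions, a possible workaround (an untested sketch, not part of the answer above) relies on the fact that function calls mutate() does not recognize are passed through verbatim to Spark SQL, so the same datetime functions can be used without invoke("sql", ...):

# Sketch only: to_timestamp() needs Spark >= 2.2; both it and
# date_format() are passed through to Spark SQL untranslated.
sdf %>%
  mutate(parsed = date_format(
    to_timestamp(mydate),
    "yyyy-MM-dd HH:mm:ss.SSS"))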

See also change string in DF using hive command and mutate with sparklyr.

answered Dec 20 '22 by zero323


I had some difficulty replacing "." with "", but it finally works with:

mutate(myvar2 = regexp_replace(myvar, "[.]", ""))
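
Following the double-escaping rule from the accepted answer, an equivalent (untested sketch, not part of the original post) would be to escape the dot itself, though the character class is arguably clearer:

# Assumed equivalent: "\\\\." reaches the Spark regex engine as \. (a literal dot)
mutate(myvar2 = regexp_replace(myvar, "\\\\.", ""))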
answered Dec 20 '22 by SébastienQUINAULT