set missing values for multiple labelled variables

Tags:

r

tidyverse

r-haven

How do I set missing values for multiple labelled vectors in a data frame? I am working with a survey dataset from SPSS. I am dealing with about 20 different variables, all with the same missing values, so I would like to find a way to use lapply() to make this work, but I can't.

I can actually do this with base R via as.numeric() and then recode(), but I'm intrigued by the possibilities of haven and the labelled class, so I'd like to find a way to do this all in Hadley's tidyverse.

Roughly, the variables of interest look like this. I am sorry if this is a basic question, but I find the help documentation associated with the haven and labelled packages very unhelpful.

library(haven)
library(labelled)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
lapply(v3, val_labels)
lapply(v3, function(x) set_na_values(x, c(5,6)))

asked Apr 20 '17 21:04

spindoctor

2 Answers

OK, I think I now understand what you are trying to do: mark the labels and the values as NA without removing the underlying imported data.

See the addendum for a more detailed example that uses a public data file and dplyr pipes to update the labels and missing values of multiple columns.

Proposed Solution

library(haven)
library(labelled)
library(dplyr)

# tibble() is the current replacement for the deprecated data_frame()
df <- tibble(s1 = c(1,2,2,2,5,6), s2 = c(1,2,2,2,5,6)) %>%
  set_value_labels(s1 = c(agree=1, disagree=2, dk=5, refused=6), 
                   s2 = c(agree=1, disagree=2, dk = tagged_na("5"), refused = tagged_na("6"))) %>%
  set_na_values(s2 = c(5,6))


val_labels(df)
is.na(df$s1)
is.na(df$s2)
df

Solution Result:

> library(haven)
> library(labelled)
> library(dplyr)
> df <- data_frame(s1 = c(1,2,2,2,5,6), s2 = c(1,2,2,2,5,6)) %>%
+   set_value_labels(s1 = c(agree=1, disagree=2, dk=5, refused=6), 
+                    s2 = c(agree=1, disagree=2, dk = tagged_na("5"), refused = tagged_na("6"))) %>%
+   set_na_values(s2 = c(5,6))
> val_labels(df)
$s1
   agree disagree       dk  refused 
       1        2        5        6 

$s2
   agree disagree       dk  refused 
       1        2       NA       NA 

> is.na(df$s1)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> is.na(df$s2)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE
> df
# A tibble: 6 × 2
         s1        s2
  <dbl+lbl> <dbl+lbl>
1         1         1
2         2         2
3         2         2
4         2         2
5         5         5
6         6         6

Now we can manipulate the data

mean(df$s1, na.rm = TRUE)
mean(df$s2, na.rm = TRUE)

> mean(df$s1, na.rm = TRUE)
[1] 3
> mean(df$s2, na.rm = TRUE)
[1] 1.75

Use Labelled package to remove labels and replace with R NA

If you wish to strip the labels and replace with R NA values you can use remove_labels(x, user_na_to_na = TRUE)

Example:

df <- remove_labels(df, user_na_to_na = TRUE)
df

Result:

> df <- remove_labels(df, user_na_to_na = TRUE) 
> df
# A tibble: 6 × 2
     s1    s2
  <dbl> <dbl>
1     1     1
2     2     2
3     2     2
4     2     2
5     5    NA
6     6    NA

Explanation / Overview of SPSS Format:

IBM SPSS (the application) can import and export data in many formats and in non-rectangular configurations; however, the data set is always translated to an SPSS rectangular data file, known as a system file (with the extension *.sav). Metadata (information about the data) such as variable formats, missing values, and variable and value labels are stored with the dataset.

Value Labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

SPSS and SAS can label numeric and character values, not just integer values.

Missing Values

All three tools (SPSS, SAS, Stata) provide a global “system missing value” which is displayed as a period (.). This is roughly equivalent to R’s NA, although neither Stata nor SAS propagate missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. -inf), and Stata treats it as the largest possible number (i.e. inf).

Each tool also provides a mechanism for recording multiple types of missingness:

Stata has “extended” missing values, .A through .Z.
SAS has “special” missing values, .A through .Z plus ._.
SPSS has per-column “user” missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

User Defined Missing Values

SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
#> <Labelled SPSS double>
#>  [1]  1  2  3  4  5  6  7  8  9 10 99
#> Missing values: 99
#> 
#> Labels:
#>  value   label
#>     99 Missing
x2
#> <Labelled SPSS double>
#>  [1]  1  2  3  4  5  6  7  8  9 10 99
#> Missing range:  [90, Inf]
#> 
#> Labels:
#>  value   label
#>     99 Missing
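
To make the effect of a user-defined missing value concrete, here is a minimal sketch that reuses x1 from above. It assumes a reasonably recent haven, where is.na() reports user-defined missings as TRUE and zap_missing() turns them into regular R NAs; the variable name x1_plain is only for illustration.

```r
library(haven)

# Same vector as x1 above: 99 is declared as a user-defined missing value.
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)

# The underlying value 99 is still stored, but is.na() flags it as missing.
is.na(x1)

# zap_missing() replaces user-defined missings with regular NA;
# as.numeric() then drops the labelled class for plain arithmetic.
x1_plain <- as.numeric(zap_missing(x1))
mean(x1_plain, na.rm = TRUE)
```

The point of the two-step conversion is that the 99 stays in the data until you explicitly decide to discard it.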

Tagged missing values

To support Stata’s extended and SAS’s special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they’ll be created by haven for you. But you can create your own with tagged_na():

Important:

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use print_tagged_na():
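
As a small illustration of the two points above, the following sketch builds a vector with one tagged NA and one plain NA and then inspects it. The vector x and the tag "a" are arbitrary choices for the example.

```r
library(haven)

# A real number, an NA tagged with "a", and a plain NA.
x <- c(1, tagged_na("a"), NA)

# Both kinds of NA count as missing in ordinary R operations.
is.na(x)

# Only the tagged NA is reported here.
is_tagged_na(x)

# Recover the tag of each element (NA where there is no tag).
na_tag(x)

# Print the vector with tags shown, e.g. NA(a) for the tagged value.
print_tagged_na(x)
```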

Thus:

    library(haven)
    library(labelled)
    v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
    v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=tagged_na("5"), refused= tagged_na("6")))
    v3<-data.frame(v1 = v1, v2 = v2)
    v3
    lapply(v3, val_labels)

> v3
  x x.1
1 1   1
2 2   2
3 2   2
4 2   2
5 5   5
6 6   6
> lapply(v3, val_labels)
$x
   agree disagree       dk  refused 
       1        2        5        6 

$x.1
   agree disagree       dk  refused 
       1        2       NA       NA

Word of caution:

SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.

I hope the above helps

Take care T.

References:

https://cran.r-project.org/web/packages/haven/haven.pdf
https://cran.r-project.org/web/packages/haven/vignettes/semantics.html
https://www.spss-tutorials.com/spss-missing-values-tutorial/

Addendum Example using Public Data...

SPSS Missing Values Example using an SPSS Data file {hospital.sav}

First, let's make sure we distinguish between the two kinds of missing values:

System missing values are values that are completely absent from the data.
User missing values are values that are present in the data but must be excluded from calculations.

SPSS View of Data...

Let's review the image and the data. The SPSS variable view shows that each row has a Label [Column 5]. Rows 10 through 14 have specific values attributed to them [1..6] [Column 6] that have name attributes, and no values have been specified as Missing [Column 7].

[Image: SPSS variable view]

Now let's look at the SPSS data view:

Here we can see that there is missing data (see the highlighted "."s). The key point is that we have missing data, but currently have no user-defined missing values.

[Image: SPSS data view]

Now let's turn to R, and load the data into R

hospital_url <- "https://www.spss-tutorials.com/downloads/hospital.sav"
hospital <- read_sav(hospital_url, 
                     user_na = FALSE)
head(hospital,5)

# We're interested in columns 10 through 14...
head(hospital[10:14],5)

Result

> hospital_url <- "https://www.spss-tutorials.com/downloads/hospital.sav"
> hospital <- read_sav(hospital_url, 
+                      user_na = FALSE)
> head(hospital,5)
# A tibble: 5 × 14
  visit_id patient_id first_name surname_prefix last_name    gender entry_date entry_time
     <dbl>      <dbl>      <chr>          <chr>     <chr> <dbl+lbl>     <date>     <time>
1    32943      23176    JEFFREY                 DIJKSTRA         1 2013-01-08   16:56:10
2    32944      20754       MARK        VAN DER      BERG         1 2013-02-01   14:24:45
3    32945      25419     WILLEM                VERMEULEN         1 2013-02-02   10:01:43
4    32946      21139      LINDA                  JANSSEN         0 2013-02-10   10:24:39
5    32947      25419     WILLEM                VERMEULEN         1 2013-02-10   18:05:59
# ... with 6 more variables: exit_moment <dttm>, doctor_rating <dbl+lbl>, nurse_rating <dbl+lbl>,
#   room_rating <dbl+lbl>, food_rating <dbl+lbl>, facilities_rating <dbl+lbl>

Columns 10 through 14 contain Values

1="Very Dissatisfied"
2="Dissatisfied"
3="Neutral"
4="Satisfied"
5="Very Satisfied"
6="Not applicable or don't want to answer"

thus:

> head(hospital[10:14],5)
# A tibble: 5 × 5
  doctor_rating nurse_rating room_rating food_rating facilities_rating
      <dbl+lbl>    <dbl+lbl>   <dbl+lbl>   <dbl+lbl>         <dbl+lbl>
1             5            5           4           2                 3
2             4            5           4           3                 3
3             5            6           4           5                 4
4             4            5           5           4                 4
5             5            5           6           6                 6

SPSS Value Labels

> lapply(hospital[10], val_labels)
$doctor_rating
                     Very dissatisfied                           Dissatisfied 
                                     1                                      2 
                               Neutral                              Satisfied 
                                     3                                      4 
                        Very satisfied Not applicable or don't want to answer 
                                     5                                      6

OK, note that the output above confirms we have imported the value labels.

Remove Non-Applicable data from the survey data

Our goal is now to exclude the "Not applicable or don't want to answer" entries by setting them to be "User NA values", i.e. an SPSS missing value.

Solution - Step 1 - A Single Column

We wish to set the missing value attribute across multiple columns in the data. Let's first do this for one column.

Note we use add_value_labels not set_value_labels as we wish to append a new label, not completely overwrite existing labels...

d <- hospital
mean(d$doctor_rating, na.rm = TRUE)

d <- hospital %>% 
  add_value_labels( doctor_rating = c( "Not applicable or don't want to answer" 
                                       = tagged_na("6") )) %>%
  set_na_values(doctor_rating = 6)

val_labels(d$doctor_rating)
mean(d$doctor_rating, na.rm = TRUE)

> d <- hospital
> mean(d$doctor_rating, na.rm = TRUE)
[1] 4.322368
> d <- hospital %>% 
+   add_value_labels( doctor_rating = c( "Not applicable or don't want to answer" 
+                                        = tagged_na("6") )) %>%
+   set_na_values(doctor_rating = 6)
> val_labels(d$doctor_rating)
                     Very dissatisfied                           Dissatisfied 
                                     1                                      2 
                               Neutral                              Satisfied 
                                     3                                      4 
                        Very satisfied Not applicable or don't want to answer 
                                     5                                      6 
Not applicable or don't want to answer 
                                    NA 
> mean(d$doctor_rating, na.rm = TRUE)
[1] 4.097015

Solution - Step 2 - Now apply to multiple columns...

mean(hospital$nurse_rating)
mean(hospital$nurse_rating, na.rm = TRUE)
d <- hospital %>% 
  add_value_labels( doctor_rating = c( "Not applicable or don't want to answer" 
                                       = tagged_na("6") )) %>%
  set_na_values(doctor_rating = 6) %>%
  add_value_labels( nurse_rating = c( "Not applicable or don't want to answer" 
                                     = tagged_na("6") )) %>%
  set_na_values(nurse_rating = 6)
mean(d$nurse_rating, na.rm = TRUE)

Result

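Note that nurse_rating contains NaN values as well as the values we tag as NA. The first mean() call returns NaN because na.rm is not set. The second succeeds but still includes the "Not applicable..." responses (coded 6). Once 6 is declared a user NA value, those responses are excluded from the mean as well.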

> mean(hospital$nurse_rating)
[1] NaN
> mean(hospital$nurse_rating, na.rm = TRUE)
[1] 4.471429
> d <- hospital %>% 
+   add_value_labels( doctor_rating = c( "Not applicable or don't want to answer" 
+                                        = tagged_na("6") )) %>%
+   set_na_values(doctor_rating = 6) %>%
+   add_value_labels( nurse_rating = c( "Not applicable or don't want to answer" 
+                                      = tagged_na("6") )) %>%
+   set_na_values(nurse_rating = 6)
> mean(d$nurse_rating, na.rm = TRUE)
[1] 4.341085
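
For the roughly 20 columns in the original question, chaining one add_value_labels()/set_na_values() pair per column gets long. Below is a minimal sketch of one alternative, assuming all target columns share the same coding: loop over a vector of column names with lapply() and the na_values<- setter from labelled. The toy tibble, the cols vector and the mark_user_na() helper are made up for illustration; they are not part of haven or labelled.

```r
library(haven)
library(labelled)

# Toy stand-in for several survey columns that share one coding scheme.
labs <- c(agree = 1, disagree = 2, dk = 5, refused = 6)
df <- tibble::tibble(
  v1 = labelled(c(1, 2, 2, 2, 5, 6), labs),
  v2 = labelled(c(2, 1, 5, 6, 1, 2), labs)
)

# Hypothetical list of columns to update; with real data this could be
# something like names(hospital)[10:14].
cols <- c("v1", "v2")

# Hypothetical helper: declare `values` as user-defined missing on one vector
# and return the modified vector.
mark_user_na <- function(x, values) {
  na_values(x) <- values
  x
}

# Apply the helper to every target column and assign the results back.
df[cols] <- lapply(df[cols], mark_user_na, values = c(5, 6))

# Check the declared user NA values and the resulting missingness.
lapply(df[cols], na_values)
is.na(df$v1)
```

Returning the modified vector and assigning back into df[cols] avoids the need for global assignment inside the loop.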

Convert tagged NA to R NA

Here we take the above tagged NA and convert to R NA values.

d <- d %>% remove_labels(user_na_to_na = TRUE)

answered Oct 09 '22 21:10

Technophobe01

Not quite sure if this is what you are looking for:

v1 <- labelled(c(1, 2, 2, 2, 5, 6), c(agree = 1, disagree = 2, dk = 5, refused = 6))
v2 <- labelled(c(1, 2, 2, 2, 5, 6), c(agree = 1, disagree = 2, dk = 5, refused = 6))
v3 <- data_frame(v1 = v1, v2 = v2)

lapply(names(v3), FUN = function(x) {
  na_values(v3[[x]]) <<- 5:6
})

lapply(v3, na_values)

The last line returning

$v1
[1] 5 6

$v2
[1] 5 6

Verify missing values:

is.na(v3$v1)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

answered Oct 09 '22 21:10

Martin Schmelzer
