
How to handle example data in R Package that has UTF-8 marked strings

Tags: r, utf-8, twitter

I would like to include an example dataset (of Twitter tweets and metadata) in an R Package I'm writing.

I downloaded an example data.frame using the Twitter API and saved it as .RData (with the corresponding .R data description file) in my package.

When I run R CMD check, I get the following NOTE:

 * checking data for non-ASCII characters ... NOTE
 Note: found 287 marked UTF-8 strings

I tried saving the data.frame with ascii=TRUE, hoping this would fix the problem, but the NOTE persists. Any idea how I can get R CMD check to run without this note?
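For reference, the save call was along these lines (the object name tweets_df and the file path are placeholders):

# Placeholder object/file names; resaving with ascii = TRUE in hopes of avoiding the NOTE
save(tweets_df, file = "data/tweets_df.RData", ascii = TRUE)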

(also, I would be open to removing all UTF-8 marked strings from the example data if that's the solution). Thank you!

Example row from the data.frame:

First time in SF (@ San Francisco International Airport (SFO) - @flysfo in San Francisco, CA) https://t.co/1245xqxtwesr
  favorited favoriteCount replyToSN             created truncated replyToSID                 id replyToUID
1     FALSE             0      <NA> 2015-03-13 23:30:35     FALSE       <NA> 576525795927179264       <NA>
                                                   statusSource screenName retweetCount isRetweet retweeted
1 <a href="http://foursquare.com" rel="nofollow">Foursquare</a>  my_name93            0     FALSE     FALSE
      longitude    latitude
1 -122.38100052 37.61865062
asked Mar 14 '15 by Rocinante


1 Answer

In case it's useful to anyone in the future, the resolution I found is this:

The UTF-8 marked characters were in the dataset because Twitter tweets sometimes include emojis.

The advice I was given is that there isn't a straightforward way to get rid of the NOTE from R CMD check without simply removing all of the UTF-8 marked strings.

To do this, I used the command:

nonUTF <- iconv(df$TroubleVector, from = "UTF-8", to = "ASCII")

on the vector that contained emojis, etc. iconv() returns NA for any value that can't be converted to ASCII, so I used those NAs to subset the dataset; after removing the affected rows, the check runs cleanly.
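For completeness, the subsetting step looked roughly like this (df_clean and the file path are placeholders):

# nonUTF comes from the iconv() call above; it is NA wherever the text
# could not be converted to plain ASCII (emojis and other UTF-8 content)
df_clean <- df[!is.na(nonUTF), ]

# Re-save the cleaned example dataset for the package
save(df_clean, file = "data/df_clean.RData")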

answered Oct 22 '22 by Rocinante