I would like to include an example dataset (of Twitter tweets and metadata) in an R package I'm writing. I downloaded an example data.frame using the Twitter API and saved it as .RData (with the corresponding .R data description file) in my package. When I run R CMD check, I get the following NOTE:
* checking data for non-ASCII characters ... NOTE
Note: found 287 marked UTF-8 strings
I tried saving the data.frame with ascii = TRUE, hoping this would fix the problem, but it persists. Any idea how I can get R CMD check to run without this NOTE? (I would also be open to removing all UTF-8-marked strings from the example data if that's the solution.) Thank you!
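For reference, this is roughly how I saved the data (the object and file names below are simplified for this post):

# "tweets" stands in for the example data.frame pulled from the Twitter API
save(tweets, file = "data/tweets.RData", ascii = TRUE)  # the NOTE appears either way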
Example row from the data.frame:
First time in SF (@ San Francisco International Airport (SFO) - @flysfo in San Francisco, CA) https://t.co/1245xqxtwesr
favorited favoriteCount replyToSN created truncated replyToSID id replyToUID
1 FALSE 0 <NA> 2015-03-13 23:30:35 FALSE <NA> 576525795927179264 <NA>
statusSource screenName retweetCount isRetweet retweeted
1 <a href="http://foursquare.com" rel="nofollow">Foursquare</a> my_name93 0 FALSE FALSE
longitude latitude
1 -122.38100052 37.61865062
In case it's useful to anyone in the future, here is the resolution I found:
The UTF-8-marked strings were in the dataset because Twitter tweets sometimes include emojis.
The advice I was given is that there isn't a straightforward way to get rid of this NOTE in R CMD check other than removing all of the UTF-8-marked strings.
To do this, I ran the following on the column that contained the emojis:
nonUTF <- iconv(df$TroubleVector, from = "UTF-8", to = "ASCII")
iconv() returns NA for any value that cannot be converted to ASCII, so I used those NAs to subset the dataset and drop the offending rows. Now I get a clean build; a sketch of the full workflow is below.
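Putting it together, the clean-up looked roughly like this (df and TroubleVector are as above; the other names are just placeholders):

# Convert the problem column to ASCII; values containing non-ASCII
# characters (e.g. emojis) come back as NA
nonUTF <- iconv(df$TroubleVector, from = "UTF-8", to = "ASCII")

# Keep only the rows whose text converted cleanly (i.e. was pure ASCII)
dfClean <- df[!is.na(nonUTF), ]

# Re-save the cleaned data.frame for the package's data/ directory
save(dfClean, file = "data/tweets.RData")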