Is it possible to remove all punctuation while keeping emoticons such as
:-(
:)
:D
:p
Sample data (assigned to dat, the name the code below uses):
dat <-
structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce:-(",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)",
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D",
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))
I added this functionality to qdap (version > 2.0.0) as the sub_holder function. It uses the same approach as my earlier response but lightens the coding load. The sub_holder function takes a text vector and the items you want to sub out (such as emoticons). It returns a list with the text in which those items are replaced by holders (output) and a function (unhold) that swaps the holders for the original terms. Here's the code:
library(qdap)
emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
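For readers without qdap, the hold/strip/unhold idea can be sketched in base R. This is only an illustration of the technique, not qdap's actual implementation: the helper name hold_strip_unhold and the zzzholderNzzz placeholder format are made up here, and the sketch assumes the placeholders do not already occur in the text.

```r
# Hypothetical helper: protect emoticons with placeholders, strip punctuation,
# then put the emoticons back.
hold_strip_unhold <- function(x, emos) {
  # Purely alphanumeric placeholders survive the punctuation removal below
  holders <- paste0("zzzholder", seq_along(emos), "zzz")
  for (i in seq_along(emos)) {
    x <- gsub(emos[i], holders[i], x, fixed = TRUE)
  }
  # Remove every remaining punctuation character
  x <- gsub("[[:punct:]]", "", x)
  # Swap the placeholders for the original emoticons
  for (i in seq_along(emos)) {
    x <- gsub(holders[i], emos[i], x, fixed = TRUE)
  }
  x
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
res_hold <- hold_strip_unhold(c("Kudos, team! :)", "42 in the queue. X-("), emos)
print(res_hold)
```

Unlike strip, this sketch does not change case or remove digits; it only shows why swapping emoticons for placeholders makes the punctuation removal safe.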
This task can be done purely with regular expressions (many thanks to @Mike Samuel)
First we build a database of emoticons:
library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
    c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)" ";)" ":-)" ";-)" ":(" ";(" ":-(" ";-(" ":]" ";]" ":-]" ";-]" ":[" ";[" ":-[" ";-[" ":D" ";D" ":-D" ";-D"
## [21] ":o" ";o" ":-o" ";-o" ":O" ";O" ":-O" ";-O" ":P" ";P" ":-P" ";-P" ":p" ";p" ":-p" ";-p"
An exemplary input text:
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
A helper function that escapes some special characters so that they may be used in a regex pattern (using the stringi package):
library(stringi)
escape_regex <- function(r) {
stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
A regular expression to match the emoticons:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation (note that the emoticons are in a capturing group) and then replace each match with the contents of capturing group 1: if the match is an emoticon, the replacement is that emoticon; if it is a punctuation character, the replacement is empty. This works because the alternation | in ICU Regex (the regex engine used by stri_replace_all_regex) tries its alternatives from left to right, so at any given position an emoticon is matched before a single punctuation character.
stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
BTW, if you want to remove only a selected set of characters, put e.g. [.,] instead of \\p{P} above.
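The capturing-group replacement can also be wrapped into a small reusable function. The sketch below is an assumption-laden variant rather than the code above: it uses base R's gsub with perl = TRUE instead of stringi, quotes each emoticon with \Q...\E instead of calling escape_regex, and the names keep_emoticons and punct are hypothetical.

```r
# Hypothetical helper: remove punctuation but keep the listed emoticons.
# punct is the regex for the characters to remove (all punctuation by default).
keep_emoticons <- function(x, emos, punct = "[[:punct:]]") {
  # Try longer emoticons first so a shorter one cannot shadow them
  emos <- emos[order(nchar(emos), decreasing = TRUE)]
  # \Q...\E makes the regex engine treat each emoticon literally
  emo_pattern <- paste0("\\Q", emos, "\\E", collapse = "|")
  pattern <- paste0("(", emo_pattern, ")|", punct)
  # Group 1 is empty whenever the punctuation alternative matched
  gsub(pattern, "\\1", x, perl = TRUE)
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
x <- c("Kudos, really... :)", "Wait?! X-(")
res_all <- keep_emoticons(x, emos)
res_some <- keep_emoticons(x, emos, punct = "[.,]")
print(res_all)
print(res_some)
```

With the default punct every punctuation character outside an emoticon is dropped; with punct = "[.,]" only periods and commas are removed, so the ?! in the second string stays.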
My very first idea (left here mainly for "historical reasons") was to approach this problem by using look-aheads and look-behinds, but - as you see - that's far from perfect.
To remove every : and ; that is not followed by ), P, (, D, X, 8, [, or ], use a negative look-ahead:
stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"
Now we can add some old-school emoticons (with noses, e.g. :-), ;-D etc.):
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"
Now the hyphen removal (negative look-behind and look-ahead):
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?<![:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) and the salesperson said Oh boy!"
and so on. Of course, first you should build your own database of emoticons (to leave as they are) and punctuation marks (to remove). The regex depends heavily on these two sets, so adding new emoticons will be difficult; this approach is definitely not worth applying (and may twist your brain).
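As a small self-contained check of the look-ahead idea, here is the nose-aware pattern run through base R's gsub with perl = TRUE (an assumption made so the snippet does not need stringi). The sample string is invented for illustration.

```r
# Colons and semicolons are removed unless an (optionally nosed) emoticon
# mouth or eye character follows them.
x <- "He said: fine :-) ok; then :8"
res_look <- gsub("[:;](?![-]?[)P(DX8\\[\\]])", "", x, perl = TRUE)
print(res_look)
## [1] "He said fine :-) ok then :8"
```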
On the other hand, if you're allergic to complex regexes, try this. This approach has some "didactic benefits": we have full insight into what is being done in each of the following steps.
An exemplary input text (1 string only; the generalized case is left as an exercise ;)):
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
A helper function that escapes some special characters so that they may be used in a regex:
library("stringi")
escape_regex <- function(r) {
  stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
A regular expression to match the emoticons:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
Locate the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):
where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
## start end
## [1,] 1 2
## [2,] 4 5
## [3,] 7 8
## [4,] 10 11
## [5,] 13 14
## [6,] 16 17
## [7,] 23 24
## [8,] 64 65
## [9,] 67 69
Locate all the punctuation chars (here \\p{P} is the Unicode character class representing punctuation characters):
where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
## start end
## [1,] 1 1
## [2,] 2 2
## [3,] 4 4
## [4,] 7 7
## [5,] 8 8
## ...
## [26,] 72 72
## [27,] 73 73
## [28,] 99 99
## [29,] 107 107
As some punctuation chars occur within the emoticons, we should not stage them for removal:
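which_punct_omit <- vapply(seq_len(nrow(where_punct)), function(i) {
  # TRUE if the i-th punctuation char lies inside any emoticon
  any(where_punct[i,1] >= where_emots[,1] &
      where_punct[i,2] <= where_emots[,2]) }, logical(1))
where_punct <- where_punct[!which_punct_omit, , drop=FALSE] # update where_punct; drop=FALSE keeps it a matrix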
print(where_punct)
## start end
## [1,] 27 27
## [2,] 38 38
## [3,] 39 39
## [4,] 40 40
## [5,] 46 46
## [6,] 54 54
## [7,] 58 58
## [8,] 60 60
## [9,] 71 71
## [10,] 72 72
## [11,] 73 73
## [12,] 99 99
## [13,] 107 107
Each punctuation mark consists of exactly 1 character, so where_punct[,1]==where_punct[,2] always holds.
Now the final part. As you see, where_punct[,1] contains the positions of the characters to be removed. IMHO the easiest way to do that (without loops) is to convert the string to UTF-32 (each character == 1 integer), remove the undesired elements, and then go back to the textual representation:
text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1] 58 41 32 59 80 32 58 93 32 58....
if (nrow(where_punct) > 0) text_tmp <- text_tmp[-where_punct[,1]] # removal; skipped when nothing is left to remove
And the result is:
stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
Here you are.
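The generalized case (more than one string) could look roughly like the sketch below. It follows the same locate-and-remove steps but is not the code above: it relies on base R only (gregexpr and strsplit instead of stringi), the function name remove_punct_by_position is hypothetical, and the emoticons are quoted with \Q...\E rather than escaped by hand.

```r
# Hypothetical helper: position-based punctuation removal for a whole vector.
remove_punct_by_position <- function(x, emos) {
  # Try longer emoticons first so a shorter one cannot shadow them
  emos <- emos[order(nchar(emos), decreasing = TRUE)]
  emo_pattern <- paste0("\\Q", emos, "\\E", collapse = "|")
  one_string <- function(s) {
    chars <- strsplit(s, "")[[1]]
    # Positions covered by emoticons must be kept
    m <- gregexpr(emo_pattern, s, perl = TRUE)[[1]]
    starts <- as.integer(m)
    ends <- starts + attr(m, "match.length") - 1L
    protected <- if (starts[1] == -1L) integer(0) else unlist(Map(seq, starts, ends))
    # Positions of all punctuation characters
    punct <- grep("[[:punct:]]", chars)
    # Remove only the punctuation that lies outside every emoticon
    drop <- setdiff(punct, protected)
    if (length(drop) > 0) chars <- chars[-drop]
    paste(chars, collapse = "")
  }
  vapply(as.character(x), one_string, character(1), USE.NAMES = FALSE)
}

res_pos <- remove_punct_by_position(
  c("Oh boy! (really) :P", "no emoticon here."),
  c(":P", ":-)")
)
print(res_pos)
```

The explicit length(drop) > 0 check covers the empty-index pitfall mentioned above, since indexing with an empty negative index would otherwise return an empty vector.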