Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove punctuation but keeping emoticons?

Is that possible to remove all the punctuations but keeping the emoticons such as

:-(

:)

:D

:p

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))
like image 743
user3456230 Avatar asked May 03 '14 16:05

user3456230


People also ask

Do you punctuate after emojis?

Put emojis after the punctuation in a sentence. In English, we generally place symbols at the end of a sentence, which is why it's time to formalize this rule when using emojis with other writing. If you're going to include an emoji (or a series of emojis) in your sentence, add them after your punctuation.

How do you exclude emojis?

TouchRetouch is an app that is for object removal . You can remove any kind of object, smiley, emoji, or stickers from a photo with this app easily. It is also available on iOS and Android. By using TouchRetouch, you can just mark the area of the object, sticker, or emoji and tap on the erase button.

How do I remove emoticons from tweets in Python?

To remove the emojis, we set the parameter no_emoji to True .


2 Answers

I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically this function uses the response I gave above but lightens the coding load. The sub_holder function takes a text vector and items you want to sub out (such as emoticons). It returns a list with:

  1. the vector of the test with the items subbed out for place holders
  2. A function (called unhold) that swaps the holders for the original terms

Here's the code:

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
like image 191
Tyler Rinker Avatar answered Sep 30 '22 12:09

Tyler Rinker


1. A working pure-regex solution (a.k.a. Edit#2)

This task can be done purely with regular expressions (many thanks to @Mike Samuel)

First we build a database of emoticons:

(emots <- as.character(outer(c(":", ";", ":-", ";-"),
+                c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

An exemplary input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation (note that emoticons are in a capturing group) and then replace the matches with the result of capturing group 1 (so if it's an emoticon, we have replacement=this emoticon, if it's a punctuation char, we have replacement=nothing). This will work because the alternation with | in ICU Regex (which is the regex engine used in stri_replace_all_regex) is greedy and left-biased: emoticons will be matched earlier than the punctuation characters.

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

BTW, if you want to get rid only of a selected set of characters, put e.g. [.,] instead of [\\p{P}] above.

2. Regex solution hint - my first (not wise) attempt (a.k.a. original answer)

My very first idea (left here mainly for "historical reasons") was to approach this problem by using look-aheads and look-behinds, but - as you see - that's far from perfect.

To remove all : and ; not followed by ), (, D, X, 8, [, or ] use negative look-behind:

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (with noses, e.g. :-), ;-D etc.)

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now the hyphens removal (negative look behind and look ahead)

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?!<[:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

and so on. Of course, first you should build your own database of emoticons (to leave as they are) and punctuation marks (to remove). The regex will highly depend on these two sets, so it will be difficult to add new emoticons --- it's definitely not worth applying (and may twist your brain).

3. The second attempt (regex-dumb friendlier, a.k.a. Edit#1)

On the other hand, if you're allergic to complex regexes, try this. This approach has some "didactic benefits" - we have full insight on what's being done in each of the following steps:

  1. Locate all emoticons within text;
  2. Locate all the punctuation characters within text;
  3. Find positions of the punctuation characters that are not parts of emoticons;
  4. Remove the characters located in step 3.

An exemplary input text - 1 string only - a generalized case is left as an exercise ;)

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Locate the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Locate all the punctuation chars (Here \\p{P} is the Unicode character class representing punctuation characters):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

As some punctuation chars occur within the emoticons, we should not stage them for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

Each punctuation mark surely consists only of 1 character, thus always where_punct[,1]==where_punct[,2].

Now the final part. As you see, where_punct[,1] contains the positions of characters to be removed. IMHO the easiest way to do that (without loops) is by converting a string to UTF-32 (each character == 1 integer), remove undesired elements, and then go back to the textual representation:

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

And the result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

Here you are.

like image 32
gagolews Avatar answered Sep 30 '22 12:09

gagolews