Remove punctuation but keeping emoticons?

Tags:

Is that possible to remove all the punctuations but keeping the emoticons such as

:-(

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

743

asked May 03 '14 16:05

user3456230

2 Answers

I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically this function uses the response I gave above but lightens the coding load. The sub_holder function takes a text vector and items you want to sub out (such as emoticons). It returns a list with:

the vector of the test with the items subbed out for place holders
A function (called unhold) that swaps the holders for the original terms

Here's the code:

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

191

answered Sep 30 '22 12:09

Tyler Rinker

1. A working pure-regex solution (a.k.a. Edit#2)

This task can be done purely with regular expressions (many thanks to @Mike Samuel)

First we build a database of emoticons:

(emots <- as.character(outer(c(":", ";", ":-", ";-"),
+                c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

An exemplary input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation (note that emoticons are in a capturing group) and then replace the matches with the result of capturing group 1 (so if it's an emoticon, we have replacement=this emoticon, if it's a punctuation char, we have replacement=nothing). This will work because the alternation with | in ICU Regex (which is the regex engine used in stri_replace_all_regex) is greedy and left-biased: emoticons will be matched earlier than the punctuation characters.

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

BTW, if you want to get rid only of a selected set of characters, put e.g. [.,] instead of [\\p{P}] above.

2. Regex solution hint - my first (not wise) attempt (a.k.a. original answer)

My very first idea (left here mainly for "historical reasons") was to approach this problem by using look-aheads and look-behinds, but - as you see - that's far from perfect.

To remove all : and ; not followed by ), (, D, X, 8, [, or ] use negative look-behind:

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (with noses, e.g. :-), ;-D etc.)

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now the hyphens removal (negative look behind and look ahead)

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?!<[:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

and so on. Of course, first you should build your own database of emoticons (to leave as they are) and punctuation marks (to remove). The regex will highly depend on these two sets, so it will be difficult to add new emoticons --- it's definitely not worth applying (and may twist your brain).

3. The second attempt (regex-dumb friendlier, a.k.a. Edit#1)

On the other hand, if you're allergic to complex regexes, try this. This approach has some "didactic benefits" - we have full insight on what's being done in each of the following steps:

Locate all emoticons within text;
Locate all the punctuation characters within text;
Find positions of the punctuation characters that are not parts of emoticons;
Remove the characters located in step 3.

An exemplary input text - 1 string only - a generalized case is left as an exercise ;)

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Locate the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Locate all the punctuation chars (Here \\p{P} is the Unicode character class representing punctuation characters):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

As some punctuation chars occur within the emoticons, we should not stage them for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

Each punctuation mark surely consists only of 1 character, thus always where_punct[,1]==where_punct[,2].

Now the final part. As you see, where_punct[,1] contains the positions of characters to be removed. IMHO the easiest way to do that (without loops) is by converting a string to UTF-32 (each character == 1 integer), remove undesired elements, and then go back to the textual representation:

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

And the result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

Here you are.

answered Sep 30 '22 12:09

gagolews

Related questions
                            
                                What is the limit in size of a Php variable when storing string?
                            
                                Replace a sequence of values by group depending on preceeding values
                            
                                Why is this appearing in my c# strings: Â£
                            
                                Returning a std::string from a C++ DLL to a c# program -> Invalid Address specified to RtlFreeHeap
                            
                                String.comparison performance (with trim)
                            
                                Are strings created with + concatenation stored in the string pool?
                            
                                MeasureString and DrawString difference
                            
                                Java 7 switch statement with strings not working
                            
                                What are some good methods to replace string names with integer hashes
                            
                                How to split a string by x amount of characters
                            
                                How can I convert a std::basic_string type to an array of char type?
                            
                                Converting array string to string and back in Java
                            
                                appending character array to string c++
                            
                                Array map and parseInt woes [duplicate]
                            
                                file.readlines leaving blank lines [duplicate]
                            
                                C - why is only char array null terminated? [closed]
                            
                                Why does "auto" declare strings as const char* instead of std::string?
                            
                                Are VBA strings immutable?
                            
                                String ordering in Lua
                            
                                Optimizing strings in Cython

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Remove punctuation but keeping emoticons?

Tags:

string

text

r

gsub

emoticons