
Regular Expressions: data cleaning using a series of regexes in R

Tags:

regex

r

A list of keywords contains both Chinese characters and English words, like this:

[1] "    服务 接口   知识 组织   开放 查询   语义 推理   Web   服务 "                                                                        
[2] "    Solr   分面 搜索   标准 信息管理 "  
[3] "  语义   W i k i   标注   导航   检索   S e m a n t i c M e d i a W i k i   P A U X   I k e W i k i    "
[4] "  Liferay   主从 模式   集成 知识 平台    " 
[5] "    数据 摄取   SKE   本体   属性 映射   三元组 存储    "

Some of the English words have a space between each character (as in the 3rd row): “W i k i”, “S e m a n t i c M e d i a W i k i”, “P A U X”, “I k e W i k i”. The words themselves are separated by more than two spaces. I am trying to delete the spaces inside these English words to get “Wiki”, “SemanticMediaWiki”, “PAUX”, “IkeWiki”, while keeping all other words as they are. I first used “gsub” like this: “kwdict<-gsub("^[[:alpha:][:blank:]]+", "\\w", kwdict)”. But no matter whether I use "\w" or “[[:alpha:]]”, the results are wrong: all the words are changed. How can I select only these English words and delete the spaces inside them? The desired result is:

[1] "    服务 接口   知识 组织   开放 查询   语义 推理   Web   服务 "
[2] "    Solr   分面 搜索   标准 信息管理 "
[3] "  语义   Wiki   标注   导航   检索   SemanticMediaWiki   PAUX   IkeWiki    "
[4] "  Liferay   主从 模式   集成 知识 平台    "
[5] "    数据 摄取   SKE   本体   属性 映射   三元组 存储    "

I tried each of the following statements in R separately:

kwdict<-gsub("[[:alpha:]/[:space:]{1}]", "", kwdict)
kwdict<-gsub("[^[:alpha:]_[:space:]]{1}", "", kwdict)
kwdict<-gsub("[^[:alpha:][:space:]]{1}", "", kwdict)
kwdict<-gsub("[^[:alpha:][:space:]{1}^[:alpha:]]", "", kwdict)
kwdict<-gsub("[//>[:space:]{1}]", "", kwdict)
kwdict<-gsub("[[:alpha:][:space:]{1}]", "", kwdict)

But they either did nothing, deleted all the spaces, or even removed all the words! I think this is because the pattern includes “[:alpha:]”, which I used as the starting mark for locating the space character. Is there a way to define this pattern correctly in R?

赵鸿丰 asked Oct 29 '22 11:10


1 Answer

Thanks to the comments by @赵鸿丰 and @waterling.

I think I have found the source of your problem: the words that look like English letters are not ASCII. They are actually fullwidth Latin capital and small letters, which are separate Unicode characters. Some of the words, however, are in plain English letters ("Solr" and "Liferay").
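One way to check this, as a minimal sketch in base R, is to compare the code point of a fullwidth letter with that of its ASCII look-alike. The fullwidth letter is written with a `\u` escape so the example does not depend on how the console renders it; the variable names are only illustrative.

```r
# Fullwidth "W" (U+FF37) written as an escape, next to plain ASCII "W" (U+0057).
wide_w  <- "\uFF37"
ascii_w <- "W"

# utf8ToInt() returns the Unicode code point of a single character.
cp_wide  <- utf8ToInt(wide_w)
cp_ascii <- utf8ToInt(ascii_w)

# The two letters look alike but are different characters.
identical(wide_w, ascii_w)      # FALSE
sprintf("U+%04X", cp_wide)      # "U+FF37"
sprintf("U+%04X", cp_ascii)     # "U+0057"

# An ASCII-only character class does not match the fullwidth letter.
grepl("[A-Za-z]", wide_w, perl = TRUE)   # FALSE
```

This is why patterns written for ordinary English letters can behave unexpectedly on this data.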

Run the command below to mark the strings as UTF-8 (you may not need this step; I find it easier to inspect text in UTF-8, and searching the web for UTF-8 values gave me somewhat better results):

string <- c("    服务 接口   知识 组织   开放 查询   语义 推理   Web   服务 ",      
             "    Solr   分面 搜索   标准 信息管理 "  ,
             "  语义   W i k i   标注   导航   检索   S e m a n t i c M e d i a W i k i   P A U X   I k e W i k i    ",
             "  Liferay   主从 模式   集成 知识 平台    " ,
             "    数据 摄取   SKE   本体   属性 映射   三元组 存储    ")

Encoding(string) <- "UTF-8"

Once you run the above command, you can see the code values attached to these characters. I searched the internet to find out what these values mean and came across this site, which helped me understand the values associated with them.

So I wrote a small regex to solve your problem. I used the stringr package, but you could use any other package or base R gsub.

library(stringr)
value <- str_replace_all(string, '(?<=[\U{FF41}-\U{FF5A}]|[\U{FF21}-\U{FF3A}])\\s*', "")

To understand the regex:

The character classes (in square brackets) contain the Unicode ranges of the fullwidth Latin small letters (U+FF41 to U+FF5A) and capital letters (U+FF21 to U+FF3A), which I found on the site mentioned above. I put them inside a lookbehind assertion, followed by \s, which denotes whitespace. The regex matches the whitespace that follows any of these letters and replaces it with nothing. This way, I got the result below, which I hope is what you are expecting. Since you may not be able to see these characters on your console, you can use the str_view_all function to view them rendered as HTML. I copied and pasted the results from there.

服务 接口 知识 组织 开放 查询 语义 推理 Web 服务
Solr 分面 搜索 标准 信息管理
语义 Wiki标注 导航 检索 SemanticMediaWikiPAUXIkeWiki
Liferay 主从 模式 集成 知识 平台
数据 摄取 SKE 本体 属性 映射 三元组 存储
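In the output above, the spaces that separate the fullwidth words are removed as well, because the lookbehind deletes every run of whitespace that follows a fullwidth letter. If the words should stay separate, one possible variant (a sketch, not part of the original answer) deletes a space only when it sits between two fullwidth letters. It assumes, as in the sample data, that letters inside a word are separated by exactly one space and that words are separated by more than one. The example uses base R gsub with perl = TRUE; `fw`, `pattern`, `x` and `result` are illustrative names.

```r
# Fullwidth Latin letters: U+FF21-U+FF3A (capitals) and U+FF41-U+FF5A (small).
fw <- "[\\x{FF21}-\\x{FF3A}\\x{FF41}-\\x{FF5A}]"

# Exactly one space that has a fullwidth letter on each side.
pattern <- paste0("(?<=", fw, ") (?=", fw, ")")

# Sample input built from escapes: fullwidth "W i k i", three spaces,
# then fullwidth "P A U X".
x <- paste0("\uFF37 \uFF49 \uFF4B \uFF49", "   ", "\uFF30 \uFF21 \uFF35 \uFF38")

# Remove the in-word spaces; the three-space separator has no fullwidth
# letter on both sides of any single space, so it is left alone.
result <- gsub(pattern, "", x, perl = TRUE)
result
```

The lookahead is what keeps the word boundaries: a space followed by another space never matches.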

I hope this explains the solution to your problem in detail. Thank you!

After the OP's comment, it seems they want to replace the fullwidth Latin letters with normal letters. An external file (NamesList.txt) is used for the Unicode mapping; this file can be found at this link.

library(stringr)
library(Unicode)  ## provides as.u_char() and u_char_from_name(), both used below
rd_dt <- readLines("NamesList.txt", encoding = "UTF-8")

## Keep only the NamesList.txt lines that contain a four-character code followed by a tab
rd_dt1 <- rd_dt[grep("[[:alnum:]]{4}\t.*", rd_dt)]

rd_dt1 <- read.delim(textConnection(rd_dt1), sep = "\t", stringsAsFactors = FALSE)
rd_dt1 <- rd_dt1[, 1:2]
names(rd_dt1) <- c("UTF_8_values", "Symbol")
## Keep only the fullwidth Latin entries
rd_dt1 <- rd_dt1[grep("LATIN", rd_dt1$Symbol), ]
rd_dt1 <- rd_dt1[grep("WIDTH", rd_dt1$Symbol), ]
## The last character of each character name is the letter itself
value <- substr(rd_dt1$Symbol, nchar(trimws(rd_dt1$Symbol)), nchar(trimws(rd_dt1$Symbol)))
rd_dt1$value <- value
### Assign the capital or small English letter to the corresponding fullwidth capital or small letter
is_capital <- grepl("CAPITAL", rd_dt1$Symbol)
capital_small <- ifelse(is_capital, toupper(rd_dt1$value), tolower(rd_dt1$value))
rd_dt1$capital_small <- capital_small
rd_dt1 <- rd_dt1[, c(1, 2, 4)]
### Text taken from the OP's data; there it consists of fullwidth Latin letters,
### even if it is displayed as ordinary English letters here
dt <- c('SemanticMediaWikiPAUXIkeWiki')
### Check that the code points of the OP's text appear among those from the names list
as.u_char(utf8ToInt(dt)) %in% u_char_from_name(rd_dt1$Symbol)

Final answer for the conversion:

paste0(rd_dt1[match(utf8ToInt(dt),u_char_from_name(rd_dt1$Symbol)),"capital_small"],collapse="")

Result:

> paste0(rd_dt1[match(utf8ToInt(dt),u_char_from_name(rd_dt1$Symbol)),"capital_small"],collapse="")
[1] "SemanticMediaWikiPAUXIkeWiki"
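If downloading and parsing NamesList.txt is not convenient, the same mapping can be sketched with plain arithmetic, because the fullwidth forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E. The helper `to_ascii` below is a hypothetical name, not a function from any package, and it only touches that range; every other character, including Chinese text, is left unchanged.

```r
# Convert fullwidth ASCII variants (U+FF01-U+FF5E) to their ASCII counterparts.
# `to_ascii` is an illustrative helper, not a function from any library.
to_ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                        # code points of one string
    is_wide <- cp >= 0xFF01L & cp <= 0xFF5EL  # which ones are fullwidth forms
    cp[is_wide] <- cp[is_wide] - 0xFEE0L      # shift them into the ASCII range
    intToUtf8(cp)                             # rebuild the string
  }, character(1), USE.NAMES = FALSE)
}

# Fullwidth "Wiki" written with escapes, a space, then two CJK characters.
wide <- "\uFF37\uFF49\uFF4B\uFF49 \u6807\u6CE8"
converted <- to_ascii(wide)
converted
```

Because the function works on code points only, it can be applied after the in-word spaces have been removed, to whole keyword strings rather than to a single extracted word.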

CAVEAT: The above code works well on macOS Sierra with R 3.3. On Windows, however, the RStudio console automatically converts everything into the corresponding English text, and I am unable to see the UTF-8 codes for these characters. I have not been able to determine the reason.

EDIT:

I recently found that the stringi package has a function called stri_trans_general that can do this task much more efficiently. Once the spaces are removed using the regex mentioned above, we can directly transliterate the fullwidth Latin letters using the code below:

dt <- c('SemanticMediaWikiPAUXIkeWiki')

stringi::stri_trans_general(dt, "latin-ascii")

The result is the same as the one shown above.

PKumar answered Nov 15 '22 06:11