
How to properly concatenate bidi strings in R?

I want to add markup to Urdu text, which is written right to left. I am trying to use gsub for this, but nothing I have tried so far produces the desired output.

text <- "یہ جملہ ایک مثال کے لیے استعمال کیا جا رہا ہے"
pattern <- "کیا جا"
replaceWith <- paste0("<somemark>", pattern, "</somemark>")
gsub(pattern, replaceWith, text)

gsub returns the following:

یہ جملہ ایک مثال کے لیے استعمال <somemark>کیا جا</somemark> رہا ہے

Desired output (posted as an image in the original question):

How can I achieve the desired output?

Note: I could not even typeset the desired output properly in my post, so I had to rely on an image instead.

Update: Although the mysub function below concatenates the strings correctly (in the console), I still get the text in the wrong order in my Shiny app.

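mysub <- function(text, pattern){
  # Wrap the pattern in the markup tags
  replaceWith <- paste0("<somemark>", pattern, "</somemark>")
  # Character position of the first match of the pattern in the text
  start <- regexpr(pattern, text)[1]
  beforePattern <- substr(text, 1, start - 1)
  afterPattern <- substr(text, start + nchar(pattern), nchar(text))
  # Note: the pieces are pasted in reverse order (after, markup, before)
  result <- paste(afterPattern, replaceWith, beforePattern)
  result
}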
Imran Ali asked Nov 11 '16 03:11



2 Answers

There is actually no problem with gsub:

text <- dput("یہ جملہ ایک مثال کے لیے استعمال کیا جا رہا ہے")
"<U+06CC><U+06C1> <U+062C><U+0645><U+0644><U+06C1> <U+0627><U+06CC><U+06A9>
<U+0645><U+062B><U+0627><U+0644> <U+06A9><U+06D2> <U+0644><U+06CC><U+06D2> 
<U+0627><U+0633><U+062A><U+0639><U+0645><U+0627><U+0644> <U+06A9><U+06CC>
<U+0627> <U+062C><U+0627> <U+0631><U+06C1><U+0627> <U+06C1><U+06D2>"

pattern <- dput("کیا جا")
"<U+06A9><U+06CC><U+0627> <U+062C><U+0627>"

replaceWith <- dput(paste0("<somemark>", pattern, "</somemark>"))
"<somemark><U+06A9><U+06CC><U+0627> <U+062C><U+0627></somemark>"

dput(gsub(pattern, replaceWith, text))
"<U+06CC><U+06C1> <U+062C><U+0645><U+0644><U+06C1> <U+0627><U+06CC><U+06A9> 
<U+0645><U+062B><U+0627><U+0644> <U+06A9><U+06D2> <U+0644><U+06CC><U+06D2> 
<U+0627><U+0633><U+062A><U+0639><U+0645><U+0627><U+0644> <somemark><U+06A9>
<U+06CC><U+0627> <U+062C><U+0627></somemark> <U+0631><U+06C1><U+0627> 
<U+06C1><U+06D2>"
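The same point can be checked without reading escaped code points by eye. The following is a minimal sketch, not part of the original answer: it rebuilds the strings from the code points shown in the dput() output and compares character positions in the stored (logical) string. The names marked, open_tag, close_tag and logical_order_ok are introduced here for illustration only.

```r
# Strings are written with \u escapes (code points taken from the dput()
# output above) so the example does not depend on the source file encoding.
text <- paste(
  "\u06cc\u06c1", "\u062c\u0645\u0644\u06c1", "\u0627\u06cc\u06a9",
  "\u0645\u062b\u0627\u0644", "\u06a9\u06d2", "\u0644\u06cc\u06d2",
  "\u0627\u0633\u062a\u0639\u0645\u0627\u0644", "\u06a9\u06cc\u0627",
  "\u062c\u0627", "\u0631\u06c1\u0627", "\u06c1\u06d2"
)
pattern   <- "\u06a9\u06cc\u0627 \u062c\u0627"
open_tag  <- "<somemark>"
close_tag <- "</somemark>"

# fixed = TRUE treats the pattern as a literal string, not a regular expression
marked <- gsub(pattern, paste0(open_tag, pattern, close_tag), text, fixed = TRUE)

# Character positions in the stored (logical) string
pos_open    <- regexpr(open_tag, marked, fixed = TRUE)[1]
pos_pattern <- regexpr(pattern, marked, fixed = TRUE)[1]
pos_close   <- regexpr(close_tag, marked, fixed = TRUE)[1]

# In logical order the opening tag is immediately followed by the pattern,
# and the pattern is immediately followed by the closing tag.
logical_order_ok <- pos_pattern == pos_open + nchar(open_tag) &&
  pos_close == pos_pattern + nchar(pattern)
print(logical_order_ok)
```

If this prints TRUE, the stored string is in the intended order and any surprise is in how the string is displayed, not in what gsub produced.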

The rendering of the result (a string containing both right-to-left and left-to-right characters) also seems quite logical to me:

  1. The beginning of the string contains right-to-left characters, so it is rendered from right to left:

یہ جملہ ایک مثال کے لیے استعمال

  2. Then the string continues with left-to-right characters. They are rendered left to right and added at the end (to the left of what was previously rendered):

یہ جملہ ایک مثال کے لیے استعمال <somemark>

  3. Then the string continues with right-to-left characters. They are rendered right to left and added at the end:

یہ جملہ ایک مثال کے لیے استعمال <somemark>کیا جا

  4. Then the string continues with left-to-right characters. They are rendered left to right and added at the end:

یہ جملہ ایک مثال کے لیے استعمال <somemark>کیا جا</somemark>

  5. Finally, the string ends with right-to-left characters. They are rendered right to left and added at the end:

یہ جملہ ایک مثال کے لیے استعمال <somemark>کیا جا</somemark> رہا ہے
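The run-by-run description above can be sketched in code. This is a deliberately simplified illustration and not the Unicode Bidirectional Algorithm: it assumes every character in the Arabic block (U+0600 to U+06FF) is right to left, every space inherits the direction of the preceding character, and everything else is left to right. The real algorithm treats punctuation such as < and > as neutral rather than strongly left to right, so actual renderers can differ. All variable names are introduced here for illustration.

```r
# A shortened version of the marked-up string, written with \u escapes
marked <- paste0(
  "\u0627\u0633\u062a\u0639\u0645\u0627\u0644 ",   # right-to-left word before the tag
  "<somemark>", "\u06a9\u06cc\u0627 \u062c\u0627", "</somemark>",
  " \u0631\u06c1\u0627 \u06c1\u06d2"               # right-to-left words after the tag
)

code_points <- utf8ToInt(marked)

# Simplified classification (an assumption of this sketch):
# Arabic block = RTL, space = neutral (NA for now), everything else = LTR.
direction <- ifelse(code_points >= 0x0600 & code_points <= 0x06FF, "RTL",
                    ifelse(code_points == 0x20, NA, "LTR"))

# Each neutral space inherits the direction of the preceding character.
for (i in seq_along(direction)) {
  if (is.na(direction[i])) {
    direction[i] <- if (i > 1) direction[i - 1] else "RTL"
  }
}

# Collapse consecutive characters with the same direction into runs
runs   <- rle(direction)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
run_text <- mapply(function(s, e) intToUtf8(code_points[s:e]), starts, ends)

print(data.frame(direction = runs$values, n_chars = runs$lengths))
```

Under these assumptions the string splits into five alternating runs, which matches the five steps listed above.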

Your idea of what should be rendered doesn't seem more logical to me, but I must admit I have no experience with right-to-left text rendering.

Anyway, if the markup is meant to be interpreted by the renderer, like the <b>...</b> tags in HTML, then it works perfectly (in Markdown/HTML):

یہ جملہ ایک مثال کے لیے استعمال <b>کیا جا</b> رہا ہے

renders as

یہ جملہ ایک مثال کے لیے استعمال کیا جا رہا ہے
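When the result is meant for a browser, one option is to state the base direction explicitly with the standard HTML dir attribute, so the renderer does not have to guess it. The sketch below is an illustration under that assumption; highlight_rtl is a hypothetical helper name, not a function from any package, and it does not HTML-escape its input.

```r
# Hypothetical helper: wrap the literal pattern in a tag and put the whole
# string in a paragraph whose base direction is right to left.
highlight_rtl <- function(text, pattern, tag = "b") {
  replacement <- paste0("<", tag, ">", pattern, "</", tag, ">")
  marked <- gsub(pattern, replacement, text, fixed = TRUE)
  paste0('<p dir="rtl">', marked, "</p>")
}

# The last five words of the example sentence, written with \u escapes
text    <- "\u0627\u0633\u062a\u0639\u0645\u0627\u0644 \u06a9\u06cc\u0627 \u062c\u0627 \u0631\u06c1\u0627 \u06c1\u06d2"
pattern <- "\u06a9\u06cc\u0627 \u062c\u0627"

html <- highlight_rtl(text, pattern)
print(nchar(html))
```

The string in html can be written to a file or passed to whatever component renders HTML. The stored order is the same as the one gsub produces, and only the declared base direction is added.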

I have not managed to print anything in Shiny but question marks:

???? ???????? ?????? ???????? ???? ?????? ?????????????? <somemark>?????? ????</somemark> ?????? ????
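Question marks like these often suggest that the text was converted to an encoding that cannot represent the characters, although that is only an assumption about this case. A minimal sketch of how the encoding of a string could be inspected in base R before it is sent to the browser (pattern_utf8 is a name introduced here):

```r
pattern <- "\u06a9\u06cc\u0627 \u062c\u0627"

# Declared encoding of the string as R sees it
print(Encoding(pattern))

# Convert to UTF-8 (a no-op if the string is already UTF-8) and validate
pattern_utf8 <- enc2utf8(pattern)
print(validUTF8(pattern_utf8))

# The code points should be the Arabic-block values, not 63 (the "?" character)
print(utf8ToInt(pattern_utf8))

# Whether the current session runs in a UTF-8 locale
print(l10n_info()[["UTF-8"]])
```

If the code points come back as 63, the characters were already lost before this step, for example when the source file was read in a non-UTF-8 encoding.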

HubertL answered Oct 21 '22 22:10



I gave it a try. I did take the liberty of hard-coding the arguments instead of reading them from the session, though.

Server: 

output$mysub <- function(){ # (text=NULL, pattern=NULL)
  # The arguments are hard coded here instead of being read from the session.
  # In current Shiny, text outputs are normally created with renderText().

  text <- "یہ جملہ ایک مثال کے لیے استعمال کیا جا رہا ہے"
  pattern <- "کیا جا"

  # Declare both strings as UTF-8
  Encoding(text) <- "UTF-8"
  Encoding(pattern) <- "UTF-8"

  print(text)

  # Split the text around the first match of the pattern
  beforePattern <- substr(text, 1, regexpr(pattern, text)[1]-1)
  afterPattern <- substr(text, regexpr(pattern,text)[1] + nchar(pattern), nchar(text))

  # Paste the pieces in reverse order (after, markup, before)
  replaceWith <- paste0("<somemark>", pattern, "</somemark>")
  result <- paste(afterPattern, replaceWith, beforePattern)

  # result <- paste( beforePattern, replaceWith, afterPattern)
  # Encoding(result) <- "UTF-8"
  print(length(result))
  print(result)

  return(result)
}


# ui.R: 

h2( textOutput("mysub") )

The output I got on the Shiny web page was (image in the original answer: bidi text output).
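As a possible variation, and not what was run for the output above, the markup could be left in logical order and sent to the browser as HTML inside a container with dir = "rtl". textOutput() escapes HTML, so tags placed in the text are shown literally; the sketch therefore uses uiOutput() with renderUI() and HTML(). build_marked_html and the output id "marked" are hypothetical names, and the app itself is only defined, not launched.

```r
# Hypothetical helper: wrap the literal pattern in <b> tags, keeping the
# string in its stored (logical) order.
build_marked_html <- function(text, pattern) {
  gsub(pattern, paste0("<b>", pattern, "</b>"), text, fixed = TRUE)
}

# The last five words of the example sentence, written with \u escapes
text    <- "\u0627\u0633\u062a\u0639\u0645\u0627\u0644 \u06a9\u06cc\u0627 \u062c\u0627 \u0631\u06c1\u0627 \u06c1\u06d2"
pattern <- "\u06a9\u06cc\u0627 \u062c\u0627"

if (requireNamespace("shiny", quietly = TRUE)) {
  # The div declares a right-to-left base direction for everything inside it
  ui <- shiny::fluidPage(
    shiny::tags$div(dir = "rtl", shiny::uiOutput("marked"))
  )

  server <- function(input, output, session) {
    output$marked <- shiny::renderUI({
      # HTML() marks the string as markup so the <b> tags are interpreted
      shiny::HTML(build_marked_html(text, pattern))
    })
  }

  # shiny::shinyApp(ui, server)  # uncomment to launch the app
}
```

With this layout the browser handles the bidirectional ordering, so the pieces do not need to be pasted in reverse order on the R side.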

R.S. answered Oct 21 '22 23:10
