
How do I remove the last character of an RTL string in Python?

I am trying to remove the last character of a string in a "right-to-left" language. When I do, however, the last character wraps to the beginning of the string. e.g. ותֵיהֶם]׃ becomes ותֵיהֶם]

I know that this is a fundamental issue with how I'm handling the R-T-L paradigm, but if someone could help me think through it, I'd very much appreciate it.

CODE

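import io

# io.open decodes each line to Unicode on both Python 2 and Python 3
with io.open(r"file.txt", "r", encoding="utf-8") as f:
    for line in f:
        the_text = line.split('\t')[1]
        # replace() returns a new string, so the result must be assigned
        the_text = the_text.replace(u'\u05C3', '')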
swasheck asked Oct 25 '12 22:10


1 Answer

Some Unicode characters are always LTR, some are always RTL, and some (neutral characters such as brackets) take their direction from the surrounding context. In addition, the display context for bidirectional text has a "predominant" directionality: a text editor configured for mainly-English text would be predominantly LTR with a ragged right margin, while one configured for mainly-Hebrew text would be predominantly RTL with a ragged left margin.

What seems to have happened here is that a closing square bracket between two RTL characters is rendered as part of the RTL run (your first example). When it sits between an RTL and an LTR character, or at the end of the string (anywhere it lacks characters of the same directionality on both sides), it is treated as part of whichever run of text matches the predominant direction. Removing the final sof pasuq (U+05C3) puts the bracket in exactly that position. If you drag your mouse over the string to select the characters, you'll see that logically the closing ] still follows the ֶם even though visually it appears to have moved.
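
One way to check this is to ask Python's unicodedata module for the bidirectional class of each character. The sketch below uses a shortened sample string written with escapes (an assumption for illustration, not the asker's actual data) and shows that the string is still in the expected logical order after the replacement.

```python
import unicodedata

# Hypothetical sample in logical order: final mem, closing bracket, sof pasuq
sample = u'\u05dd]\u05c3'

# Bidirectional class per code point: 'R' is strongly RTL, 'ON' is neutral
classes = [unicodedata.bidirectional(ch) for ch in sample]
print(classes)

# Removing the sof pasuq leaves the bracket as the last code point
trimmed = sample.replace(u'\u05c3', u'')
print(trimmed.endswith(u']'))
```

The classes come out as R, ON, R: the bracket is neutral, so its visual position depends on its neighbours and on the display's base direction, while the stored order of the characters stays the same.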

If the second-to-last character in your original string were a Hebrew character (or another strongly RTL character) rather than a ], or if the display context were predominantly RTL, the trimmed string would appear the way you expect.
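
If changing the display context is not an option, one possible workaround is to append U+200F RIGHT-TO-LEFT MARK, an invisible, strongly RTL character, so that the bracket once again has RTL characters on both sides. This is only a minimal sketch and not something the answer above prescribes: `strip_sof_pasuq` and its `add_rlm` parameter are hypothetical names, and the visual result still depends on the renderer applying the Unicode bidirectional algorithm.

```python
import unicodedata

SOF_PASUQ = u'\u05c3'  # HEBREW PUNCTUATION SOF PASUQ
RLM = u'\u200f'        # RIGHT-TO-LEFT MARK: invisible, strongly RTL


def strip_sof_pasuq(text, add_rlm=True):
    """Drop one trailing sof pasuq; optionally append an RLM for display."""
    if text.endswith(SOF_PASUQ):
        text = text[:-1]
    # Only add the mark when the text now ends in a neutral character
    if add_rlm and text and unicodedata.bidirectional(text[-1]) == 'ON':
        text += RLM
    return text


# Hypothetical sample: final mem, closing bracket, sof pasuq
result = strip_sof_pasuq(u'\u05dd]\u05c3')
```

The mark becomes part of the string, so it is probably best added only to text that is about to be displayed, not to text that will be stored or compared.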

Ian Roberts answered Nov 14 '22 21:11