
How to delete the words between two delimiters?

Tags:

python

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">"?

frazman asked Jan 09 '12 05:01


2 Answers

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]

If you tried a pattern like <.+>, where the dot means any character (except a newline, by default) and the plus sign means one or more, you will have found that it does not work:

>>> re.sub(r'<.+>', '', s)
''

Why? Because regular-expression quantifiers are "greedy" by default: .+ matches as many characters as it can, so the match runs from the first < to the last > in the string and swallows every > in between. That is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character except x" (x being >).
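To see exactly what each pattern grabs, it can help to list the matches instead of replacing them. This is only a small sketch using re.findall on the string from the question; the variable names are made up for illustration.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: .+ runs on to the LAST '>' in the string.
greedy_matches = re.findall(r'<.+>', s)

# Negated class: [^>]+ cannot cross a '>', so each match stops at the next one.
negated_matches = re.findall(r'<[^>]+>', s)

print(len(greedy_matches), greedy_matches)
print(len(negated_matches), negated_matches)
```

The greedy pattern gives a single match covering the whole string, which is why the substitution leaves nothing behind. The negated class gives two separate matches, one per noisy chunk.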

Adding ? after a quantifier makes it "non-greedy" (it matches as little as possible), so this has the same effect here:

>>> re.sub(r'<.+?>', '', s)
'something something '

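The first form is more explicit; this one is less typing. Be aware that ? only makes matching non-greedy when it follows a quantifier such as + or *; after an ordinary token, x? means zero or one occurrence of x.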
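The two roles of ? are easy to mix up, so here is a minimal sketch that shows them side by side. The sample strings and variable names are just illustrations, not part of the question.

```python
import re

# After an ordinary token, ? means "zero or one": the 'u' is optional.
optional = re.findall(r'colou?r', 'color colour')

tagged = '<a>text<b>'

# After a quantifier, ? makes that quantifier lazy.
greedy = re.findall(r'<.+>', tagged)   # one match, first '<' to last '>'
lazy = re.findall(r'<.+?>', tagged)    # one match per '<...>' pair

print(optional)
print(greedy)
print(lazy)
```

Here the optional form finds both spellings, the greedy form returns the whole of tagged as one match, and the lazy form returns the two tags separately.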

Paulo Scardine answered Oct 16 '22 21:10


Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.
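If the delimiters might change, or the noise might span several lines, one option is to wrap the same idea in a small helper. This is only a sketch: remove_between is a hypothetical name, and the choice to use re.DOTALL and to strip the surrounding whitespace afterwards is an assumption about what you want, not something the question specifies.

```python
import re

def remove_between(text, start='<', end='>'):
    """Remove each span from `start` to the nearest following `end`, delimiters included."""
    # re.escape keeps delimiters such as '[' or '(' from being read as regex syntax.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    # re.DOTALL lets '.' match newlines, so spans that cross lines are removed too.
    return re.sub(pattern, '', text, flags=re.DOTALL)

s = '<@ """@$ FSDF >something something <more noise>'
cleaned = remove_between(s).strip()  # strip() drops the leftover trailing space
print(cleaned)

# Other delimiters, with the noise spread over two lines.
print(repr(remove_between('a [x\ny] b', '[', ']')))
```

Without the .strip() call the result keeps the trailing space that sat between "something something" and the second noisy chunk, just like the plain re.sub call.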

Sufian Latif answered Oct 16 '22 20:10