Select rows from data.frame ending with a specific character string in R

Tags: string, regex, dataframe, r, character

I'm using R and I have a data.frame with nearly 2,000 entries that looks as follows:

> head(PVs,15)
     LogFreq   Word PhonCV  FreqDev
1593     140    was    CVC 5.480774
482      139    had    CVC 5.438114
1681     138    zou   CVVC 5.395454
1662     137    zei    CVV 5.352794
1619     136   werd   CVCC 5.310134
1592     135  waren CVV-CV 5.267474
620      134    kon    CVC 5.224814
646      133   kwam   CCVC 5.182154
483      132 hadden CVC-CV 5.139494
436      131   ging    CVC 5.096834
734      130  moest  CVVCC 5.054174
1171     129  stond  CCVCC 5.011514
1654     128    zag    CVC 4.968854
1620     127 werden CVC-CV 4.926194
1683     126 zouden CVV-CV 4.883534

I want to create a new data.frame that is identical to PVs, except that every row whose "Word" entry does NOT end in either "te" or "de" is removed. In other words, only the words ending in "de" or "te" should remain in the data.frame.

I know how to selectively remove entries from data.frames using logical operators, but I have only used those with numeric criteria. I think I need regular expressions for this, but sadly R is the only programming language I "know", so I'm far from knowing what kind of code to use here.

I appreciate your help. Thanks in advance.


asked Oct 22 '12 13:10

HernanLG

3 Answers

Method 1

You can use grepl with an appropriate regular expression. Consider the following:

x <- c("blank","wade","waste","rubbish","dedekind","bated")
grepl("^.+(de|te)$",x)
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE

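The regular expression says: start at the beginning of the string (^), match any character one or more times (.+), then match either de or te ((de|te)), and then require the end of the string ($). Because .+ needs at least one character, a word that consists only of "de" or "te" does not match; the shorter pattern (de|te)$ would match it as well.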
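To see what each piece of the pattern contributes, here is a small sketch that runs three variants over the same vector. The extra element "de" and the variable names are additions for illustration and are not part of the original answer.

```r
x <- c("blank", "wade", "waste", "rubbish", "dedekind", "bated", "de")

# Original pattern: at least one character must precede the suffix,
# so the bare word "de" is rejected
with_prefix <- grepl("^.+(de|te)$", x)

# Suffix-only pattern: "$" alone pins the match to the end of the string,
# so "de" is accepted too
suffix_only <- grepl("(de|te)$", x)

# Without "$" the pattern matches "de" or "te" anywhere in the string,
# which wrongly accepts "dedekind" and "bated"
unanchored <- grepl("(de|te)", x)

data.frame(x, with_prefix, suffix_only, unanchored)
```

The end anchor is what does the real work here; the leading ^.+ only matters if words consisting of nothing but the suffix should be excluded.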

So for your data.frame try,

subset(PVs,grepl("^.+(de|te)$",Word))
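For a self-contained run, the sketch below applies the same idea to a toy data frame. PVs_toy and the rows "maakte" and "hoorde" are made up for illustration, since the head of PVs shown in the question contains no words ending in de or te. It also shows bracket indexing as an alternative to subset.

```r
# Toy stand-in for PVs; the last two rows are hypothetical
PVs_toy <- data.frame(
  LogFreq = c(140, 139, 120, 110),
  Word    = c("was", "had", "maakte", "hoorde"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where Word ends in "de" or "te"
keep <- grepl("(de|te)$", PVs_toy$Word)

# Two ways to keep only the matching rows
by_subset  <- subset(PVs_toy, keep)
by_bracket <- PVs_toy[keep, ]

by_bracket
```

grepl coerces its input with as.character, so this should also work if Word is stored as a factor, which was the default for data.frame in R versions before 4.0.0.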

Method 2

To avoid regular expressions, you can extract the last two characters of each string with substr and compare them directly.

# substr the last two characters and test
substr(x,nchar(x)-1,nchar(x)) %in% c("de","te")
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE

So try:

subset(PVs,substr(Word,nchar(Word)-1,nchar(Word)) %in% c("de","te"))
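A third regex-free option, not part of the original answer, is base R's endsWith, which is available from R 3.3.0 onward. The sketch below checks it against the substr approach on the same vector; the variable names are illustrative.

```r
x <- c("blank", "wade", "waste", "rubbish", "dedekind", "bated")

# endsWith() tests one suffix at a time, so combine the two tests with |
ends_de_or_te <- endsWith(x, "de") | endsWith(x, "te")

# Cross-check against the substr approach from Method 2
via_substr <- substr(x, nchar(x) - 1, nchar(x)) %in% c("de", "te")

identical(ends_de_or_te, via_substr)
```

Both nchar and endsWith expect character input, so if Word is a factor it would need to be wrapped in as.character(Word) first.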


answered Sep 20 '22 15:09

James

I modified the data a bit so that there were words that ended in te or de.

> PV
     LogFreq   Word PhonCV  FreqDev
1593     140 blahte    CVC 5.480774
482      139    had    CVC 5.438114
1681     138 aaaade   CVVC 5.395454
1662     137    zei    CVV 5.352794
1619     136   werd   CVCC 5.310134
1592     135  waren CVV-CV 5.267474
620      134    kon    CVC 5.224814
646      133 kwamde   CCVC 5.182154
483      132 hadden CVC-CV 5.139494
436      131   ging    CVC 5.096834
734      130 moeste  CVVCC 5.054174
1171     129  stond  CCVCC 5.011514
1654     128  zagde    CVC 4.968854
1620     127 werden CVC-CV 4.926194
1683     126 zouden CVV-CV 4.883534

# Add a column to PV so you can visually check which words the regular expression matches.
PV$Match <- grepl(pattern = "(de|te)$", PV$Word)

# Subset the PV data frame to keep only the rows where Match is TRUE
PV <- PV[PV$Match, ]

The result is shown below

     LogFreq   Word PhonCV  FreqDev Match
1593     140 blahte    CVC 5.480774  TRUE
1681     138 aaaade   CVVC 5.395454  TRUE
646      133 kwamde   CCVC 5.182154  TRUE
734      130 moeste  CVVCC 5.054174  TRUE
1654     128  zagde    CVC 4.968854  TRUE
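The logical vector returned by grepl can also be used directly, without storing it as a column, and negated with ! to get the complementary set of rows. A minimal sketch, using the first few rows of the modified PV table above; PV_small and the other variable names are illustrative.

```r
# First rows of the modified PV table, reduced to two columns
PV_small <- data.frame(
  LogFreq = c(140, 139, 138, 137, 133),
  Word    = c("blahte", "had", "aaaade", "zei", "kwamde"),
  row.names = c("1593", "482", "1681", "1662", "646"),
  stringsAsFactors = FALSE
)

matches <- grepl("(de|te)$", PV_small$Word)

ending_in_de_te     <- PV_small[matches, ]   # words ending in de or te
not_ending_in_de_te <- PV_small[!matches, ]  # all other words

ending_in_de_te
not_ending_in_de_te
```

Keeping the test in a separate vector leaves the original data frame unchanged, so either subset can be recreated later.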

answered Sep 18 '22 15:09

RossB

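Using grep on the command line. This assumes the printed table has been saved as file.txt with the fixed-width layout shown in the question, so that the last two letters of Word sit in columns 18 and 19. The -v flag inverts the match, so the command prints the lines whose word does not end in de or te; drop -v to print the matching lines instead.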

grep -xvE '.{17}(de|te).*' file.txt

answered Sep 21 '22 15:09

Ωmega
