
How to create a binary variable for logistic regression by using key words in text variable

Tags: text, r, nlp

I have criminal sentencing data with a text variable containing phrases like "2 months jail", "14 months prison", and "12 months community supervision". I would like to run a logistic regression to estimate the odds that a defendant is sent to prison or jail rather than released to community supervision. To do that, I want to create a binary variable that is 1 for someone sent to "jail" or "prison" and 0 for anyone sent to another program.

I have tried library(qdap) without any luck. I have also tried ifelse(df$text %in% "jail", "1", "0"), but it flags only 1 observation when I know there are several thousand.

Small data sample:

data<-data.frame('caseid'=c(1,2,3),'text'=c("went to prison","went to jail","released"))

  caseid           text
1      1 went to prison
2      2   went to jail
3      3       released

I am trying to create a binary variable, sentenced, to use as the outcome of the logistic regression, like:

  caseid           text sentenced
1      1 went to prison         1
2      2   went to jail         1
3      3       released         0

Thank you for any help you can offer!

CSk9 asked May 30 '26 22:05


1 Answer

You can do the following in base R:

transform(data, sentenced = +grepl("(jail|prison)", text))
#  caseid           text sentenced
#1      1 went to prison         1
#2      2   went to jail         1
#3      3       released         0

Explanation: the pattern "(jail|prison)" matches any string that contains "jail" or "prison". grepl returns a logical vector, and the unary operator + coerces it to an integer (TRUE becomes 1, FALSE becomes 0).
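As a side note on why the %in% attempt in the question found almost nothing: %in% compares whole strings, so it is only TRUE when the entire text equals "jail". The sketch below contrasts that with a grepl-based match. It is only an illustration under some assumptions: the two extra rows, the word-boundary pattern, and ignore.case = TRUE are not from the original post, and you may want a different pattern for your real data.

```r
# Hypothetical sample data: the question's three rows plus two invented rows
# (mixed case and a community supervision sentence) for illustration.
data <- data.frame(
  caseid = 1:5,
  text = c("went to prison", "went to jail", "released",
           "14 months PRISON", "12 months community supervision"),
  stringsAsFactors = FALSE
)

# %in% tests whole-string equality, so no row here equals "jail" exactly.
exact_match <- data$text %in% "jail"

# grepl() looks for the pattern anywhere in each string.
# \\b marks a word boundary (assumed choice); ignore.case = TRUE also
# matches variants such as "PRISON" or "Jail".
pattern <- "\\b(jail|prison)\\b"
data$sentenced <- as.integer(grepl(pattern, data$text, ignore.case = TRUE))

print(data)
```

The word boundaries mean that a word like "jailed" would not be counted; drop the \\b markers if such forms should also be coded as 1. as.integer() does the same job here as the unary +, just more explicitly.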

Maurits Evers answered Jun 02 '26 13:06


