I'm rather new to R (and in law school, so this is all very new to me), so apologies if this is poorly worded. I have a series of about 1500 documents that I am importing into R to categorize and analyze later. The first thing that I need to do is exclude all documents that are written in French, which are labelled with an "FR" in the title/doc.info. I was curious what kind of code I could use to exclude them before importing the files, so that I have a clean data set before analyzing anything (since French text will obviously make a mess of processes like sentiment analysis). Any help is appreciated (even if that help is explaining how to better talk about coding). Kind regards!
Edit 1: The code that I am using is readtext(folder), which you can see below:
folder<-"C:/[pathway]"
submissions<-readtext(folder)
submissions_text<-submissions$text
submission_number<- numeric()
submission_person<- factor()
submission_code<- factor()
submission_language<-factor()
submission_location<-factor()
for (submission_name in submissions$doc_id) {
  # Strip the literal ".txt" extension (the dot is escaped so it is not a regex wildcard)
  submission_name<-gsub("\\.txt$","",submission_name)
  # The number is the first piece of the name, before the first "_" or "-"
  number<-as.numeric(strsplit(submission_name, "_|-")[[1]][1])
  submission_number<-c(submission_number,number)
  # The remaining fields are separated by "_": person, code, language, location
  person<-strsplit(submission_name, "_")[[1]][2]
  submission_person<-c(submission_person, person)
  code<-strsplit(submission_name, "_")[[1]][3]
  submission_code<-c(submission_code, code)
  lang<-strsplit(submission_name, "_")[[1]][4]
  submission_language<-c(submission_language, lang)
  location<-strsplit(submission_name, "_")[[1]][5]
  submission_location<-c(submission_location, location)
}
submissions<-cbind(submissions,submission_number)
submissions<-cbind(submissions,submission_person)
submissions<-cbind(submissions,submission_code)
submissions<-cbind(submissions,submission_language)
submissions<-cbind(submissions,submission_location)
submissions<-submissions[order(submissions$submission_number, decreasing = FALSE),]
This is just the organizational aspect of my code. I am looking to hopefully exclude all of the French data before this point (but if it comes afterward, I would also be more than happy with that).
The functionality you are after can be found in the list.files() function. Its documentation is available from within R via ?list.files.
In short, your code will likely end up looking something like this:
setwd("c:/path/to/your/data/here")
files <- list.files()
non_french_files <- files[!grepl("FR", files)]
lapply(non_french_files, function(x) {
  f <- read.csv(x)
  # do stuff with f
})
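One caveat with matching "FR" anywhere in the name is that it would also drop a file whose name merely contains those letters (a person called "FRANK", say). A stricter variant, as a minimal sketch only: it assumes the file names follow the underscore-separated layout from the question (number_person_code_language_location.txt, with the language as the fourth field). The helper name is_french and the sample file names are made up for illustration.

```r
# Hypothetical helper: TRUE when the 4th underscore-separated field of the
# file name (with the .txt extension removed) is exactly "FR".
is_french <- function(file_names) {
  stems  <- sub("\\.txt$", "", basename(file_names))
  fields <- strsplit(stems, "_", fixed = TRUE)
  langs  <- vapply(
    fields,
    function(p) if (length(p) >= 4) p[4] else NA_character_,
    character(1)
  )
  # Names that do not have a 4th field are treated as "not French"
  !is.na(langs) & langs == "FR"
}

# Made-up file names in the assumed layout
files <- c(
  "1_FRANK_A_EN_Ottawa.txt",
  "2_Smith_B_FR_Montreal.txt",
  "3_Jones_C_EN_Toronto.txt"
)

non_french_files <- files[!is_french(files)]
print(non_french_files)
```

With real data, files would come from list.files(folder, full.names = TRUE), and the filtered vector could then be handed to whichever reader is in use. readtext() is documented as taking one or more file names, but that is worth confirming in ?readtext before relying on it.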
Note - you could directly leverage the pattern parameter found in list.files(), but I chose to do that in two steps in case you wanted to do something else with the French files. This also simplifies what each line of code is doing...
...good luck and welcome to R!
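If the filtering has to happen after the import instead, the language column built in the question can be used to drop rows. The following is a rough sketch under assumptions: the small data frame stands in for the real submissions object, its rows are invented, and the column name submission_language is taken from the question's code.

```r
# Stand-in for the data frame built in the question (rows are made up)
submissions <- data.frame(
  doc_id = c("1_Lee_A_EN_Ottawa.txt", "2_Roy_B_FR_Montreal.txt",
             "3_Kim_C_EN_Toronto.txt", "4_misnamed.txt"),
  text = c("first text", "deuxieme texte", "third text", "fourth text"),
  submission_language = c("EN", "FR", "EN", NA),
  stringsAsFactors = FALSE
)

# %in% never returns NA, so rows with a missing language are kept
# instead of turning into all-NA rows (which `!= "FR"` would produce).
is_fr <- submissions$submission_language %in% "FR"
submissions_no_fr <- submissions[!is_fr, ]

print(submissions_no_fr$doc_id)
```

Whether rows with a missing language should be kept or inspected separately is a judgment call; the sketch keeps them so that nothing is silently lost.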