
how to edit or modify or change a single line in a large text file with R

Tags: r

i'm reading some large text files into databases with R but they contain illegal field names for the database software. the column names of the large text files are just in the first row -- is it possible to edit only that first row without cycling through every single row in the file (which seems like a waste of resources)?

here are two examples of what i'm trying to do with some example data. the first reads everything into ram - so that won't work for my large data tables. the second would work, but it's slow because it processes every line in the file.

i suppose it's important that the solution work across platforms and not require the installation of external software (aside from R packages), just because i'll be sharing this script with others and would rather not ask them to perform more steps than necessary. i'm looking for the fastest way to do this within R only :)

# create two temporary files
tf <- tempfile() ; tf2 <- tempfile()

# write the mtcars data table to a file on the disk
write.csv( mtcars , tf )

# look at the first three lines
readLines( tf , n = 3 )

# read in the entire table
z <- readLines( tf )

# make the only substitution i care about
z[1] <- gsub( 'disp' , 'newvar' , z[1] )

# write the entire table back out to the second temporary file
writeLines( z , tf2 )

# confirm the replacement
readLines( tf2 , 2 )
# done!

# # # # # # # OR

# blank out the output file
file.remove( tf2 )

# create a file connection to the text file
incon <- file( tf , "r" )

# create a second file connection to the secondary temporary file
outcon <- file( tf2 , "w" )

# read in one line at a time
while( length( one.line <- readLines( incon , 1 ) ) > 0 ){

    # make the substitution on every line
    one.line <- gsub( 'disp' , 'newvar' , one.line )

    # write each line to the second temporary file
    writeLines( one.line , outcon )
}

# close the connections
close( incon ) ; close( outcon )

# confirm the replacement
readLines( tf2 , 2 )
# done!
Anthony Damico asked Apr 08 '13 18:04




2 Answers

Why not edit just the header and then copy the rest of the file in chunks? I don't know how big the file is, so I've guessed at blocks of 10000 lines. Depending on how much memory you have, you can make the blocks bigger or smaller.

##setup
tf <- tempfile(); tf2 <- tempfile()
write.csv(mtcars,tf)

fr <- file(tf, open="rt") #open file connection to read
fw <- file(tf2, open="wt") #open file connection to write 
header <- readLines(fr,n=1) #read in header
header <- gsub( 'disp' , 'newvar' , header) #modify header    
writeLines(header,con=fw) #write header to file
while(length(body <- readLines(fr,n=10000)) > 0) {
  writeLines(body,fw) #pass rest of file in chunks of 10000
}
close(fr);close(fw) #close connections
#unlink(tf);unlink(tf2) #delete temporary files

It should be faster because R goes through the while loop once per 10000 lines instead of once per line. Additionally, gsub is called only on the header line instead of on every line. R can't edit a file "in-place", so to speak, so there is no way around reading and copying the file. If you have to do it in R, make your chunks as big as memory allows and pass the file through.
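The same idea can be wrapped in a small function so the pattern, replacement and chunk size are easy to change. This is only a sketch: the name replace_header and its arguments are made up for illustration, and it uses sub with fixed = TRUE, so the pattern is treated as a literal string and only its first occurrence in the header is replaced.

```r
# Hypothetical helper (name and arguments are assumptions, not an existing API):
# copy `infile` to `outfile`, changing only the first line.
replace_header <- function(infile, outfile, pattern, replacement,
                           chunk_lines = 10000L) {
  fr <- file(infile, open = "rt")
  on.exit(close(fr), add = TRUE)
  fw <- file(outfile, open = "wt")
  on.exit(close(fw), add = TRUE)

  # read, modify and write the header line only
  header <- readLines(fr, n = 1L)
  writeLines(sub(pattern, replacement, header, fixed = TRUE), fw)

  # pass the rest of the file through untouched, chunk_lines lines at a time
  while (length(body <- readLines(fr, n = chunk_lines)) > 0) {
    writeLines(body, fw)
  }
  invisible(outfile)
}

# usage with the example data from the question
tf <- tempfile(); tf2 <- tempfile()
write.csv(mtcars, tf)
replace_header(tf, tf2, "disp", "newvar")
readLines(tf2, n = 2)
```

The on.exit calls close both connections even if something fails partway through the copy.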

I saw roughly a 3x difference in elapsed time between the chunked approach and the line-by-line approach from the question:

#test file creation ~3M lines
tf <- tempfile(); tf2 <- tempfile()
fw <- file(tf,open="wt")
sapply(1:1e6,function(x) write.csv(mtcars,fw))
close(fw)

#my way
system.time({
fr <- file(tf, open="rt") #open file connection to read
fw <- file(tf2, open="wt") #open file connection to write 
header <- readLines(fr,n=1) #read in header
header <- gsub( 'disp' , 'newvar' , header) #modify header    
writeLines(header,con=fw) #write header to file
while(length(body <- readLines(fr,n=10000)) > 0) {
  writeLines(body,fw) #pass rest of file in chunks of 10000
}
close(fr);close(fw) #close connections
})    
#   user  system elapsed 
#  32.96    1.69   34.85 

#OP's way
system.time({
incon <- file( tf , "r" )
outcon <- file( tf2 , "w" )
while( length( one.line <- readLines( incon , 1 ) ) > 0 ){
    one.line <- gsub( 'disp' , 'newvar' , one.line )
    writeLines( one.line , outcon )
}
close( incon ) ; close( outcon )
})
#   user  system elapsed 
# 104.36    1.92  107.03 
Blue Magister answered Oct 07 '22 00:10



You're using the wrong tool for this; use a command-line tool instead. For example, with sed, something like sed -i '1 s/disp/newvar/' file should do. And if you have to do this from R, use

filename = 'myfile'
scan(pipe(paste("sed -i '1 s/disp/newvar/' ", filename, sep = "")))
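If sed may or may not be installed on the machines the script is shared with, one option is to check for it first and write to a new file instead of using -i, whose syntax differs between GNU sed and BSD/macOS sed. A rough sketch follows; sed_replace_header is a made-up name, and it assumes the search and replacement strings contain no "/" or regex metacharacters.

```r
# Hypothetical helper (an assumption, not part of the answer above):
# run sed on the first line only and send its output to a new file.
sed_replace_header <- function(infile, outfile, from = "disp", to = "newvar") {
  if (!nzchar(Sys.which("sed"))) stop("sed was not found on the PATH")

  # "1s/from/to/" applies the substitution to line 1 only
  script <- sprintf("1s/%s/%s/", from, to)

  # stdout = outfile redirects sed's output into that file
  status <- system2("sed",
                    args = c("-e", shQuote(script), shQuote(infile)),
                    stdout = outfile)
  if (status != 0) stop("sed exited with status ", status)
  invisible(outfile)
}

# usage, skipped when sed is unavailable
tf <- tempfile(); tf2 <- tempfile()
write.csv(mtcars, tf)
if (nzchar(Sys.which("sed"))) {
  sed_replace_header(tf, tf2)
  print(readLines(tf2, n = 2))
}
```

This still requires external software, so it only fits if installing sed is acceptable on every platform the script has to run on.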

Here's a Windows-specific version:

filename = 'myfile'
tf1 = tempfile()
tf2 = tempfile()

# read header, modify and write to file
header = readLines(filename, n = 1)
header = gsub('disp', 'newvar', header)
writeLines(header, tf1)

# cut the rest of the file to a separate file
scan(pipe(paste("more ", filename, " +1 > ", tf2)))

# append the two bits together
file.append(tf1, tf2)

# tf1 now has what you want
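The same "new header plus untouched remainder" idea can also be tried without any external command by copying raw bytes. The following is only a sketch under stated assumptions: replace_header_raw and chunk_bytes are invented names, the header line is assumed to end within the first chunk, and the pattern is matched as a literal string. Whether it is faster than the line-based loop would need to be measured on the real file.

```r
# Hypothetical helper (not from the answers above): copy `infile` to `outfile`
# as raw bytes, rewriting only the first line.
replace_header_raw <- function(infile, outfile, pattern, replacement,
                               chunk_bytes = 1e6) {
  incon <- file(infile, open = "rb")
  on.exit(close(incon), add = TRUE)
  outcon <- file(outfile, open = "wb")
  on.exit(close(outcon), add = TRUE)

  # read the first chunk and locate the first newline byte (0x0a)
  first <- readBin(incon, what = "raw", n = chunk_bytes)
  nl <- which(first == as.raw(10L))[1]
  if (is.na(nl)) stop("no newline found in the first chunk")

  # everything before the newline is the header (a trailing "\r" is kept as-is)
  header <- rawToChar(first[seq_len(nl - 1L)])
  header <- sub(pattern, replacement, header, fixed = TRUE)

  # write the new header, then the newline and the rest of the first chunk
  writeBin(charToRaw(header), outcon)
  writeBin(first[nl:length(first)], outcon)

  # copy the remaining bytes unchanged
  while (length(chunk <- readBin(incon, what = "raw", n = chunk_bytes)) > 0) {
    writeBin(chunk, outcon)
  }
  invisible(outfile)
}

# usage with the example data from the question
tf <- tempfile(); tf2 <- tempfile()
write.csv(mtcars, tf)
replace_header_raw(tf, tf2, "disp", "newvar")
readLines(tf2, n = 2)
```

Because the body is never split into lines or converted to character strings, line endings and encoding pass through exactly as they were in the input.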
eddi answered Oct 07 '22 01:10
