I have a text file like this:
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus
And I need to get a CSV file like this:
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
Later I want to use each pair like a tuple to find the compressed file, read it, and write a final file with a name like:
Viruses/GCF_000837105.1/Tomato mottle virus.fna
I just need to learn how to do the first part of the problem. It could be with:
Any help would be much appreciated. This is hard for me to accomplish because the original filenames are very messed up.
I have tried this:
sed -z 's/\n/,/g;s/,$/\n/' multi_headers
However, it replaced every \n with a comma.
Using any awk in any shell on every Unix box and only storing 1 line at a time in memory so it'll work no matter how large your input file is:
$ awk '{ORS=(NR%2 ? "," : RS)} 1' file
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
There's a lot happening in a small amount of code above, so here's an explanation:
ORS is the builtin variable containing the string to be printed at the end of each output record (record = line in this case), a newline by default.
RS is the builtin variable containing the string (or regexp) that separates each input record, a newline by default.
NR is the builtin variable containing the current record/line number, so NR%2 is 1 for odd-numbered records and 0 for even-numbered.
NR%2 ? "," : RS is a ternary expression resulting in , for odd-numbered lines, \n (or whatever else you have set RS to, e.g. \r\n) for even-numbered.
1 is a true condition which causes the default action of printing the current record to be executed.
So the above script says "if the current line number is odd print it with a , at the end, otherwise print it with a newline at the end", hence it's joining every pair of lines with a , between.
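The same alternation can be spelled out with explicit actions, which may be easier to follow than the ORS assignment. This is only a sketch of an equivalent form, not the one-liner itself; the function name join_pairs and the sample input are made up for illustration.

```shell
# Hypothetical helper: joins each odd-numbered line with the line after it.
join_pairs() {
    awk '
        NR % 2 { printf "%s,", $0; next }   # odd line: print it followed by a comma, no newline
        { print }                           # even line: print it followed by a newline
    '
}

# Made-up sample input, two path/name pairs.
printf '%s\n' 'a.fna.gz' 'Virus A' 'b.fna.gz' 'Virus B' | join_pairs
```

Like the ORS version, this keeps only one line in memory at a time.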
You can use paste (thanks @glenn jackman for pointing out my previous useless use of cat).
# or cat mytext.txt | paste -d "," - -
paste -d "," - - < mytext.txt
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
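Why the two dashes work: each - operand tells paste to take its next line from standard input, so two of them consume two consecutive lines per output row. A small sketch with made-up input (the function name pair_lines is hypothetical):

```shell
# Hypothetical helper: each "-" operand makes paste read one more line from
# standard input per output row, so two of them pair up consecutive lines.
pair_lines() {
    paste -d ',' - -
}

# Made-up sample input, two path/name pairs.
printf '%s\n' 'a.fna.gz' 'Virus A' 'b.fna.gz' 'Virus B' | pair_lines
```

If the input has an odd number of lines, the last row ends up with an empty second field, that is, a trailing comma.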
In R the function is also called paste, used together with sapply:
mytext <- scan("mytext.txt", character(), sep = "\n")
sapply(seq(1, length(mytext), 2), function(x) paste(mytext[x], mytext[x + 1], sep = ","))
[1] "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A"
[2] "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA"
[3] "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus"
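For the later step mentioned in the question, the joined file can be read back one pair at a time in the shell. The following is a rough sketch under some assumptions: the path contains a slash and no comma, and the accession (e.g. GCF_000837105.1) is the first two _-separated fields of the file name. The function name target_name is made up for illustration, and the sketch only prints the target name; it does not decompress anything.

```shell
# Hypothetical helper: prints the target name for one CSV row.
# $1 is the path column, $2 is the name column.
# Assumes the accession is the first two "_"-separated fields of the file name.
target_name() {
    dir=${1%/*}       # directory part, e.g. Viruses
    base=${1##*/}     # file name without the directory
    accession=$(printf '%s\n' "$base" | cut -d '_' -f 1,2)
    printf '%s/%s/%s.fna\n' "$dir" "$accession" "$2"
}

# read splits on the first comma only; any further commas stay in "name".
while IFS=, read -r path name; do
    target_name "$path" "$name"
done <<'EOF'
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
EOF
```

For the sample row this prints Viruses/GCF_000837105.1/Tomato mottle virus.fna, the name format given in the question.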