How to make a csv row for each 2 lines in a txt file

Question

I have a text file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus

And I need to get a csv file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

Later I want to use each row like a tuple to find the compressed file, read it, and write a final file with a name like:

Viruses/GCF_000837105.1/Tomato mottle virus.fna

I just need to learn how to do the first part of the problem. It could be with:

sed
awk
R
Python

Any help would be much appreciated. This is hard for me to accomplish because the original filenames are very messed up.

I have tried this:

sed -z 's/\n/,/g;s/,$/\n/' multi_headers

However, it put a comma in place of every newline, not just every second one.

Ed Morton · Accepted Answer

This works with any awk in any shell on every Unix box, and it stores only 1 line at a time in memory, so it'll work no matter how large your input file is:

$ awk '{ORS=(NR%2 ? "," : RS)} 1' file
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

There's a lot happening in a small amount of code above so here's an explanation:

ORS is the builtin variable containing the string to be printed at the end of each output record (record = line in this case), a newline by default.
RS is the builtin variable containing the string (or regexp) that separates each input record, a newline by default.
NR is the builtin variable containing the current record/line number so NR%2 is 1 for odd numbered records and 0 for even numbered.
NR%2 ? "," : RS is a ternary expression resulting in , for odd numbered lines and the value of RS (a newline by default, or whatever else you have set RS to) for even numbered.
1 is a true condition which causes the default action of printing the current record to be executed.

So the above script says "if the current line number is odd print it with a , at the end, otherwise print it with a newline at the end", hence it's joining every pair of lines with a , between.
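Since Python was one of the options in the question, the same odd/even logic can be sketched there too. This is only a minimal sketch and not part of the awk answer: the function name join_pairs and the in-memory sample are assumptions made for illustration, and a real file opened with open() could be passed in place of the StringIO object.

```python
import io
import sys


def join_pairs(lines, out, sep=","):
    """Write odd-numbered lines followed by sep and even-numbered lines followed by a newline."""
    for nr, line in enumerate(lines, start=1):  # nr plays the role of awk's NR
        record = line.rstrip("\n")
        terminator = sep if nr % 2 else "\n"  # mirrors ORS=(NR%2 ? "," : RS)
        out.write(record + terminator)


# Sample input taken from the question; only one line is held at a time.
sample = io.StringIO(
    "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz\n"
    "Sclerophthora macrospora virus A\n"
    "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz\n"
    "Influenza B virus RNA\n"
)
join_pairs(sample, sys.stdout)
```

As with the awk version, an input with an odd number of lines would end with a trailing comma and no final newline.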

benson23 · Answer

Bash

You can do a paste (thanks @glenn jackman for pointing out my previous useless use of cat).

# or cat mytext.txt | paste -d "," - -
paste -d "," - - < mytext.txt 

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
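A rough Python counterpart to paste - - is to hand the same iterator to both columns, so each column takes the next unread line. This is a sketch under assumptions: the function name paste_pairs and the in-memory sample are hypothetical, and an open file object could be used instead.

```python
import io
from itertools import zip_longest


def paste_pairs(lines, delimiter=","):
    """Yield consecutive lines joined in pairs, reading both columns from one iterator."""
    stripped = (line.rstrip("\n") for line in lines)
    # The same generator is passed twice, so the two columns consume lines
    # alternately, much as `paste - -` reads standard input for each column.
    for first, second in zip_longest(stripped, stripped, fillvalue=""):
        yield f"{first}{delimiter}{second}"


# Sample input taken from the question.
sample = io.StringIO(
    "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz\n"
    "Sclerophthora macrospora virus A\n"
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz\n"
    "Tomato mottle virus\n"
)
for row in paste_pairs(sample):
    print(row)
```

Using zip_longest with an empty fill value means an unpaired last line is kept with an empty second field rather than dropped.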

R

The R function is also paste, together with sapply:

mytext <- scan("mytext.txt", character(), sep = "\n")

sapply(seq(1, length(mytext), 2), function(x) paste(mytext[x], mytext[x + 1], sep = ","))
[1] "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A"
[2] "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA"           
[3] "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus"
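
The index-based pairing above can also be sketched in Python. Like the R version, this assumes the whole file fits in memory. It additionally hands the rows to the standard csv module, so a name that happens to contain a comma would be quoted instead of splitting the row. The function name pairs_to_csv is hypothetical, and the sample text is taken from the question.

```python
import csv
import io


def pairs_to_csv(text):
    """Return CSV text with one row per pair of consecutive lines."""
    lines = text.splitlines()
    # lines[0::2] holds the 1st, 3rd, ... lines (paths);
    # lines[1::2] holds the 2nd, 4th, ... lines (names).
    # zip stops at the shorter slice, so an unpaired last line is dropped.
    rows = zip(lines[0::2], lines[1::2])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


sample = (
    "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz\n"
    "Influenza B virus RNA\n"
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz\n"
    "Tomato mottle virus\n"
)
print(pairs_to_csv(sample), end="")
```

Reading the result back with csv.reader would then give each row as a pair of strings, which fits the tuple-like use described in the question.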

Tags: python, bash, r, sed, awk

Paulo Sergio Schlogl