How to make a csv row for each 2 lines in a txt file

Tags: python, bash, r, sed, awk

I have a text file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus

And I need to get a csv file like this:

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

Later I want to use each row as a tuple to find the compressed file, read it, and write a final file with a name like:

Viruses/GCF_000837105.1/Tomato mottle virus.fna

I just need to learn how to do the first part of the problem. It could be with:

  • sed
  • awk
  • R
  • Python

Any help would be much appreciated. This is hard for me to accomplish because the original filenames are very messy.

I have tried this:

sed -z 's/\n/,/g;s/,$/\n/' multi_headers

However, it replaces every \n with a comma, not just every second one.

Paulo Sergio Schlogl asked Oct 13 '25 03:10


2 Answers

Using any awk in any shell on every Unix box and only storing 1 line at a time in memory so it'll work no matter how large your input file is:

$ awk '{ORS=(NR%2 ? "," : RS)} 1' file
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus

There's a lot happening in a small amount of code above so here's an explanation:

  • ORS is the builtin variable containing the string to be printed at the end of each output record (record = line in this case), a newline by default.
  • RS is the builtin variable containing the string (or regexp) that separates each input record, a newline by default.
  • NR is the builtin variable containing the current record/line number so NR%2 is 1 for odd numbered records and 0 for even numbered.
  • NR%2 ? "," : RS is a ternary expression resulting in , for odd numbered lines, \n (or whatever else you have set RS to, e.g. \r\n) for even numbered.
  • 1 is a true condition which causes the default action of printing the current record to be executed.

So the above script says "if the current line number is odd print it with a , at the end, otherwise print it with a newline at the end", hence it's joining every pair of lines with a , between.
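
Since the question also lists Python as an option, here is a rough sketch of the same odd/even idea in Python. It is not part of the awk answer; the function name join_pairs and the in-memory sample are only assumptions for illustration. Like the awk script, it handles one line at a time, and if the input has an odd number of lines the last one ends with a comma and no newline.

```python
import io


def join_pairs(lines, out):
    """Write each line followed by ',' if its number is odd, else by a newline."""
    for number, line in enumerate(lines, start=1):
        # Same role as awk's ORS=(NR%2 ? "," : RS)
        terminator = "," if number % 2 else "\n"
        out.write(line.rstrip("\n") + terminator)


# Hypothetical in-memory stand-in for the input file from the question.
sample = io.StringIO(
    "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz\n"
    "Sclerophthora macrospora virus A\n"
    "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz\n"
    "Influenza B virus RNA\n"
)
result = io.StringIO()
join_pairs(sample, result)
print(result.getvalue(), end="")
```

For a real file you would pass an open file handle instead of the StringIO sample, and sys.stdout (or another open file) as out.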

Ed Morton answered Oct 14 '25 19:10


Bash

You can do a paste (thanks @glenn jackman for pointing out my previous useless use of cat).

# or cat mytext.txt | paste -d "," - -
paste -d "," - - < mytext.txt 

Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
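
A Python counterpart of paste - - could look roughly like the sketch below: zip() is handed the same iterator twice, so each row consumes two consecutive lines. The name paste_pairs is made up for this example. I use csv.writer here on the assumption that a virus name might itself contain a comma, in which case the field gets quoted. One difference from paste: if the input has an odd number of lines, zip() silently drops the last unpaired line.

```python
import csv
import io


def paste_pairs(lines, out):
    """Write consecutive line pairs as CSV rows (path, name)."""
    stripped = (line.rstrip("\n") for line in lines)
    writer = csv.writer(out, lineterminator="\n")
    # Passing the same iterator twice makes zip() take two lines per row.
    for path, name in zip(stripped, stripped):
        writer.writerow([path, name])


# Hypothetical sample; the second name contains a comma on purpose.
sample = [
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz\n",
    "Tomato mottle virus\n",
    "some/path.fna.gz\n",
    "name, with comma\n",
]
result = io.StringIO()
paste_pairs(sample, result)
print(result.getvalue(), end="")
```

If you are sure the names never contain commas or quotes, plain ",".join((path, name)) would give the same output without the csv module.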

R

The R function is also paste, together with sapply:

mytext <- scan("mytext.txt", character(), sep = "\n")

sapply(seq(1, length(mytext), 2), function(x) paste(mytext[x], mytext[x + 1], sep = ","))
[1] "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A"
[2] "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA"           
[3] "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus"   
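
For comparison, a similar whole-file approach in Python can be sketched with slicing instead of an index sequence. The helper name pair_rows is hypothetical. Note that in the R version an odd number of lines makes mytext[x + 1] evaluate to NA for the last element, which paste turns into the string "NA"; the sketch below assumes you would rather get an empty second field in that case.

```python
from itertools import zip_longest


def pair_rows(lines):
    """Join lines 1+2, 3+4, ... with a comma; lines must already be stripped."""
    paths = lines[0::2]  # 1st, 3rd, 5th, ... line
    names = lines[1::2]  # 2nd, 4th, 6th, ... line
    # fillvalue="" leaves the second field empty if a path has no name line.
    return [f"{path},{name}" for path, name in zip_longest(paths, names, fillvalue="")]


# Hypothetical sample taken from the question's input.
mytext = [
    "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz",
    "Influenza B virus RNA",
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz",
    "Tomato mottle virus",
]
for row in pair_rows(mytext):
    print(row)
```

Like the scan() call above, this holds every line in memory, which should be fine unless the input is very large.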

benson23 answered Oct 14 '25 18:10