 

Read FASTQ file into a Spark dataframe

I'm trying to read FASTQ files into Spark DataFrames, but I'm having difficulty because FASTQ is a multi-line format: each record spans four lines.

Example:

@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8

Is there a way to get this data into a Spark DataFrame like the following?

+-------------+-------------+------------+
| identifier  | sequence    | quality    |
+-------------+-------------+------------+
|seq1         |AGTCAGTCGAC  |?@@FFBFFDDH |
|seq2         |CCAGCGTCTCG  |?88ADA?BDF8 |
+-------------+-------------+------------+

Thanks for your time

John Doe asked Jan 27 '26 06:01



1 Answer

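I'd use sliding from MLlib's RDDFunctions to group the lines four at a time. This assumes every record occupies exactly four lines (no wrapped sequences):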

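import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._ // encoders for createDataset; already in scope in spark-shell

// Group the lines into non-overlapping windows of four (window size 4, step 4),
// then keep the identifier, sequence and quality lines and skip the "+" separator.
spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  case Array(id, seq, _, qual) => (id, seq, qual)
}).toDF("identifier", "sequence", "quality")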


// +----------+-----------+-----------+
// |identifier|   sequence|    quality|
// +----------+-----------+-----------+
// |     @seq1|AGTCAGTCGAC|?@@FFBFFDDH|
// |     @seq2|CCAGCGTCTCG|?88ADA?BDF8|
// +----------+-----------+-----------+
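
Note that the identifier column above still carries the leading @, while the table in the question does not. If you want to drop it and also skip windows that don't look like a FASTQ record, a small helper along these lines might do. This is only a sketch: FastqRecord and parse are hypothetical names (not part of Spark), and the checks assume a plain four-line record whose header starts with @, whose third line starts with +, and whose quality string is as long as the sequence.

```scala
// Hypothetical helper, not part of Spark or of the answer above.
// Turns one four-line FASTQ window into (identifier, sequence, quality),
// or None if the window does not look like a well-formed record.
object FastqRecord {
  def parse(lines: Array[String]): Option[(String, String, String)] =
    lines match {
      case Array(id, seq, sep, qual)
          if id.startsWith("@") && sep.startsWith("+") && seq.length == qual.length =>
        // Drop the leading "@" so the identifier matches the table in the question.
        Some((id.stripPrefix("@"), seq, qual))
      case _ =>
        // Wrong number of lines, missing markers, or mismatched lengths.
        None
    }
}
```

In the pipeline above, the map { case Array(...) => ... } step could then be swapped for something like flatMap(w => FastqRecord.parse(w).toList), so malformed windows are dropped instead of raising a MatchError.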
Alper t. Turker answered Jan 30 '26 06:01
