I want to use Scala to parse a .mht file, but I found that my code looks exactly like Java.
Following is a sample mht file:
From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
charset="utf-8"
type="text/html";
boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit
<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
There is a special line called the boundary, which is a separator line:
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
The first part is some information about this file, which can be ignored. It is followed by 4 blocks: the first one is an html file, and the others are jpg images stored as base64-encoded text.
If I use Java, the code looks like this:
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream(new File("test.mht")), "UTF-8"));
String line = null;
String boundary = null;
// for a block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;
while ((line = reader.readLine()) != null) {
    // first, get the boundary
    if (boundary == null) {
        if (line.trim().startsWith("boundary=\"")) {
            boundary = substringBetween(line, "\"", "\"");
        }
        continue;
    }
    if (line.equals("--" + boundary)) { // new block
        if (contentType != null) {
            // save data to a file
        }
        encoding = null;
        contentType = null;
        location = null;
        data = new ArrayList<String>();
    } else {
        if (encoding == null || contentType == null || location == null) {
            if (line.trim().startsWith("Content-Type:")) { /* get content type */ }
            // else check encoding
            // else check location
        } else {
            data.add(line);
        }
    }
}
I tried to rewrite the code in Scala, but the structure came out nearly the same; I only swapped Java syntax for Scala syntax.
Is there a Scala way to do the same work?
PS: I don't want to load the full file into memory, since the file is huge. Instead I want to read and parse it line by line.
Thanks for helping!
I'm going to explain how to build a general solution in a standard way using parser combinators. The other solution presented is much faster, but, once you understand how to do this, you can easily adapt it to other tasks.
First, what you are showing is an e-mail message. The format of such messages is defined in a bunch of RFCs. RFC-822 defines the basics of header and body; it goes into considerable detail about the headers but says nothing about the body. RFC-1521 and RFC-1522 talk about MIME and are themselves revisions of RFCs 1341 and 1342. There are many other RFCs on the subject.
The interesting thing is that they provide grammars about this stuff, so you can write parsers to decompose it correctly. Let's start with a simplified version of RFC822, pretty much ignoring all the known fields and their formats, and simply place everything in a map. I do this because the grammar is rather long, and the few lines I have here can already be compared to the ones in the RFC.
With Scala parser combinators, every rule is separated by ~ (in the RFC, just spaces separate them), and I sometimes use <~ or ~> to discard an uninteresting part of it. Also, I use ^^ to transform what was parsed into a data structure to be used.
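To make the operators concrete before the full grammar, here is a minimal sketch on a toy rule. The `KeyValue` object and its `name`, `value` and `pair` rules are made up for this illustration and are not part of any RFC. It assumes the scala-parser-combinators module is on the classpath (it has been a separate module since Scala 2.11).

```scala
import scala.util.parsing.combinator._

// Hypothetical toy grammar for illustration only: a single "name:value" pair.
object KeyValue extends RegexParsers {
  // Keep whitespace significant, as the RFC822 parser does.
  override def skipWhitespace = false

  def name: Parser[String]  = """[A-Za-z-]+""".r
  def value: Parser[String] = """[^\n]*""".r

  // `<~ ":"` matches the colon but drops it from the result,
  // `~` sequences two parsers and keeps both results,
  // `^^` maps the parsed pieces into a tuple.
  def pair: Parser[(String, String)] =
    (name <~ ":") ~ value ^^ { case n ~ v => n -> v.trim }

  def main(args: Array[String]): Unit =
    println(parseAll(pair, "Subject: Tencent IM Message"))
}
```

`parseAll` returns a `ParseResult`; on success, calling `.get` on it yields the tuple built by `^^`.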
import scala.util.parsing.combinator._

/** Object companion to RFC822, containing the Message class,
  * and extending the class so that it can be used as a parser
  */
object RFC822 extends RFC822 {
  case class Message(header: Map[String, String], text: String)
}

/**
  * Parses `message` according to RFC-822 (http://www.w3.org/Protocols/rfc822/),
  * but without breaking up the contents for each field,
  * nor identifying particular fields.
  *
  * Also, introduces "header" to convert all fields into a map.
  */
class RFC822 extends RegexParsers {
  import RFC822.Message

  override def skipWhitespace = false

  def message = (header <~ CRLF) ~ text ^^ {
    case hd ~ txt => Message(hd, txt)
  }

  // this isn't part of the RFC, but we use it to generate a map
  def header = field.* ^^ { _.toMap }

  def field = (fieldName <~ ":") ~ fieldBody <~ CRLF ^^ { case name ~ body => name -> body }

  def fieldName = """[^:\P{Graph}]+""".r

  // Recursive definition needs a type
  // Also, I use .+ on LWSPChar because it's specified for the lexer,
  // which we are not using
  def fieldBody: Parser[String] = fieldBodyContents ~ (CRLF ~> LWSPChar.+ ~> fieldBody).? ^^ {
    case a ~ Some(b) => a + " " + b // reintroduces a single LWSPChar
    case a ~ None    => a
  }

  def fieldBodyContents = ".*".r

  def CRLF = """\n""".r       // this needs to be the regex \n pattern
  def LWSPChar = " " | "\t"   // these do not need to be regex
  def text = "(?s).*".r       // (?s) makes . match newlines
}
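The `fieldBody` rule is the one that unfolds headers spanning several lines. As a side illustration of that step only, here is a minimal sketch using just the standard library rather than the combinator parser. `HeaderFolding`, `unfold` and `toMap` are names invented for this example, and the sample lines assume continuation lines start with a tab, which the flattened sample in the question does not show.

```scala
// Stdlib-only sketch of header unfolding; names are hypothetical.
object HeaderFolding {
  /** Joins lines that start with a space or tab onto the previous line. */
  def unfold(lines: Seq[String]): List[String] =
    lines.foldLeft(List.empty[String]) {
      case (prev :: rest, line) if line.startsWith(" ") || line.startsWith("\t") =>
        (prev + " " + line.trim) :: rest
      case (acc, line) =>
        line :: acc
    }.reverse

  /** Splits every unfolded line at its first colon; lines without one are dropped. */
  def toMap(lines: Seq[String]): Map[String, String] =
    unfold(lines).flatMap { line =>
      line.split(":", 2) match {
        case Array(name, body) => Some(name.trim -> body.trim)
        case _                 => None
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    // Assumed layout: continuation lines are indented with a tab.
    val sample = Seq(
      "MIME-Version: 1.0",
      "Content-Type:multipart/related;",
      "\tcharset=\"utf-8\"",
      "\ttype=\"text/html\";"
    )
    toMap(sample).foreach { case (k, v) => println(s"$k -> $v") }
  }
}
```

Unlike the combinator version, this sketch trims the field body, so it is a simplification and not a drop-in replacement for the grammar.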
Now let's deal with the content type. The specification in RFC-1521 is implemented below. I have the word type between backticks because it's a reserved word in Scala. Also, I'm making the semi-colon optional, because the sample you gave is missing one after the charset definition.
object ContentType extends ContentType {
  case class Content(`type`: String, subtype: String, parameter: Map[String, String])
}

class ContentType extends RegexParsers {
  import ContentType.Content

  // case-insensitive matching of type and subtype
  def content = ("Content-Type" ~> ":" ~> `type` <~ "/") ~ subtype ~ parameters ^^ {
    case t ~ s ~ p => Content(t, s, p)
  }

  // use this to generate a map
  // *** SEMI-COLON IS NOT OPTIONAL ***
  // I'm making it optional because the example is missing one
  def parameters = (";".? ~> parameter).* ^^ (_.toMap)

  // All values case-insensitive
  def `type` = ( "(?i)application".r | "(?i)audio".r
               | "(?i)image".r       | "(?i)message".r
               | "(?i)multipart".r   | "(?i)text".r
               | "(?i)video".r       | extensionToken
               )

  def extensionToken = xToken | ianaToken
  def ianaToken = failure("IANA token not implemented")
  def xToken = """(?i)x-(?!\s)""".r ~ token ^^ { case a ~ b => a + b }

  def subtype = token
  def parameter = (attribute <~ "=") ~ value ^^ { case a ~ b => a -> b }
  def attribute = token // case-insensitive
  def value = token | quotedString

  def token: Parser[String] = not(tspecials) ~> """\p{Graph}""".r ~ token.? ^^ {
    case a ~ Some(b) => a + b
    case a ~ None    => a
  }

  // Must be in quoted-string,
  // to use within parameter values
  def tspecials = ( "(" | ")" | "<" | ">" | "@"
                  | "," | ";" | ":" | "\\" | "\""
                  | "/" | "[" | "]" | "?" | "="
                  )

  // These are part of RFC822
  def qtext = """[^\\"\n]""".r
  def quotedPair = """\\.""".r
  def quotedString = "\"" ~> (qtext|quotedPair).* <~ "\"" ^^ { _.mkString }
}
We can now use this to parse the text.
object Parser {
  def apply(email: String): Option[(Map[String, String], List[String])] = {
    import RFC822._
    parseAll(message, email) match {
      case Success(result, _) =>
        if (result.header.contains("Content-Type")) Some(getParts(result))
        else Some(result.header -> List(result.text))
      case _ => None
    }
  }

  def getParts(message: RFC822.Message): (Map[String, String], List[String]) = {
    import ContentType._
    parseAll(content, "Content-Type: " + message.header("Content-Type")) match {
      case Success(Content("multipart", _, parameters), _) =>
        // The ^.* part eats starting characters; it doesn't seem to be
        // as spec'ed, but the sample has two extra dashes at the start
        // of the line. (?m) makes ^ match at the start of every line,
        // not only at the start of the text.
        val parts = message.text split ("(?m)^.*?\\Q" + parameters("boundary") + "\\E")
        // Parse each part recursively and collect the bodies
        val bodies = parts flatMap this.apply flatMap (_._2)
        message.header -> bodies.toList
      case _ => message.header -> List(message.text)
    }
  }
}
You can then use it like Parser(email).
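The boundary split inside getParts can be tried in isolation. The following is a minimal sketch using only the standard library; `BoundarySplit` and `parts` are names invented for this example. It assumes multi-line mode so that `^` matches at every line start, and it additionally swallows the rest of the boundary line and trims each part, which the code above does not do.

```scala
import java.util.regex.Pattern

// Stdlib-only sketch of splitting a multipart body; names are hypothetical.
object BoundarySplit {
  /** Splits `body` at every line that contains `boundary`. */
  def parts(body: String, boundary: String): List[String] = {
    // (?m)            : ^ and $ match at line starts and ends
    // ^.*?            : eats the leading dashes in front of the boundary
    // Pattern.quote   : escapes the boundary, equivalent to \Q...\E
    // .*$             : eats a trailing "--" on the closing boundary line
    val separator = "(?m)^.*?" + Pattern.quote(boundary) + ".*$"
    body.split(separator).map(_.trim).filter(_.nonEmpty).toList
  }

  def main(args: Array[String]): Unit = {
    val body = "--XYZ\nfirst part\n--XYZ\nsecond part\n--XYZ--\n"
    parts(body, "XYZ").foreach(println)
  }
}
```

Note that this still needs the whole body as one string, so it shares the memory cost of the parser approach.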
Again, I'm not proposing you use this solution for your current problem! But learning this might help you in the future.
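For the line-by-line requirement in the question, a lazy iterator is probably closer to what is wanted. Below is a minimal sketch using only the standard library, not a definitive implementation. `MhtStream`, `Part`, `findBoundary` and `parts` are names made up here. It assumes that each part has at least one header, that part headers are not folded, and that headers end at the first blank line, as the `message` rule above expects.

```scala
// Stdlib-only streaming sketch; all names are hypothetical.
object MhtStream {
  final case class Part(headers: Map[String, String], body: Vector[String])

  private val BoundaryParam = "boundary=\"([^\"]+)\"".r

  /** Consumes lines up to and including the one that declares the boundary. */
  def findBoundary(lines: Iterator[String]): Option[String] = {
    var found: Option[String] = None
    while (found.isEmpty && lines.hasNext)
      found = BoundaryParam.findFirstMatchIn(lines.next()).map(_.group(1))
    found
  }

  /** Lazily groups the remaining lines into parts, buffering one part at a time. */
  def parts(lines: Iterator[String], boundary: String): Iterator[Part] = {
    val delimiter = "--" + boundary
    val it = lines.buffered
    def atDelimiter = it.head.startsWith(delimiter)

    // Skip whatever precedes the first delimiter line.
    while (it.hasNext && !atDelimiter) it.next()

    new Iterator[Part] {
      def hasNext: Boolean = {
        // Drop delimiter lines and stray blank lines between parts.
        while (it.hasNext && (atDelimiter || it.head.trim.isEmpty)) it.next()
        it.hasNext
      }

      def next(): Part = {
        if (!hasNext) throw new NoSuchElementException("no more parts")
        val chunk = Vector.newBuilder[String]
        while (it.hasNext && !atDelimiter) chunk += it.next()
        // Headers run up to the first blank line; the rest is the body.
        val (head, rest) = chunk.result().span(_.trim.nonEmpty)
        val headers = head.flatMap { line =>
          line.split(":", 2) match {
            case Array(k, v) => Some(k.trim -> v.trim)
            case _           => None
          }
        }.toMap
        Part(headers, rest.drop(1))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // In-memory stand-in for the lines of a real file.
    val lines = Iterator(
      "Content-Type:multipart/related;", "\tboundary=\"XYZ\"", "",
      "--XYZ", "Content-Type: text/html", "", "<html></html>",
      "--XYZ", "Content-Type:image/jpeg", "Content-Transfer-Encoding:base64", "", "/9j/4AAQ",
      "--XYZ--"
    )
    for (b <- findBoundary(lines); p <- parts(lines, b))
      println(p.headers.getOrElse("Content-Type", "?") + ": " + p.body.size + " body line(s)")
  }
}
```

For a real file, the lines could come from `scala.io.Source.fromFile("test.mht").getLines()`, which reads lazily, so memory use is bounded by the largest single part. The body lines of a base64 part can be joined with newlines and decoded with `java.util.Base64.getMimeDecoder`.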