Parsing multipart HTTP form data with file upload content using Scala

Tags: parsing, scala

There are plenty of multipart/form-data file upload solutions out there, but I have not been able to find a free-standing one for Scala.

Play2 has this functionality as part of the framework, and Spray also supports multipart form data. Unfortunately, both of these appear to be fairly tightly integrated into the rest of their toolsets (I may be wrong here).

My server has been developed using Finagle (which does not currently support multipart form data), and if possible I would like to use a free-standing lib or 'roll my own' solution.

This is a typical multipart/form-data message:

--*****org.apache.cordova.formBoundary
Content-Disposition: form-data; name="value1"

First parameter content
--*****org.apache.cordova.formBoundary
Content-Disposition: form-data; name="value2"

Second parameter content
--*****org.apache.cordova.formBoundary
Content-Disposition: form-data; name="file"; filename="image.jpg"
Content-Type: image/jpeg

$%^&#$%^%#$
--*****org.apache.cordova.formBoundary--

In this example, *****org.apache.cordova.formBoundary is the form boundary, and the multipart upload contains two text parameters and one image (I truncated the image data for clarity).

If someone who knows Scala better than me can give me a bit of a rundown on how to approach parsing this content, I will be very grateful.

To start with, I thought I would quickly split the content into three by doing:

data.split("\\Q--*****org.apache.cordova.formBoundary\\E") foreach println

But execution is notably slow (update: this was due to warm-up time). Is there a more efficient way to split the parts? My strategy is to split the content into parts, and then split the parts into sub-parts. Is this a crappy approach? I've seen similar problems being solved with state machines. What is a good functional approach? Keep in mind, I'm trying to learn a proper approach to Scala while trying to solve the problem.

Update:

I really thought a solution to this problem would be a line or two in Scala. If someone stumbles over this question with a slick solution, please take the time to jot it down. From my understanding, one could parse this message using pattern matching, parser combinators, extractors or simply splitting the string. I'm trying to find the best way to solve this kind of problem, as a project I'm working on involves a lot of natural language parsing, and I need to write my own custom parsing tools. I'm getting a good understanding of Scala, but nothing beats the advice of an expert.

It's not just about solving the problem, it's about finding the best (and hopefully simplest) possible way to solve this type of problem.

Jack asked Mar 19 '12 10:03


3 Answers

I'm curious about how slow your "notably slow" actually is. I wrote the following simple little function to generate fake messages:

def generateFakeMessage(n: Int) = {
  val rand = new scala.util.Random(1L)
  val maxLines = 100
  val maxLength = 100

  (1 to n).map(i =>
    "--*****org.apache.cordova.formBoundary\n" +
    "Content-Disposition: form-data; name=\"value%d\"\n\n".format(i) +
    (0 to rand.nextInt(maxLines)).map(_ =>
      (0 to rand.nextInt(maxLength)).map(_ => rand.nextPrintableChar).mkString
    ).mkString("\n")
  ).mkString("\n") + "\n--*****org.apache.cordova.formBoundary--"
}

Next I created a reasonably large message to use for testing:

val data = generateFakeMessage(10000)

It ends up containing a little over half a million lines. Then I tried your regular expression:

data.split("\\Q--*****org.apache.cordova.formBoundary\\E").size

And it returns more or less instantaneously. You could probably tune the regular expression a bit, and there are cleaner approaches you could use if your data were an Iterable[String] over the lines of the message, but I don't think you're going to get better performance from a hand-rolled state machine for parsing one big String.
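
As a rough illustration of the line-based alternative mentioned above, here is a minimal sketch that folds over the lines and starts a new part at every delimiter line. The names LineSplitter and splitParts are made up for this example, the boundary is hard-coded from the question, and it assumes the message is plain text that has already been split into lines (real uploads use CRLF and may carry binary bodies).

```scala
object LineSplitter {
  // Boundary copied from the question's example; a real parser would take it
  // from the request's Content-Type header (assumption for this sketch).
  val boundary = "*****org.apache.cordova.formBoundary"
  private val delimiter = "--" + boundary
  private val closeDelimiter = delimiter + "--"

  /** Groups the lines of a message into parts, one Vector of lines per part.
    * Lines before the first delimiter and after the closing one are dropped.
    */
  def splitParts(lines: Iterable[String]): Vector[Vector[String]] = {
    // Fold state: (parts collected so far, whether we are currently inside a part)
    val start = (Vector.empty[Vector[String]], false)
    val (parts, _) = lines.foldLeft(start) {
      // Closing delimiter: stop collecting.
      case ((acc, _), line) if line == closeDelimiter => (acc, false)
      // Ordinary delimiter: open a new, empty part.
      case ((acc, _), line) if line == delimiter => (acc :+ Vector.empty[String], true)
      // Inside a part: append the line to the most recent part.
      case ((acc, true), line) => (acc.init :+ (acc.last :+ line), true)
      // Outside any part (preamble or epilogue): ignore the line.
      case ((acc, false), _) => (acc, false)
    }
    parts
  }
}
```

With the data value from above, this could be called as LineSplitter.splitParts(data.split("\n").toVector). Each resulting part still holds its header lines, a blank line and the body lines.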

Travis Brown answered Sep 28 '22 03:09


As a starting point, this question has two suggestions: one using a state machine, and the other using parser combinators. I'd pay special attention to the answer using parser combinators, since these provide a very easy way to build up this sort of parser. The syntax provided in Daniel's answer should adapt very easily to your situation.

Further, you can provide more specific mappings into Scala for your particular grammar if you require. Where Daniel has:

def field = (fieldName <~ ":") ~ fieldBody <~ CRLF ^^ { case name ~ body => name -> body }

you can replace this with an alternation pattern over multiple fields (contentType|contentDisposition|....) and map each of these individually into your Scala objects.
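
To show what "mapping each field into your Scala objects" might look like, here is a small sketch. It deliberately uses regex extractors from the standard library instead of parser combinators, so that it runs without extra dependencies; the same case classes could be the result types of combinator rules. The types Header, ContentDisposition, ContentType and OtherHeader and the object PartHeaders are hypothetical names for this example, not part of Daniel's answer or of any library.

```scala
object PartHeaders {
  // Hypothetical target types, invented for this illustration.
  sealed trait Header
  final case class ContentDisposition(name: String, filename: Option[String]) extends Header
  final case class ContentType(mimeType: String) extends Header
  final case class OtherHeader(name: String, value: String) extends Header

  // "Field-Name: field body" on a single line.
  private val Field = """([^:\s]+):\s*(.*)""".r
  // Quoted parameters inside a Content-Disposition body; \b keeps "name"
  // from matching inside "filename".
  private val NameParam = """\bname="([^"]*)""".r
  private val FilenameParam = """\bfilename="([^"]*)""".r

  /** Parses one header line; returns None if the line is not a header
    * (or is a Content-Disposition without a name parameter).
    */
  def parse(line: String): Option[Header] = line match {
    case Field(name, value) =>
      name.toLowerCase match {
        case "content-disposition" =>
          NameParam.findFirstMatchIn(value).map { m =>
            ContentDisposition(m.group(1), FilenameParam.findFirstMatchIn(value).map(_.group(1)))
          }
        case "content-type" => Some(ContentType(value.trim))
        case _              => Some(OtherHeader(name, value.trim))
      }
    case _ => None
  }
}
```

Callers can then pattern match on the sealed trait, for example to treat a part as a file upload only when its ContentDisposition carries a filename.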

Apologies for not having the time to write a more detailed solution here, but this should hopefully point you in the right direction!

Submonoid answered Sep 28 '22 02:09


I think that your solution:

data.split("\\Q--*****org.apache.cordova.formBoundary\\E") foreach println

which is O(n) in complexity, is the best and the simplest you can get. As Travis previously said, this operation is not slow. As always with a multipart HTTP form, you will have to parse it one way or another, and doing better than O(n) seems tricky.

Moreover, since split gives you an Array[String], which can be used like any other Scala collection, it is well suited to any further matching or processing.
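
For completeness, here is a minimal sketch of the "split into parts, then split the parts into sub-parts" strategy from the question, built on the same split call. The names MultipartSplit, Part and parse are invented for this example. It assumes the whole message fits in a String, so it is only suitable for text content; binary uploads would need to be handled as bytes.

```scala
import java.util.regex.Pattern

object MultipartSplit {
  // Hypothetical result type: raw header lines plus the body text of one part.
  final case class Part(headers: List[String], body: String)

  private def headerLines(block: String): List[String] =
    block.split("\r?\n").toList.filter(_.nonEmpty)

  /** Splits a multipart message into parts, then each part into headers and body. */
  def parse(data: String, boundary: String): List[Part] =
    data.split(Pattern.quote("--" + boundary)).toList
      .drop(1)                          // text before the first delimiter
      .takeWhile(!_.startsWith("--"))   // stop at the closing "--boundary--"
      .map { chunk =>
        // Headers and body are separated by the first blank line.
        chunk.split("\r?\n\r?\n", 2) match {
          case Array(head, body) =>
            // The line break before the next delimiter belongs to the delimiter.
            Part(headerLines(head), body.stripSuffix("\n").stripSuffix("\r"))
          case Array(head) =>
            Part(headerLines(head), "")
        }
      }
}
```

For the message in the question, MultipartSplit.parse(data, "*****org.apache.cordova.formBoundary") should yield three parts, the last one carrying both the Content-Disposition and the Content-Type header lines.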

Christopher Chiche answered Sep 28 '22 02:09