Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SSE 4.2 CSV file parsing

I'm currently investigating how to use the SSE 4.2 String and Text Processing Instructions STTNI (http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/) for efficient CSV file parsing.

My question is if this has been tried before for CSV file/in-memory CSV parsing and if examples are available online? So far I was not successful in finding good resources (except the Intel article mentioned above) on how to use SSE 4.2 for text parsing.

The current strategy I'm trying is to, for each 16 bytes, create 4 bitmasks:

  • one matching each character against the delimiter
  • one matching each character against the newline character
  • one matching each character against the quotation character (strings); and
  • one matching each character against the escape character (escaping delimiter, newlines, quotes)

with the information gained by the bitmasks it is easy to determine the offsets and lengths for each value in the CSV.

like image 510
muehlbau Avatar asked Sep 28 '12 12:09

muehlbau


1 Answers

Why are you using the bitmasks? Wouldn't it be better to check for all of those events with a single STTNI instruction and then use the returned index to process the event returned (if any)?

(edit) let me try to be more helpful...

(I'll assume you are using null terminated strings of 8-bit chars. Let me know if that is not the case.)

I think you'd do better to put the delimiter, the newline, the quotation and the escape into a single register (as a null terminated string) and use PCMPISTRI instead of PCMPISTRM using each value. For the control word you'll want to indicate: Unsigned bytes, Equal Any, Positive Polarity, Least. (Pretty sure I got that right.)

You can then use JA to simultaneously check to see if any of the 4 special characters were hit or the end of the string was reached. If so, escape the loop to deal with it. If not, add ECX to the xmm2/m128 pointer and jump back to the PCMPISTRI.

First instruction of code to deal with a "hit" is to add ECX to xmm2/m128 pointer, then process each possibility in turn. I suggest ordering them from most likely to least.

So, the asm should end up looking something like:

  XOR       ECX, ECX  

TAG1:
    ADD       EAX, ECX  
    PCMPISTRI XMM1, [EAX], 0x0     ; also writes ECX = index
    JA        TAG1  

ADD       EAX, ECX  
CMP       BYTE PTR[EAX], "delimiter"  
JE        "handle delimiter"  
CMP       BYTE PTR[EAX], "newline"  
JE        "handle newline"  
CMP       BYTE PTR[EAX], "quotation"  
JE        "handle quotation"  
CMP       BYTE PTR[EAX], "escape"  
JE        "handle escape"  
CMP       BYTE PTR[EAX], "end of string"  
JE        "handle end of string"  

I'll let you decide what the best order for testing delimiters is. :)

When I was developing the instructions I used to be able to get the compiler to generate the asm code above using intrinsics. It's been a while since I've done work with the instructions though so not sure if the average compiler will do well or not. (would be interesting to hear what results you get.)


By the way, the mask versions of the instructions do have all kinds of uses, they just aren't the best choice for finding the first or last of something since the "I" versions of the instructions will calculate the offset for you. The mask versions are good for counting or only processing certain items among other more exotic things. right now I'm using them to count A, C G, and T's in DNA strings.

like image 122
Mike Julier Avatar answered Oct 14 '22 09:10

Mike Julier