I'm trying to understand how Android MediaExtractor parses H264 (contained in a container format).
If I examine the H264 stream, I see that it consists of NAL units demarcated by the sequence 00 00 00 01.
The samples returned by MediaExtractor are exactly those NAL units, each beginning with that marker -- except that, for the particular data source, the first three NAL units are concatenated. The first two NAL units are very short (29 and 8 bytes).
Why does that concatenation happen? If I were to parse the H264 by hand, how would I know to do that concatenation?
For the first three NAL units, the byte following the start code prefix is 103, 104, and 101 decimal. For most of the following NAL units, it's 65, and occasionally 101.
The answer follows from how an H.264 stream is structured.
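Android's decoder expects two configuration units, the Sequence Parameter Set (SPS) and the Picture Parameter Set (PPS), before any IDR or non-IDR slices (loosely referred to as I-frames and P-frames). These match the header bytes you observed: 103 (0x67) is an SPS, 104 (0x68) is a PPS, 101 (0x65) is an IDR slice, and 65 (0x41) is a non-IDR slice.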
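As a rough illustration of how those byte values map to NAL unit types, here is a minimal Python sketch that splits the one-byte NAL header into its three fields. The function name `describe_nal_header` and the small lookup table are made up for this example; only the four types discussed here are named, and everything else is reported as "other".

```python
# nal_unit_type values relevant to this question (not the full table from the spec).
NAL_TYPE_NAMES = {
    1: "non-IDR slice",
    5: "IDR slice",
    7: "SPS",
    8: "PPS",
}


def describe_nal_header(header_byte):
    """Split the first byte after the start code into (nal_ref_idc, nal_unit_type, name).

    Layout of the byte: 1 bit forbidden_zero_bit, 2 bits nal_ref_idc,
    5 bits nal_unit_type.
    """
    if header_byte >> 7:
        raise ValueError("forbidden_zero_bit is set; not a valid NAL header")
    nal_ref_idc = (header_byte >> 5) & 0x03
    nal_unit_type = header_byte & 0x1F
    name = NAL_TYPE_NAMES.get(nal_unit_type, "other")
    return nal_ref_idc, nal_unit_type, name


# The four header bytes mentioned in the question.
for value in (103, 104, 101, 65):
    print(value, describe_nal_header(value))
```

The low five bits are what matter for telling the units apart; the two `nal_ref_idc` bits only indicate whether other pictures may reference this unit.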
The first three NAL units are concatenated merely for convenience: the SPS and PPS are delivered together with the first IDR slice in a single sample. The hardware codec recognizes the two parameter sets by their NAL unit types and configures itself according to their values. The third unit, the IDR slice, is included so that the codec can start decoding as soon as this configuration is complete.
TL;DR: Decoding a raw stream like this by hand wouldn't require this grouping. Instead, you would just parse each NAL unit individually.
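To make "parse each NAL unit individually" concrete, here is a hedged Python sketch that splits a start-code-delimited byte stream into NAL units and lists their types. The helper names `split_annexb` and `nal_types` are hypothetical, and the sketch assumes the whole stream is already in memory. It accepts both the four-byte start code 00 00 00 01 and the three-byte form 00 00 01.

```python
START_CODE = b"\x00\x00\x01"  # a 4-byte start code is this with one extra leading zero


def split_annexb(stream):
    """Return the NAL units in `stream`, each without its start code."""
    units = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = stream.find(START_CODE, start)
        end = len(stream) if nxt == -1 else nxt
        # Zero bytes just before the next start code belong to that start code
        # (or are padding); a NAL unit itself never ends in 0x00.
        unit = stream[start:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        pos = nxt
    return units


def nal_types(stream):
    """Return the nal_unit_type (low 5 bits of the header byte) of each unit."""
    return [unit[0] & 0x1F for unit in split_annexb(stream)]


# Tiny synthetic stream with made-up payload bytes: SPS, PPS, IDR slice, non-IDR slice.
sample = (
    b"\x00\x00\x00\x01" + bytes([103, 1, 2])
    + b"\x00\x00\x00\x01" + bytes([104, 9])
    + b"\x00\x00\x01" + bytes([101, 5, 6])
    + b"\x00\x00\x00\x01" + bytes([65, 7])
)
print(nal_types(sample))
```

Each unit returned this way can be handled on its own, regardless of how an extractor chose to group units into samples.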