
What is the decoded output of a video codec?

Folks,

I am wondering if someone can explain to me what exactly the output of video decoding is. Let's say it is an H.264 stream in an MP4 container.

For displaying something on the screen, I guess the decoder can provide two different types of output:

  1. Point - (x, y) coordinate of the location and the (R, G, B) color for the pixel
  2. Rectangle (x, y, w, h) units for the rectangle and the (R, G, B) color to display

There is also the issue of timestamps.

Can you please enlighten me, or point me to the right link, on what is generated by a decoder and how a video client can use this information to display something on the screen?

I intend to download the VideoLAN source and examine it, but some explanation would be helpful.

Thank you in advance for your help.

Regards, Peter

asked Aug 17 '11 by Peter


1 Answer

None of the above.

Usually the output will be a stream of bytes that contains just the color data. The X,Y location is implied by the dimensions of the video.

In other words, the first three bytes might encode the color value at (0, 0), the next three bytes the value of the pixel to its right, and so on across the row. Some formats might use four-byte groups, or even a number of bits that doesn't add up to a whole byte -- for example, 5 bits for each of three color components comes to 15 bits per pixel. This might be padded to 16 bits (exactly two bytes) for efficiency, since that aligns the data in a way that CPUs can process more easily.
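
If it helps to see the "padded to 16 bits" case concretely, here is a small C sketch of packing and unpacking 5-bit components (the layout commonly known as RGB555). The helper names are just illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* RGB555: 5 bits per component, 15 bits of color plus 1 unused bit,
       so each pixel fits exactly two bytes. */
    static uint16_t pack_rgb555(uint8_t r5, uint8_t g5, uint8_t b5)
    {
        return (uint16_t)((r5 & 0x1F) << 10 | (g5 & 0x1F) << 5 | (b5 & 0x1F));
    }

    static void unpack_rgb555(uint16_t px, uint8_t *r5, uint8_t *g5, uint8_t *b5)
    {
        *r5 = (px >> 10) & 0x1F;
        *g5 = (px >> 5)  & 0x1F;
        *b5 =  px        & 0x1F;
    }

    int main(void)
    {
        uint16_t px = pack_rgb555(31, 0, 15);   /* red plus some blue */
        uint8_t r, g, b;
        unpack_rgb555(px, &r, &g, &b);
        printf("packed: 0x%04X  ->  r=%u g=%u b=%u\n", px, r, g, b);
        return 0;
    }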

When you've processed exactly as many values as the video is wide, you've reached the end of that row. When you've processed exactly as many rows as the video is high, you've reached the end of that frame.
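
To make the byte layout concrete, here is a minimal sketch assuming a tightly packed RGB24 frame with no per-row padding (real decoders often pad each row out to a wider "stride"); the function name pixel_offset is just illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Rows are laid out one after another; within a row, pixels run
       left to right, 3 bytes (R, G, B) per pixel. */
    static size_t pixel_offset(int x, int y, int width, int bytes_per_pixel)
    {
        return (size_t)(y * width + x) * bytes_per_pixel;
    }

    int main(void)
    {
        int width = 640, height = 480, bpp = 3;          /* RGB24 */
        size_t frame_size = (size_t)width * height * bpp;

        printf("frame size: %zu bytes\n", frame_size);   /* 921600 */
        printf("offset of pixel (10, 2): %zu\n",
               pixel_offset(10, 2, width, bpp));         /* (2*640+10)*3 = 3870 */
        return 0;
    }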

As for the interpretation of those bytes, that depends on the color space used by the codec. Common color spaces are YUV, RGB, and HSL/HSV.

Which color space you get depends strongly on the codec in use and what format(s) it supports; a decoder's output format is usually restricted to the same set of formats that were acceptable as input when encoding.
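
Since YUV output is so common, here is a rough sketch of the per-pixel arithmetic for converting one full-range BT.601 YUV sample to RGB. Real players convert whole frames at once, usually with SIMD or on the GPU, and the exact coefficients vary by standard; this only shows the idea:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t clamp8(double v)
    {
        if (v < 0.0)   return 0;
        if (v > 255.0) return 255;
        return (uint8_t)(v + 0.5);
    }

    /* Full-range BT.601 (JFIF) conversion for a single sample. */
    static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                           uint8_t *r, uint8_t *g, uint8_t *b)
    {
        double yf = y, uf = u - 128.0, vf = v - 128.0;
        *r = clamp8(yf + 1.402    * vf);
        *g = clamp8(yf - 0.344136 * uf - 0.714136 * vf);
        *b = clamp8(yf + 1.772    * uf);
    }

    int main(void)
    {
        uint8_t r, g, b;
        yuv_to_rgb(76, 85, 255, &r, &g, &b);   /* approximately pure red */
        printf("R=%u G=%u B=%u\n", r, g, b);   /* prints R=254 G=0 B=0 */
        return 0;
    }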

Timestamp data is a bit more complex, since it can be encoded in the video stream itself, or in the container. At a minimum, the stream needs a framerate; from that, the time of each frame can be determined by counting how many frames have been decoded already. Other approaches, like the one taken by AVI, include a byte offset for every Nth frame (or just the keyframes) at the end of the file to enable rapid seeking. (Otherwise, you would need to decode every frame up to the timestamp you're looking for in order to determine where in the file that frame is.)
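
As a tiny illustration of the "count frames and divide by the framerate" approach (real containers usually express timestamps as integer ticks of a time base, but the arithmetic is the same):

    #include <stdio.h>

    /* Presentation time of frame n when only a frame rate is known. */
    static double frame_time_seconds(long frame_index, double fps)
    {
        return frame_index / fps;
    }

    int main(void)
    {
        double fps = 29.97;   /* a common NTSC-style frame rate */
        printf("frame 0   -> %.3f s\n", frame_time_seconds(0, fps));
        printf("frame 300 -> %.3f s\n", frame_time_seconds(300, fps));  /* ~10.010 s */
        return 0;
    }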

And if you're considering audio data too, note that with most codecs and containers, the audio and video streams are independent and know nothing about each other. During encoding, the software that writes both streams into the container format performs a process called muxing (multiplexing). It writes out the data in chunks of N seconds each, alternating between streams. This allows whoever is reading the stream to get N seconds of video, then N seconds of audio, then another N seconds of video, and so on. (More than one audio stream might be included too -- this technique is frequently used to mux together a video track plus English and Spanish audio tracks into a single file containing three streams.) In fact, even subtitles can be muxed in with the other streams.
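
Here is a toy sketch, not any real container format, that just prints the interleaved order a muxer might write chunks in for one video stream and two audio streams, with N = 2 seconds per chunk:

    #include <stdio.h>

    typedef struct {
        const char *stream;   /* "video", "audio-en", "audio-es", ... */
        double      start;    /* chunk start time in seconds */
        double      length;   /* chunk duration in seconds */
    } Chunk;

    int main(void)
    {
        const char  *streams[] = { "video", "audio-en", "audio-es" };
        const double chunk_len = 2.0;   /* N = 2 seconds per chunk */
        const double total     = 6.0;   /* pretend the file is 6 s long */

        /* For each 2-second window, emit one chunk per stream before
           moving on, so a reader can consume them in presentation order. */
        for (double t = 0.0; t < total; t += chunk_len) {
            for (int s = 0; s < 3; s++) {
                Chunk c = { streams[s], t, chunk_len };
                printf("chunk: %-8s  %4.1f s .. %4.1f s\n",
                       c.stream, c.start, c.start + c.length);
            }
        }
        return 0;
    }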

answered by cdhowie