ffmpeg delay in decoding h264

Tags: python, ffmpeg

I am taking raw RGB frames, encoding them to h264, then decoding them back to raw RGB frames.

[RGB frame] ------ encoder ------> [h264 stream] ------ decoder ------> [RGB frame]
              ^               ^                    ^               ^
        encoder_write    encoder_read        decoder_write    decoder_read

I would like to retrieve the decoded frames as soon as possible. However, it seems that there is always a one-frame delay no matter how long one waits.¹ In this example, I feed the encoder a frame every 2 seconds:

$ python demo.py 2>/dev/null
time=0 frames=1 encoder_write
time=2 frames=2 encoder_write
time=2 frames=1 decoder_read   <-- decoded output is delayed by extra frame
time=4 frames=3 encoder_write
time=4 frames=2 decoder_read
time=6 frames=4 encoder_write
time=6 frames=3 decoder_read
...

What I want instead:

$ python demo.py 2>/dev/null
time=0 frames=1 encoder_write
time=0 frames=1 decoder_read   <-- decode immediately after encode
time=2 frames=2 encoder_write
time=2 frames=2 decoder_read
time=4 frames=3 encoder_write
time=4 frames=3 decoder_read
time=6 frames=4 encoder_write
time=6 frames=4 decoder_read
...

The encoder and decoder ffmpeg processes are run with the following arguments:

encoder: ffmpeg -f rawvideo -pix_fmt rgb24 -s 224x224 -i pipe: \
                -f h264 -tune zerolatency pipe:

decoder: ffmpeg -probesize 32 -flags low_delay \
                -f h264 -i pipe: \
                -f rawvideo -pix_fmt rgb24 -s 224x224 pipe:

Complete reproducible example below. No external video files needed. Just copy, paste, and run python demo.py 2>/dev/null!

import subprocess
from queue import Queue
from threading import Thread
from time import sleep, time
import numpy as np

WIDTH = 224
HEIGHT = 224
NUM_FRAMES = 256

def t(epoch=time()):
    # The default argument is evaluated once, at definition time, capturing the start time.
    return int(time() - epoch)

def make_frames(num_frames):
    x = np.arange(WIDTH, dtype=np.uint8)
    x = np.broadcast_to(x, (num_frames, HEIGHT, WIDTH))
    x = x[..., np.newaxis].repeat(3, axis=-1)
    x[..., 1] = x[:, :, ::-1, 1]
    scale = np.arange(1, len(x) + 1, dtype=np.uint8)
    scale = scale[:, np.newaxis, np.newaxis, np.newaxis]
    x *= scale
    return x

def encoder_write(writer):
    """Feeds encoder frames to encode"""
    frames = make_frames(num_frames=NUM_FRAMES)
    for i, frame in enumerate(frames):
        writer.write(frame.tobytes())
        writer.flush()
        print(f"time={t()} frames={i + 1:<3} encoder_write")
        sleep(2)
    writer.close()

def encoder_read(reader, queue):
    """Puts chunks of encoded bytes into queue"""
    while chunk := reader.read1():
        queue.put(chunk)
        # print(f"time={t()} chunk={len(chunk):<4} encoder_read")
    queue.put(None)

def decoder_write(writer, queue):
    """Feeds decoder bytes to decode"""
    while chunk := queue.get():
        writer.write(chunk)
        writer.flush()
        # print(f"time={t()} chunk={len(chunk):<4} decoder_write")
    writer.close()

def decoder_read(reader):
    """Retrieves decoded frames"""
    buffer = b""
    frame_len = HEIGHT * WIDTH * 3
    targets = make_frames(num_frames=NUM_FRAMES)
    i = 0
    while chunk := reader.read1():
        buffer += chunk
        while len(buffer) >= frame_len:
            frame = np.frombuffer(buffer[:frame_len], dtype=np.uint8)
            frame = frame.reshape(HEIGHT, WIDTH, 3)
            psnr = 10 * np.log10(255**2 / np.mean((frame.astype(int) - targets[i])**2))  # cast to avoid uint8 wraparound
            buffer = buffer[frame_len:]
            i += 1
            print(f"time={t()} frames={i:<3} decoder_read  psnr={psnr:.1f}")

cmd = (
    "ffmpeg "
    "-f rawvideo -pix_fmt rgb24 -s 224x224 "
    "-i pipe: "
    "-f h264 "
    "-tune zerolatency "
    "pipe:"
)
encoder_process = subprocess.Popen(
    cmd.split(), stdin=subprocess.PIPE, stdout=subprocess.PIPE
)

cmd = (
    "ffmpeg "
    "-probesize 32 "
    "-flags low_delay "
    "-f h264 "
    "-i pipe: "
    "-f rawvideo -pix_fmt rgb24 -s 224x224 "
    "pipe:"
)
decoder_process = subprocess.Popen(
    cmd.split(), stdin=subprocess.PIPE, stdout=subprocess.PIPE
)

queue = Queue()

threads = [
    Thread(target=encoder_write, args=(encoder_process.stdin,),),
    Thread(target=encoder_read, args=(encoder_process.stdout, queue),),
    Thread(target=decoder_write, args=(decoder_process.stdin, queue),),
    Thread(target=decoder_read, args=(decoder_process.stdout,),),
]

for thread in threads:
    thread.start()

¹ I did some testing and it seems the decoder is waiting for the next frame's NAL header 00 00 00 01 41 88 (in hex) before it decodes the current frame. One would hope that the prefix 00 00 00 01 would be enough, but it also waits for the next two bytes!
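To observe this for yourself, you can scan the encoder's output for Annex-B start codes and print each NAL unit's type. A minimal sketch (it only handles 4-byte 00 00 00 01 start codes, though streams may also use the 3-byte form; stream.264 is a hypothetical dump of the encoder output):

def iter_nal_units(data):
    """Yield (nal_unit_type, nal_bytes) for each Annex-B NAL unit."""
    start = data.find(b"\x00\x00\x00\x01")
    while start != -1:
        nxt = data.find(b"\x00\x00\x00\x01", start + 4)
        end = nxt if nxt != -1 else len(data)
        nal = data[start + 4:end]
        if nal:
            # The lower 5 bits of the first byte are nal_unit_type
            # (e.g. 1 = non-IDR slice, 5 = IDR slice, 7 = SPS, 8 = PPS, 9 = AUD).
            yield nal[0] & 0x1F, nal
        start = nxt

with open("stream.264", "rb") as f:
    for nal_type, nal in iter_nal_units(f.read()):
        print(f"type={nal_type} len={len(nal)}")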


asked Feb 29 '20 by Mateen Ulhaq


1 Answer

Add -probesize 32 to your decoder arguments.

  • Set decoder command to:

    cmd = "ffmpeg -probesize 32 -f h264 -i pipe: -f rawvideo -pix_fmt rgb24 -s 224x224 pipe:"
    

I found the solution here: How to minimize the delay in a live streaming with FFmpeg.

According to FFmpeg StreamingGuide:

Also setting -probesize and -analyzeduration to low values may help your stream start up more quickly.

After adding the -probesize 32 argument, I am getting 9 lines of Decoder written 862 bytes... instead of about 120 lines.
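Based on that quote, it may also help to set -analyzeduration 0 alongside -probesize 32. I only measured with -probesize 32, so this variant is an untested sketch:

cmd = (
    "ffmpeg "
    "-probesize 32 -analyzeduration 0 "  # both reduce startup probing
    "-f h264 -i pipe: "
    "-f rawvideo -pix_fmt rgb24 -s 224x224 "
    "pipe:"
)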


Update:

I could not find a solution, but I managed to form a simple demonstration of the problem.

Instead of using two sub-processes and 4 threads, the code sample uses one sub-process and no Python threads.

The sample uses the following "filter graph":

 _________              ______________            _________
| BMP     |            |              |          | BMP     |
| encoded |  demuxer   | encoded data |  muxer   | encoded |
| frames  | ---------> | packets      | -------> | frames  |
|_________|            |______________|          |_________|
input PIPE                                       output PIPE

See: Stream copy chapter

I figured out that in order to "push" the first frame from the input to the output, we need to write at least 4112 additional bytes from the beginning of the second frame.

Here is the code sample:

import cv2
import numpy as np
import subprocess as sp

width, height, n_frames, fps = 256, 256, 10, 1  # 10 frames, resolution 256x256, and 1 fps


def make_bmp_frame_as_bytes(i):
    """ Build synthetic image for testing, encode as BMP and convert to bytes sequence """
    p = width//50
    img = np.full((height, width, 3), 60, np.uint8)
    cv2.putText(img, str(i+1), (width//2-p*10*len(str(i+1)), height//2+p*10), cv2.FONT_HERSHEY_DUPLEX, p, (255, 30, 30), p*2)  # Blue number

    # BMP Encode img into bmp_img
    _, bmp_img = cv2.imencode(".BMP", img)
    bmp_img_bytes = bmp_img.tobytes()

    return bmp_img_bytes



# BMP in, BMP out:
process = sp.Popen(f'ffmpeg -debug_ts -probesize 32 -f bmp_pipe -framerate {fps} -an -sn -dn -i pipe: -f image2pipe -codec copy -an -sn -dn pipe:'.split(), stdin=sp.PIPE, stdout=sp.PIPE)

# Build image (number -1) before the loop.
bmp_img_bytes = make_bmp_frame_as_bytes(-1)

# Write one BMP encoded image before the loop.
process.stdin.write(bmp_img_bytes)
process.stdin.flush()

for i in range(n_frames):
    # Build image (number i).
    bmp_img_bytes = make_bmp_frame_as_bytes(i)

    # Write the first 4112 bytes of the BMP encoded image.
    # Writing 4112 bytes "pushes" the previous image through (writing fewer than 4112 bytes stalls on the first frame).
    process.stdin.write(bmp_img_bytes[0:4112])
    process.stdin.flush()

    # Read output BMP encoded image from stdout PIPE.
    buffer = process.stdout.read(width*height*3 + 54)   # BMP header is 54 bytes
    buffer = np.frombuffer(buffer, np.uint8)
    frame = cv2.imdecode(buffer, cv2.IMREAD_COLOR)  # Decode BMP image (using OpenCV).

    # Display the image
    cv2.imshow('frame', frame)
    cv2.waitKey(1000)

    # Write the next bytes of the BMP encoded image (from byte 4112 to the end).
    process.stdin.write(bmp_img_bytes[4112:])
    process.stdin.flush()


process.stdin.close()
buffer = process.stdout.read(width*height*3 + 54)   # Read last image
process.stdout.close()

# Wait for sub-process to finish
process.wait()

cv2.destroyAllWindows()
  • I have no idea why 4112 bytes.
    I used FFmpeg version 4.2.2, statically linked (ffmpeg.exe) under Windows 10.
    I didn't check whether 4112 bytes is consistent across other versions/platforms.
  • I suspect the "latency issue" is inherent to FFmpeg demuxers.
  • I could not find any argument/flag to prevent the issue.
  • The rawvideo demuxer is the only demuxer (that I found) that didn't add latency; see the sketch below.
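For comparison with the last bullet, here is a minimal rawvideo pass-through sketch (untested, based on the observation above): because every rawvideo frame has a fixed, known size, the demuxer needs no lookahead, so a frame written in should come back without waiting for any following data.

import subprocess as sp
import numpy as np

width, height = 256, 256

process = sp.Popen(
    f'ffmpeg -f rawvideo -pixel_format bgr24 -video_size {width}x{height} '
    f'-i pipe: -f rawvideo -pix_fmt bgr24 pipe:'.split(),
    stdin=sp.PIPE, stdout=sp.PIPE)

frame = np.full((height, width, 3), 60, np.uint8)
process.stdin.write(frame.tobytes())
process.stdin.flush()

# No extra frame should be needed to "push" this one through.
out = process.stdout.read(width * height * 3)
print(len(out) == width * height * 3)

process.stdin.close()
process.stdout.close()
process.wait()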

I hope this simpler sample code helps in finding a solution to the latency issue...


Update:

H.264 stream example:

The sample uses the following "filter graph":

 _________              ______________              _________ 
| H.264   |            |              |            |         |
| encoded |  demuxer   | encoded data |  decoder   | decoded |
| frames  | ---------> | packets      | ---------> | frames  |
|_________|            |______________|            |_________|
input PIPE                                         output PIPE

The code sample writes an AUD NAL unit after writing each encoded frame.

The AUD (Access Unit Delimiter) is an optional NAL unit that comes at the beginning of an encoded frame.
Apparently, writing the AUD after writing the encoded frame "pushes" the encoded frame from the demuxer to the decoder.
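For reference, the AUD bytes used below decompose as follows (a quick sanity check, nothing specific to this sample):

aud = b'\x00\x00\x00\x01\x09\x10'  # start code + NAL header + primary_pic_type
nal_unit_type = aud[4] & 0x1F      # lower 5 bits of the NAL header byte
assert nal_unit_type == 9          # 9 = Access Unit Delimiter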

Here is a code sample:

import cv2
import numpy as np
import subprocess as sp
import json

width, height, n_frames, fps = 256, 256, 100, 1  # 100 frames, resolution 256x256, and 1 fps


def make_raw_frame_as_bytes(i):
    """ Build synthetic "raw BGR" image for testing, convert the image to bytes sequence """
    p = width//60
    img = np.full((height, width, 3), 60, np.uint8)
    cv2.putText(img, str(i+1), (width//2-p*10*len(str(i+1)), height//2+p*10), cv2.FONT_HERSHEY_DUPLEX, p, (255, 30, 30), p*2)  # Blue number

    raw_img_bytes = img.tobytes()

    return raw_img_bytes


# Build input file input.264 (AVC encoded elementary stream)
################################################################################
process = sp.Popen(f'ffmpeg -y -video_size {width}x{height} -pixel_format bgr24 -f rawvideo -r {fps} -an -sn -dn -i pipe: -f h264 -g 1 -pix_fmt yuv444p -crf 10 -tune zerolatency -an -sn -dn input.264'.split(), stdin=sp.PIPE)

# -x264-params aud=1
# adds [0, 0, 0, 1, 9, 16] to the beginning of each encoded frame.
aud_bytes = b'\x00\x00\x00\x01\t\x10'  # Access Unit Delimiter
#process = sp.Popen(f'ffmpeg -y -video_size {width}x{height} -pixel_format bgr24 -f rawvideo -r {fps} -an -sn -dn -i pipe: -f h264 -g 1 -pix_fmt yuv444p -crf 10 -tune zerolatency -x264-params aud=1 -an -sn -dn input.264'.split(), stdin=sp.PIPE)

for i in range(n_frames):
    raw_img_bytes = make_raw_frame_as_bytes(i)
    process.stdin.write(raw_img_bytes) # Write raw video frame to input stream of ffmpeg sub-process.

process.stdin.close()
process.wait()
################################################################################

# Execute FFprobe and create JSON file (showing pkt_pos and pkt_size for every encoded frame):
sp.run('ffprobe -print_format json -show_frames input.264'.split(), stdout=open('input_probe.json', 'w'))

# Read FFprobe output to dictionary p
with open('input_probe.json') as f:
    p = json.load(f)['frames']


# Input PIPE: H.264 encoded video, output PIPE: decoded video frames in raw BGR video format
process = sp.Popen(f'ffmpeg -probesize 32 -flags low_delay -f h264 -framerate {fps} -an -sn -dn -i pipe: -f rawvideo -s {width}x{height} -pix_fmt bgr24 -an -sn -dn pipe:'.split(), stdin=sp.PIPE, stdout=sp.PIPE)

f = open('input.264', 'rb')

process.stdin.write(aud_bytes)  # Write AUD NAL unit before the first encoded frame.

for i in range(n_frames-1):
    # Read H.264 encoded video frame
    h264_frame_bytes = f.read(int(p[i]['pkt_size']))

    process.stdin.write(h264_frame_bytes)
    process.stdin.write(aud_bytes)  # Write AUD NAL unit after the encoded frame.
    process.stdin.flush()

    # Read decoded video frame (in raw video format) from stdout PIPE.
    buffer = process.stdout.read(width*height*3)
    frame = np.frombuffer(buffer, np.uint8).reshape(height, width, 3)

    # Display the decoded video frame
    cv2.imshow('frame', frame)
    cv2.waitKey(1)

# Write last encoded frame
h264_frame_bytes = f.read(int(p[n_frames-1]['pkt_size']))
process.stdin.write(h264_frame_bytes)

f.close()


process.stdin.close()
buffer = process.stdout.read(width*height*3)   # Read the last video frame
process.stdout.close()

# Wait for sub-process to finish
process.wait()

cv2.destroyAllWindows()

Update:

The reason for the extra frame delay is that an H.264 elementary stream doesn't have an "end of frame" signal, and there is no "payload size" field in the NAL unit header.

The only way to detect when a frame ends is to see where the next one begins.

See: Detect ending of frame in H.264 video stream.
And How to know the number of NAL unit in H.264 stream which represent a picture.

To avoid waiting for the beginning of the next frame, you must use a "transport stream" layer or a video container format.
Transport streams and a few container formats allow "end of frame" detection by the receiver (demuxer).

I tried using an MPEG-2 transport stream, but it added a delay of one more frame.
(I didn't try the RTSP protocol, because it doesn't work with pipes.)

Using the Flash Video (FLV) container reduces the delay to a single frame.
The FLV container has a "Payload Size" field in the packet header that allows the demuxer to avoid waiting for the next frame.
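To illustrate why this works, here is a sketch of parsing a single FLV tag header: the 24-bit DataSize field tells the demuxer the payload size up front, so it can consume a complete tag without peeking at the next one (illustration only; it assumes flv_bytes starts at a tag boundary, i.e. after the 9-byte file header and the 4-byte PreviousTagSize field):

def parse_flv_tag_header(flv_bytes):
    """Parse one 11-byte FLV tag header."""
    tag_type = flv_bytes[0]                            # 8=audio, 9=video, 18=script data
    data_size = int.from_bytes(flv_bytes[1:4], 'big')  # payload size, known up front
    timestamp = int.from_bytes(flv_bytes[4:7], 'big')  # milliseconds (low 24 bits)
    return tag_type, data_size, timestamp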

Commands for using FLV container and H.264 codec:

cmd = (
    "ffmpeg "
    "-f rawvideo -pix_fmt rgb24 -s 224x224 "
    "-i pipe: "
    "-vcodec libx264 "
    "-f flv "
    "-tune zerolatency "
    "pipe:"
)
encoder_process = subprocess.Popen(
    cmd.split(), stdin=subprocess.PIPE, stdout=subprocess.PIPE
)

cmd = (
    "ffmpeg "
    "-probesize 32 "
    "-flags low_delay "
    "-f flv "
    "-vcodec h264 "
    "-i pipe: "
    "-f rawvideo -pix_fmt rgb24 -s 224x224 "
    "pipe:"
)

decoder_process = subprocess.Popen(
    cmd.split(), stdin=subprocess.PIPE, stdout=subprocess.PIPE
)

In the commands above, FFmpeg uses the FLV muxer for the encoder process and the FLV demuxer for the decoder process.

Output result:

time=0 frames=1   encoder_write
time=0 frames=1   decoder_read  psnr=49.0
time=2 frames=2   encoder_write
time=2 frames=2   decoder_read  psnr=48.3
time=4 frames=3   encoder_write
time=4 frames=3   decoder_read  psnr=45.8
time=6 frames=4   encoder_write
time=6 frames=4   decoder_read  psnr=46.7

As you can see, there is no extra frame delay.

Other containers that also worked: AVI and MKV (see the variant below).
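For example, the MKV variant only swaps the container format in the commands above (the AVI variant substitutes -f avi the same way; both are sketches based on the note above, not retested here):

encoder: ffmpeg -f rawvideo -pix_fmt rgb24 -s 224x224 -i pipe: \
                -vcodec libx264 -f matroska -tune zerolatency pipe:

decoder: ffmpeg -probesize 32 -flags low_delay \
                -f matroska -vcodec h264 -i pipe: \
                -f rawvideo -pix_fmt rgb24 -s 224x224 pipe: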

answered Nov 15 '22 by Rotem