
How Can I Deal with C# .NET TimeSpan Progressive Rounding Error while Recording Video Frame-by-Frame?

This is a neat issue, and not a "tell me what code works" question, but rather a "how do I logically handle this situation" one.

I have, in short, video + audio coming in from an IP camera via RTSP.

The video and audio are being decoded and recorded frame-by-frame to a single mp4 container, by separate threads (shown below).

The problem is that the video and audio drift progressively further out of sync over time, due to a lack of precision in the TimeSpan start and end times for each video frame.

Each video frame should have a duration of 1 / framerate = 0.0333667000333667 seconds, but (even with the FromTicks() method) the first frame gets a start time of 0.0 and an end time of 0.0333667.

I can adjust the video decoder's framerate value away from 29.97 (it pulls that value from the camera's declared framerate setting), but that only produces video that either precedes or lags the audio; it simply shifts each video mediaBuffer.StartTime and mediaBuffer.EndTime earlier or later relative to the audio.

Over time, the minuscule decimal truncation ends up pushing the video and audio out of sync: the longer the recording, the further apart the two tracks get.
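
Back-of-the-envelope, though, the truncation alone seems far too small to explain a visible desync (a quick illustrative check):

    double idealTicks = 10000000.0 / 29.97;   // 333667.0003... ticks per frame
    long storedTicks = (long)idealTicks;      // 333667 -- what TimeSpan actually stores

    // Even if the dropped fraction compounded on every single frame, that's
    // ~0.0003 ticks (0.03 ns) per frame, or about 3.6 microseconds of drift
    // per hour at 29.97 fps.
    double driftPerHourNs = (idealTicks - storedTicks) * 100 * 29.97 * 3600;
    Console.WriteLine(driftPerHourNs);        // ~3600 ns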

I don't really understand why this is happening, because rounding error shouldn't logically matter.

Even if I only had a precision of 1 second, I'd only write a video frame each second, and its placement in the timeline would be roughly where it should be, ± 1 second. That should make every subsequent frame land within the same ± 1 second of where it should be, not accumulate progressively more misplacement. I'm imagining it would look like this for each frame:

    [<-------- -1 second --------> exact frame time expected <-------- +1s -------->]
    ---------------------------------------------- recorded frame time --------

Am I missing something here?

I'm not doing "new frame start time = last frame end time, new frame end time = new frame start time + 1 / framerate". I'm actually doing "new frame start time = (frame index - 1) / framerate, new frame end time = frame index / framerate".

That is, I'm calculating the frame start and end times based on the expected time they should have (frame time = frame position / framerate).

What my code is doing is this:

    time expected ---------- time expected ---------- time expected
       frame time               frame time               frame time

I understand the issue mathematically; I just don't understand why decimal truncation is proving such a problem, or what the best solution is to fix it.

If I implement something that says "every x frames, use (1 / framerate) + some amount" to make up for all the missing time, will that allow frames to land where they should, or just result in messy video?

    public void AudioDecoderThreadProc()
    {
        TimeSpan current = TimeSpan.FromSeconds(0.0);

        while (IsRunning)
        {
            RTPFrame nextFrame = jitter.FindCompleteFrame();

            if (nextFrame == null)
            {
                System.Threading.Thread.Sleep(20);
                continue;
            }

            while (nextFrame.PacketCount > 0 && IsRunning)
            {
                RTPPacket p = nextFrame.GetNextPacket();

                if (sub.ti.MediaCapability.Codec == Codec.G711A || sub.ti.MediaCapability.Codec == Codec.G711U)
                {
                    MediaBuffer<byte> mediaBuffer = new MediaBuffer<byte>(p.DataPointer, 0, (int)p.DataSize);

                    // G.711 is one byte per sample, so packet duration = bytes / sample rate.
                    mediaBuffer.StartTime = current;
                    mediaBuffer.EndTime = current.Add(TimeSpan.FromSeconds(p.DataSize / (double)audioDecoder.SampleRate));

                    // Audio timestamps are chained: the next packet starts
                    // exactly where this one ended.
                    current = mediaBuffer.EndTime;

                    if (SaveToFile == true)
                    {
                        WriteMp4Data(mediaBuffer);
                    }
                }
            }
        }
    }
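
As a sanity check on the audio math: G.711 carries one byte per sample, so at the 8 kHz sample rate a typical 160-byte RTP payload should advance current by exactly 20 ms (the payload size here is just illustrative):

    int payloadBytes = 160;      // a common G.711 packet size
    int sampleRate = 8000;       // G.711: 8 kHz, one byte per sample

    TimeSpan packetDuration = TimeSpan.FromSeconds(payloadBytes / (double)sampleRate);
    Console.WriteLine(packetDuration);   // 00:00:00.0200000 -- i.e. 20 ms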

    public void VideoDecoderThreadProc()
    {
        byte[] totalFrame = null;

        TimeSpan current = TimeSpan.FromSeconds(0.0);
        // One frame at 29.97 fps lasts 333,667 ticks (0.0333667 s).
        TimeSpan videoFrame = TimeSpan.FromTicks(333667);
        long frameIndex = 1;

        while (IsRunning)
        {
            // Throttle decoding if the consumer is falling behind.
            if (completedFrames.Count > 50)
            {
                System.Threading.Thread.Sleep(20);
                continue;
            }

            RTPFrame nextFrame = jitter.FindCompleteFrame();

            if (nextFrame == null)
            {
                System.Threading.Thread.Sleep(20);
                continue;
            }

            // Skip frames with missing packets rather than writing corrupt data.
            if (nextFrame.HasSequenceGaps == true)
            {
                continue;
            }

            totalFrame = new byte[nextFrame.TotalPayloadSize * 2];
            int offset = 0;

            // Reassemble the fragmented packets into a single frame buffer.
            while (nextFrame.PacketCount > 0)
            {
                byte[] fragFrame = nextFrame.GetAssembledFrame();

                if (fragFrame != null)
                {
                    fragFrame.CopyTo(totalFrame, offset);
                    offset += fragFrame.Length;
                }
            }

            // Video timestamps are derived from the frame index, not chained
            // from the previous frame: start = (index - 1) / fps,
            // end = index / fps, converted to 100 ns ticks.
            MediaBuffer<byte> mediaBuffer = new MediaBuffer<byte>(
                totalFrame,
                0,
                offset,
                TimeSpan.FromTicks(Convert.ToInt64((frameIndex - 1) / mp4TrackInfo.Video.Framerate * 10000000)),
                TimeSpan.FromTicks(Convert.ToInt64(frameIndex / mp4TrackInfo.Video.Framerate * 10000000)));

            if (SaveToFile == true)
            {
                WriteMp4Data(mediaBuffer);
            }

            lock (completedFrames)
            {
                completedFrames.Add(mediaBuffer);
            }

            frameIndex++;
        }
    }
Asked by user1518816


2 Answers

There are a couple of things you should look out for:

  1. Improper manual frame timestamping. It's usually a bad idea to calculate frame durations by hand instead of letting the driver/card/whatever give you the frame time. Stamping a frame yourself almost always leads to drift because of variable bitrates, internal computer timings, etc.

  2. Precision drift. I've run into drift when dealing with frame timestamps that are in units of milliseconds while my source timestamp was in units of nanoseconds, which required me to cast a double to a long.

    For example, I get a media time from DirectShow that is in units of nanoseconds, but my internal calculations require units of milliseconds. This means I need to convert between ns and ms, and that's where the precision loss was for me. My solution has been to keep track of any precision loss.

    What I've done in the past is keep a running "timing fraction" counter. Any time I do the division that gives me the basic timestamp for a frame (frame time / NS_PS_MS), I also add the dropped fractional part of the pre-cast timestamp to the timing fraction counter (in C++ I used the modf function). Whenever the accumulated fraction reaches a whole unit, I add it back to the casted (integer) timestamp. Basically, if you've accumulated an extra millisecond, make sure to add it to a frame. This way you can compensate for any precision drift; see the sketch after this list.

  3. Accordion effects. While over time everything may add up to the right total, and you might think that even at a one-second granularity things should match up, they won't. The audio needs to match up almost perfectly or things will sound weird. This is usually characterized by hearing the right audio coming from a person at the right time, while the lips don't line up. Over time everything is still fine, but nothing quite lines up, because you aren't rendering the frames at the right time: some frames are a little too long, some a little too short; overall everything adds up to the right spot, but nothing is the right length.
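
Here is a minimal C# sketch of the timing-fraction idea from item 2 (C# to match the question; the names are illustrative, not from any real API):

    // Accumulate the fractional ticks dropped by each truncating cast and
    // pay them back as whole ticks, so timestamps never drift from the
    // ideal timeline.
    class FractionCompensatedClock
    {
        private double fraction;   // fractional ticks lost to earlier casts
        private long position;     // current integer position on the timeline

        // Advance by an ideal (possibly fractional) duration in ticks and
        // return the new integer position.
        public long Advance(double idealDurationTicks)
        {
            long whole = (long)idealDurationTicks;   // truncating cast drops the fraction
            fraction += idealDurationTicks - whole;  // remember what was dropped

            if (fraction >= 1.0)                     // a whole tick has accumulated
            {
                whole++;
                fraction -= 1.0;
            }

            position += whole;
            return position;
        }
    }

At 29.97 fps the ideal duration is 10,000,000 / 29.97 ≈ 333,667.0003 ticks, so a clock like this pays back one extra tick roughly every 3,000 frames instead of letting those fractions vanish.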

Now, as for why you're running into this when your precision is already at the 100-nanosecond level: it sounds to me like it's probably item 1. I would validate that you're calculating the right end timestamp before moving on.

I also sometimes run tests where I sum up the deltas between frames and make sure everything adds up correctly. The sum of the time between frames for the duration of your stream should equal the time it's been streaming: e.g. frame 1 is 33 ms long and frame 2 is 34 ms long, so you should have recorded for 67 ms. If you recorded for 70 ms, you lost something somewhere. Drifts usually show up after a few hours and are easier to detect by ear/eye when matching audio and video together.
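
Something like this sketch captures that test (MediaBuffer&lt;byte&gt; and its StartTime/EndTime come from the question; the rest is illustrative):

    using System;
    using System.Collections.Generic;

    static class StreamChecks
    {
        // The per-frame durations should sum to the wall-clock length of the
        // recording, within a small tolerance (say, one frame duration).
        public static bool DurationsAddUp(IEnumerable<MediaBuffer<byte>> frames,
                                          TimeSpan recordingLength,
                                          TimeSpan tolerance)
        {
            TimeSpan sum = TimeSpan.Zero;
            foreach (var f in frames)
                sum += f.EndTime - f.StartTime;   // duration of each frame

            // If the recording ran longer than the summed durations,
            // time was lost somewhere between frames.
            return (recordingLength - sum).Duration() <= tolerance;
        }
    }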

Also, to counter Hans's comment: the audio engineering world has plenty to say on this. 10 ms is plenty to hear latency, especially when paired with video feedback. You may not be able to see 10 ms of latency, but you can definitely hear it. From http://helpx.adobe.com/audition/kb/troubleshoot-recording-playback-monitoring-audition.html:

> General guidelines that apply to latency times
>
> Less than 10 ms - allows real-time monitoring of incoming tracks including effects.
>
> At 10 ms - latency can be detected but can still sound natural and is usable for monitoring.
>
> 11-20 ms - monitoring starts to become unusable, smearing of the actual sound source and the monitored output is apparent.
>
> 20-30 ms - delayed sound starts to sound like an actual delay rather than a component of the original signal.

I've sort of ranted on here, but there are a lot of things at play.

Answered by devshorts


One thing that stands out is that your framerate calculation is wrong.

> It should be a duration of 1 / framerate = 0.0333667000333667 for each video frame

That's what you get when you use 29.97 as the framerate, but 29.97 is merely a display value. The actual framerate is 30 / 1.001 = 29.97002997002997 FPS, so one frame lasts 1 / (30 / 1.001) = 0.0333666666666667 seconds. Source: see '60i'.
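
In 100-nanosecond ticks (TimeSpan's unit), the difference looks like this (a quick illustrative check, not code from the question):

    const double TicksPerSecond = 10000000.0;

    double displayRate = 29.97;          // the value the camera reports
    double actualRate = 30.0 / 1.001;    // true NTSC rate, 29.97002997... fps

    Console.WriteLine(TicksPerSecond / displayRate);  // ~333667.0003 ticks per frame
    Console.WriteLine(TicksPerSecond / actualRate);   // ~333666.6667 ticks per frame

    // Stamping frames with the 29.97-based duration makes each frame ~0.33 ticks
    // (about 33 ns) too long against the true rate -- roughly 3.6 ms of drift
    // per hour, which keeps growing the longer the recording runs.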

Answered by ErikHeemskerk