I'm trying to encode RGBA buffers captured from an image source (desktop/camera) into raw H.264 using Windows Media Foundation, transfer them, and decode the raw H.264 frames received at the other end in real time. I'm trying to achieve at least 30 fps. The encoder works pretty well, but the decoder doesn't.
I understand Microsoft WMF MFTs buffer up to 30 frames before emitting the encoded/decoded data.
The image source emits frames only when a change occurs, not as a continuous stream of RGBA buffers, so my intention is to obtain one buffer of encoded/decoded data for each and every input buffer fed to the respective MFT, so that I can stream the data in real time and also render it.
Both the encoder and decoder are able to emit at least 10 to 15 fps when I make the image source send continuous changes (by simulating them). The encoder is able to use hardware acceleration, and I'm able to achieve up to 30 fps on the encoder side; I'm yet to implement hardware-assisted decoding using DirectX surfaces. The problem here is not the frame rate but the buffering of data by the MFTs.
So I tried to drain the decoder MFT by sending the MFT_MESSAGE_COMMAND_DRAIN command and repeatedly calling ProcessOutput until the decoder returns MF_E_TRANSFORM_NEED_MORE_INPUT. What happens now is that the decoder emits only one frame per 30 input H.264 buffers. I tested it even with a continuous stream of data, and the behavior is the same. It looks like the decoder drops all the intermediate frames in a GOP.
It would be okay for me if it buffered only the first few frames, but my decoder implementation outputs only when its buffer is full, all the time, even after the SPS and PPS parsing phase.
I came across Google's Chromium source code (https://github.com/adobe/chromium/blob/master/content/common/gpu/media/dxva_video_decode_accelerator.cc); it follows the same approach:
mpDecoder->ProcessMessage(MFT_MESSAGE_COMMAND_DRAIN, NULL);
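The full drain sequence I'm experimenting with looks roughly like this (a sketch of the usual synchronous-MFT drain pattern, reusing the mpDecoder member and my ProcessOutput wrapper from the code further below):
// Sketch: drain the decoder, then pull output until it reports it needs more input.
HRESULT DrainDecoder()
{
    HRESULT hr = mpDecoder->ProcessMessage(MFT_MESSAGE_COMMAND_DRAIN, NULL);
    while (SUCCEEDED(hr))
    {
        LONGLONG time = 0, duration = 0;
        TransformOutput dtn = {};
        hr = ProcessOutput(time, duration, dtn); // wrapper around IMFTransform::ProcessOutput
        if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
            break; // nothing left to emit
    }
    return hr;
}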
My implementation is based on https://github.com/GameTechDev/ChatHeads/blob/master/VideoStreaming/EncodeTransform.cpp
and
https://github.com/GameTechDev/ChatHeads/blob/master/VideoStreaming/DecodeTransform.cpp
My questions are: Am I missing something? Is Windows Media Foundation suitable for real-time streaming? Would draining the encoder and decoder work for real-time use cases?
There are only two options for me: make WMF work for the real-time use case, or go with something like Intel's QuickSync. I chose WMF for my POC because Windows Media Foundation implicitly supports hardware/GPU/software fallbacks in case any MFT is unavailable, and it internally chooses the best available MFT without much coding.
I'm also facing video quality issues even though the bitrate property is set to 3 Mbps, but that is the lowest priority compared to the buffering problem. I have been beating my head against the keyboard for weeks; this is proving very hard to fix. Any help would be appreciated.
Code:
Encoder setup:
IMFAttributes* attributes = 0;
HRESULT hr = MFCreateAttributes(&attributes, 0);
if (attributes)
{
//attributes->SetUINT32(MF_SINK_WRITER_DISABLE_THROTTLING, TRUE);
attributes->SetGUID(MF_TRANSCODE_CONTAINERTYPE, MFTranscodeContainerType_MPEG4);
}//end if (attributes)
// Create and set the output media type.
hr = MFCreateMediaType(&pMediaTypeOut);
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetGUID(MF_MT_SUBTYPE, cVideoEncodingFormat); // MFVideoFormat_H264
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_AVG_BITRATE, VIDEO_BIT_RATE); //18000000
}
if (SUCCEEDED(hr))
{
hr = MFSetAttributeRatio(pMediaTypeOut, MF_MT_FRAME_RATE, VIDEO_FPS, 1); // 30
}
if (SUCCEEDED(hr))
{
hr = MFSetAttributeSize(pMediaTypeOut, MF_MT_FRAME_SIZE, mStreamWidth, mStreamHeight);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_MPEG2_PROFILE, eAVEncH264VProfile_High);
}
if (SUCCEEDED(hr))
{
hr = MFSetAttributeRatio(pMediaTypeOut, MF_MT_PIXEL_ASPECT_RATIO, 1, 1);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_MAX_KEYFRAME_SPACING, 16);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(CODECAPI_AVEncCommonRateControlMode, eAVEncCommonRateControlMode_UnconstrainedVBR);//eAVEncCommonRateControlMode_Quality, eAVEncCommonRateControlMode_UnconstrainedCBR);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(CODECAPI_AVEncCommonQuality, 100);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_FIXED_SIZE_SAMPLES, FALSE);
}
if (SUCCEEDED(hr))
{
BOOL allSamplesIndependent = TRUE;
hr = pMediaTypeOut->SetUINT32(MF_MT_ALL_SAMPLES_INDEPENDENT, allSamplesIndependent);
}
if (SUCCEEDED(hr))
{
hr = pMediaTypeOut->SetUINT32(MF_MT_COMPRESSED, TRUE);
}
if (SUCCEEDED(hr))
{
hr = mpEncoder->SetOutputType(0, pMediaTypeOut, 0);
}
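For completeness, the matching input media type (not shown in the snippet above) is set after the output type; a minimal sketch, assuming the source hands the encoder YUY2 frames at the same size and nominal frame rate:
// Sketch: input media type for the encoder (assumes YUY2 frames at the same size/fps).
IMFMediaType* pMediaTypeIn = nullptr;
hr = MFCreateMediaType(&pMediaTypeIn);
if (SUCCEEDED(hr)) hr = pMediaTypeIn->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
if (SUCCEEDED(hr)) hr = pMediaTypeIn->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_YUY2);
if (SUCCEEDED(hr)) hr = pMediaTypeIn->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive);
if (SUCCEEDED(hr)) hr = MFSetAttributeSize(pMediaTypeIn, MF_MT_FRAME_SIZE, mStreamWidth, mStreamHeight);
if (SUCCEEDED(hr)) hr = MFSetAttributeRatio(pMediaTypeIn, MF_MT_FRAME_RATE, VIDEO_FPS, 1);
if (SUCCEEDED(hr)) hr = MFSetAttributeRatio(pMediaTypeIn, MF_MT_PIXEL_ASPECT_RATIO, 1, 1);
if (SUCCEEDED(hr)) hr = mpEncoder->SetInputType(0, pMediaTypeIn, 0);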
// Process the incoming sample. Ignore the timestamp & duration parameters; we just render the data in real time.
HRESULT ProcessSample(IMFSample **ppSample, LONGLONG& time, LONGLONG& duration, TransformOutput& oDtn)
{
IMFMediaBuffer *buffer = nullptr;
DWORD bufferSize;
HRESULT hr = S_FALSE;
if (ppSample)
{
hr = (*ppSample)->ConvertToContiguousBuffer(&buffer);
if (SUCCEEDED(hr))
{
buffer->GetCurrentLength(&bufferSize);
hr = ProcessInput(ppSample);
if (SUCCEEDED(hr))
{
//hr = mpDecoder->ProcessMessage(MFT_MESSAGE_COMMAND_DRAIN, NULL);
//if (SUCCEEDED(hr))
{
while (hr != MF_E_TRANSFORM_NEED_MORE_INPUT)
{
hr = ProcessOutput(time, duration, oDtn);
}
}
}
else
{
if (hr == MF_E_NOTACCEPTING)
{
while (hr != MF_E_TRANSFORM_NEED_MORE_INPUT)
{
hr = ProcessOutput(time, duration, oDtn);
}
}
}
}
}
return (hr == MF_E_TRANSFORM_NEED_MORE_INPUT ? (oDtn.numBytes > 0 ? oDtn.returnCode : hr) : hr);
}
// Finds and activates the H.264 decoder MFT (for the given subtype) if available; otherwise fails.
HRESULT FindDecoder(const GUID& subtype)
{
HRESULT hr = S_OK;
UINT32 count = 0;
IMFActivate **ppActivate = NULL;
MFT_REGISTER_TYPE_INFO info = { 0 };
UINT32 unFlags = MFT_ENUM_FLAG_HARDWARE | MFT_ENUM_FLAG_ASYNCMFT;
info.guidMajorType = MFMediaType_Video;
info.guidSubtype = subtype;
hr = MFTEnumEx(
MFT_CATEGORY_VIDEO_DECODER,
unFlags,
&info,
NULL,
&ppActivate,
&count
);
if (SUCCEEDED(hr) && count == 0)
{
hr = MF_E_TOPO_CODEC_NOT_FOUND;
}
if (SUCCEEDED(hr))
{
hr = ppActivate[0]->ActivateObject(IID_PPV_ARGS(&mpDecoder));
}
// Release the activation objects and free the array allocated by MFTEnumEx.
for (UINT32 i = 0; i < count; i++)
{
ppActivate[i]->Release();
}
CoTaskMemFree(ppActivate);
return hr;
}
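One thing I'm not sure my code handles correctly (so treat this as an assumption): MFT_ENUM_FLAG_HARDWARE returns asynchronous MFTs, and an async MFT rejects the synchronous ProcessInput/ProcessOutput calls used below until it is unlocked, roughly like this (the fully event-driven METransformNeedInput/METransformHaveOutput model is what the documentation actually recommends):
// Sketch: unlock a hardware (asynchronous) MFT before driving it synchronously.
IMFAttributes* pAttributes = nullptr;
HRESULT hr = mpDecoder->GetAttributes(&pAttributes);
if (SUCCEEDED(hr))
{
    hr = pAttributes->SetUINT32(MF_TRANSFORM_ASYNC_UNLOCK, TRUE);
    pAttributes->Release();
}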
// reconstructs the sample from encoded data
HRESULT ProcessData(char *ph264Buffer, DWORD bufferLength, LONGLONG& time, LONGLONG& duration, TransformOutput &dtn)
{
dtn.numBytes = 0;
dtn.pData = NULL;
dtn.returnCode = S_FALSE;
IMFSample *pSample = NULL;
IMFMediaBuffer *pMBuffer = NULL;
// Create a new memory buffer.
HRESULT hr = MFCreateMemoryBuffer(bufferLength, &pMBuffer);
// Lock the buffer and copy the video frame to the buffer.
BYTE *pData = NULL;
if (SUCCEEDED(hr))
hr = pMBuffer->Lock(&pData, NULL, NULL);
if (SUCCEEDED(hr))
{
memcpy(pData, ph264Buffer, bufferLength);
pMBuffer->SetCurrentLength(bufferLength);
pMBuffer->Unlock();
}
// Create a media sample and add the buffer to the sample.
if (SUCCEEDED(hr))
hr = MFCreateSample(&pSample);
if (SUCCEEDED(hr))
hr = pSample->AddBuffer(pMBuffer);
LONGLONG sampleTime = time - mStartTime;
// Set the time stamp and the duration.
if (SUCCEEDED(hr))
hr = pSample->SetSampleTime(sampleTime);
if (SUCCEEDED(hr))
hr = pSample->SetSampleDuration(duration);
hr = ProcessSample(&pSample, sampleTime, duration, dtn);
::Release(&pSample);
::Release(&pMBuffer);
return hr;
}
// Process the output sample for the decoder
HRESULT ProcessOutput(LONGLONG& time, LONGLONG& duration, TransformOutput& oDtn/*output*/)
{
IMFMediaBuffer *pBuffer = NULL;
DWORD mftOutFlags;
MFT_OUTPUT_DATA_BUFFER outputDataBuffer;
IMFSample *pMftOutSample = NULL;
MFT_OUTPUT_STREAM_INFO streamInfo;
memset(&outputDataBuffer, 0, sizeof outputDataBuffer);
HRESULT hr = mpDecoder->GetOutputStatus(&mftOutFlags);
if (SUCCEEDED(hr))
{
hr = mpDecoder->GetOutputStreamInfo(0, &streamInfo);
}
if (SUCCEEDED(hr))
{
hr = MFCreateSample(&pMftOutSample);
}
if (SUCCEEDED(hr))
{
hr = MFCreateMemoryBuffer(streamInfo.cbSize, &pBuffer);
}
if (SUCCEEDED(hr))
{
hr = pMftOutSample->AddBuffer(pBuffer);
}
if (SUCCEEDED(hr))
{
DWORD dwStatus = 0;
outputDataBuffer.dwStreamID = 0;
outputDataBuffer.dwStatus = 0;
outputDataBuffer.pEvents = NULL;
outputDataBuffer.pSample = pMftOutSample;
hr = mpDecoder->ProcessOutput(0, 1, &outputDataBuffer, &dwStatus);
}
if (SUCCEEDED(hr))
{
hr = GetDecodedBuffer(outputDataBuffer.pSample, outputDataBuffer, time, duration, oDtn);
}
if (pBuffer)
{
::Release(&pBuffer);
}
if (pMftOutSample)
{
::Release(&pMftOutSample);
}
return hr;
}
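A sketch of something my ProcessOutput above does not handle yet: the H.264 decoder returns MF_E_TRANSFORM_STREAM_CHANGE from ProcessOutput once it has parsed the SPS/PPS, and the new output type has to be accepted before any decoded frame is emitted:
// Sketch: accept the decoder's new output type when it signals a stream change.
if (hr == MF_E_TRANSFORM_STREAM_CHANGE)
{
    IMFMediaType* pNewOutputType = nullptr;
    hr = mpDecoder->GetOutputAvailableType(0, 0, &pNewOutputType);
    if (SUCCEEDED(hr))
    {
        hr = mpDecoder->SetOutputType(0, pNewOutputType, 0);
        pNewOutputType->Release();
    }
    // then retry ProcessOutput
}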
// Write the decoded sample out
HRESULT GetDecodedBuffer(IMFSample *pMftOutSample, MFT_OUTPUT_DATA_BUFFER& outputDataBuffer, LONGLONG& time, LONGLONG& duration, TransformOutput& oDtn/*output*/)
{
// ToDo: These two lines are not right. Need to work out where to get timestamp and duration from the H264 decoder MFT.
HRESULT hr = outputDataBuffer.pSample->SetSampleTime(time);
if (SUCCEEDED(hr))
{
hr = outputDataBuffer.pSample->SetSampleDuration(duration);
}
if (SUCCEEDED(hr))
{
hr = pMftOutSample->ConvertToContiguousBuffer(&pDecodedBuffer);
}
if (SUCCEEDED(hr))
{
DWORD bufLength;
hr = pDecodedBuffer->GetCurrentLength(&bufLength);
}
if (SUCCEEDED(hr))
{
byte *pEncodedYUVBuffer;
DWORD buffCurrLen = 0;
DWORD buffMaxLen = 0;
pDecodedBuffer->GetCurrentLength(&buffCurrLen);
pDecodedBuffer->Lock(&pEncodedYUVBuffer, &buffMaxLen, &buffCurrLen);
ColorConversion::YUY2toRGBBuffer(pEncodedYUVBuffer,
buffCurrLen,
mpRGBABuffer,
mStreamWidth,
mStreamHeight,
mbEncodeBackgroundPixels,
mChannelThreshold);
pDecodedBuffer->Unlock();
::Release(&pDecodedBuffer);
oDtn.pData = mpRGBABuffer;
oDtn.numBytes = mStreamWidth * mStreamHeight * 4;
oDtn.returnCode = hr; // will be S_OK..
}
return hr;
}
Update: The decoder's output is satisfactory now after enabling CODECAPI_AVLowLatencyMode, but there is a 2-second delay in the stream compared to the sender. I'm able to achieve 15 to 20 fps, which is a lot better than before. The quality deteriorates when more changes are pushed from the source to the encoder. I'm yet to implement hardware-accelerated decoding.
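For reference, this is roughly how I enable it (a sketch; CODECAPI_AVLowLatencyMode through the decoder's ICodecAPI, with the MF_LOW_LATENCY attribute as an alternative on Windows 8+):
// Sketch: enable low-latency mode on the decoder.
ICodecAPI* pCodecApi = nullptr;
HRESULT hr = mpDecoder->QueryInterface(IID_PPV_ARGS(&pCodecApi));
if (SUCCEEDED(hr))
{
    VARIANT var;
    VariantInit(&var);
    var.vt = VT_UI4;   // CODECAPI_AVLowLatencyMode is a UINT32 (VT_UI4) codec property
    var.ulVal = 1;
    hr = pCodecApi->SetValue(&CODECAPI_AVLowLatencyMode, &var);
    pCodecApi->Release();
}
// Alternative: mpDecoder->GetAttributes(&pAttr); pAttr->SetUINT32(MF_LOW_LATENCY, TRUE);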
Update 2: I figured out that the timestamp and duration settings are what affect the quality of the video if they are set improperly. The thing is, my image source does not emit frames at a constant rate, but the encoder and decoder seem to expect a constant frame rate. When I set the duration to a constant and increment the sample time in constant steps, the video quality seems better, but still not great. I don't think what I'm doing is the correct approach. Is there any way to tell the encoder and decoder about the variable frame rate?
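What seems to help for now (a sketch of an assumption, not a confirmed fix): stamping each sample with the actual capture time in 100-ns units instead of a synthetic constant step, so the timestamps stay monotonic even though the source is bursty. mLastSampleTime is a member I would add just for this:
// Sketch: derive sample time/duration from the real capture clock (100-ns units).
LONGLONG nowHns = MFGetSystemTime();                      // current time, 100-ns units
if (mStartTime == 0)
    mStartTime = nowHns;                                  // first frame defines the epoch
LONGLONG sampleTime = nowHns - mStartTime;
LONGLONG sampleDuration = sampleTime - mLastSampleTime;   // elapsed since the previous frame
if (sampleDuration <= 0)
    sampleDuration = 10 * 1000 * 1000 / VIDEO_FPS;        // fall back to the nominal frame duration
mLastSampleTime = sampleTime;
pSample->SetSampleTime(sampleTime);
pSample->SetSampleDuration(sampleDuration);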
Update 3: I'm able to get acceptable performance from both the encoder and the decoder after setting CODECAPI_AVEncMPVDefaultBPictureCount (0) and the CODECAPI_AVEncCommonLowLatency property. I'm yet to explore hardware-accelerated decoding; I hope I'll be able to achieve the best performance once hardware decoding is implemented.
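For reference, I set both through the encoder's ICodecAPI, roughly like this (a sketch; support for individual properties varies by encoder):
// Sketch: disable B-frames and request low latency on the encoder via ICodecAPI.
ICodecAPI* pCodecApi = nullptr;
HRESULT hr = mpEncoder->QueryInterface(IID_PPV_ARGS(&pCodecApi));
if (SUCCEEDED(hr))
{
    VARIANT var;
    VariantInit(&var);

    var.vt = VT_UI4;
    var.ulVal = 0;                               // no B-pictures
    pCodecApi->SetValue(&CODECAPI_AVEncMPVDefaultBPictureCount, &var);

    var.vt = VT_BOOL;
    var.boolVal = VARIANT_TRUE;                  // low-latency encoding
    pCodecApi->SetValue(&CODECAPI_AVEncCommonLowLatency, &var);

    pCodecApi->Release();
}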
The quality of the video is still poor; edges and curves are not sharp. Text looks blurred, and that's not acceptable. The quality is okay for videos and images but not for text and shapes.
Update 4:
It seems some of the color information is getting lost in the YUV subsampling phase. I tried converting the RGBA buffer to YUY2 and back; the color loss is visible but not bad. However, that loss alone is nowhere near as bad as the quality of the image rendered after the full RGBA -> YUY2 -> H264 -> YUY2 -> RGBA round trip. So it's evident that the YUY2 conversion is not the sole reason for the loss of quality; the H.264 encoder introduces further aliasing. I would still get better video quality if the H.264 encoder didn't introduce these aliasing effects. I'm going to explore the WMV codecs. The one thing that still bothers me is that the referenced code works pretty well and is able to capture the screen and save the stream to an mp4 file. The only difference is that I'm using a Media Foundation transform with MFVideoFormat_YUY2 as the input format, compared to the sink writer approach with MFVideoFormat_RGB32 as the input type in that code. I still have some hope that it is possible to achieve better quality through Media Foundation itself. The catch is that MFTEnum/ProcessInput fails if I specify MFVideoFormat_ARGB32 as the input format in MFT_REGISTER_TYPE_INFO (MFTEnum) or in SetInputType, respectively.
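One way to see which input formats the encoder MFT will actually accept (a sketch, using only the enumeration API) is to walk its advertised input types after the output type has been set:
// Sketch: list the input subtypes the H.264 encoder MFT advertises (set the output type first).
for (DWORD i = 0; ; i++)
{
    IMFMediaType* pType = nullptr;
    HRESULT hr = mpEncoder->GetInputAvailableType(0, i, &pType);
    if (FAILED(hr))
        break;   // MF_E_NO_MORE_TYPES once the list is exhausted
    GUID subtype = GUID_NULL;
    pType->GetGUID(MF_MT_SUBTYPE, &subtype);
    // Microsoft's H.264 encoder lists YUV formats (NV12, YUY2, IYUV, YV12) here, not RGB32/ARGB32.
    pType->Release();
}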
Original:
Decoded image (after RGBA -> YUY2 -> H264 -> YUY2 -> RGBA conversion):
Click to open in a new tab and view the full image so that you can see the aliasing effect.
Most consumer H.264 encoders subsample the color information to 4:2:0 (RGB to YUV). This means that before the encode process even starts, your RGB bitmap loses 75% of the color information. H.264 was designed more for natural content than for screen capture, but there are codecs specifically designed to achieve good compression for screen content, for example: https://learn.microsoft.com/en-us/windows/desktop/medfound/usingthewindowsmediavideo9screencodec. Even if you increase the bitrate of your H.264 encode, you are working with only 25% of the original color information to start with.
So your format changes look like this:
You start with 1920x1080 red, green, and blue pixels. You transform to YUV; now you have 1920x1080 luma, Cb, and Cr, where Cb and Cr are color-difference components. This is just a different way of representing colors. Then you scale the Cb and Cr planes to 1/4 of their original size, so the resulting Cb and Cr channels are around 960x540 while the luma plane is still 1920x1080. By scaling the color information from 1920x1080 down to 960x540, you are down to 25% of the original size. Then the full-size luma plane and the quarter-size color-difference channels are passed into the encoder. This level of reducing the color information is called subsampling to 4:2:0. The subsampled input is required by the encoder and is produced automatically by the media framework; there is not much you can do to escape it, other than choosing a different format.
R = red
G = green
B = blue
Y = luma (luminance)
U = blue difference (Cb)
V = red difference (Cr)
YUV is used to separate out a luma signal (Y) that can be stored with high resolution or transmitted at high bandwidth, and two chroma components (U and V) that can be bandwidth-reduced, subsampled, compressed, or otherwise treated separately for improved system efficiency. (Wikipedia)
Original format
RGB (4:4:4) 3 bytes per pixel
R R R R R R R R R R R R R R R R
G G G G G G G G G G G G G G G G
B B B B B B B B B B B B B B B B
Encoder input format - before H.264 compression
YUV (4:2:0) 1.5 bytes per pixel (6 bytes per 4 pixels)
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
UV UV UV UV
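To put numbers on it: for a 1920x1080 frame, RGB at 3 bytes per pixel is 1920 x 1080 x 3 ≈ 6.2 MB, while 4:2:0 at 1.5 bytes per pixel is 1920 x 1080 x 1.5 ≈ 3.1 MB, and all of that reduction comes out of the two chroma planes, which keep only a quarter of their original samples.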
I'm trying to understand your problem.
My program ScreenCaptureEncode uses the default Microsoft encoder settings:
From my results, I think the quality is good/acceptable.
You can change the profile/level/bitrate with MF_MT_MPEG2_PROFILE/MF_MT_MPEG2_LEVEL/MF_MT_AVG_BITRATE. As for CODECAPI_AVEncCommonQuality, it seems like you are trying to use a locally registered encoder (because you're on Win7) to set that value to 100, I guess.
But I do not think that will change things significantly.
So.
Here are three screenshots taken with the keyboard's Print Screen key:
The last two pictures are from the same encoded video file. The video player introduces aliasing when not playing in fullscreen mode. The same encoded file played in fullscreen mode is not so bad compared to the original screen, even with the default encoder settings. You should try this. I think we have to look at this more closely.
I think the aliasing comes from your video player, and from not playing in fullscreen mode.
PS: I use the MPC-HC video player.
PS2: my program needs to be improved:
EDIT
Did you read this:
CODECAPI_AVLowLatencyMode
Low-latency mode is useful for real-time communications or live capture, when latency should be minimized. However, low-latency mode might also reduce the decoding or encoding quality.
About why my program uses MFVideoFormat_RGB32 and yours uses MFVideoFormat_YUY2: by default, the SinkWriter has converters enabled. The SinkWriter converts MFVideoFormat_RGB32 to a format the H.264 encoder accepts. For the Microsoft encoder, read this: H.264 Video Encoder.
Input formats: IYUV, NV12, YUY2, YV12.
So there is no MFVideoFormat_RGB32. The SinkWriter does the conversion using the Color Converter DSP, I think.
So definitely, the problem does not come from converting RGB to YUV before encoding.
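If you want to reproduce that conversion yourself in front of a standalone encoder MFT, a minimal sketch (my guess at the setup, not tested against your pipeline) is to instantiate the Color Converter DSP directly:
// Sketch: create the Color Converter DSP to turn RGB32 into NV12 before the encoder.
#include <wmcodecdsp.h>   // CLSID_CColorConvertDMO

IMFTransform* pColorConverter = nullptr;
HRESULT hr = CoCreateInstance(CLSID_CColorConvertDMO, nullptr, CLSCTX_INPROC_SERVER,
                              IID_PPV_ARGS(&pColorConverter));
// Then set MFVideoFormat_RGB32 as the input type and MFVideoFormat_NV12 as the output type
// (both with matching MF_MT_FRAME_SIZE) and drive it with ProcessInput/ProcessOutput
// like any other synchronous MFT.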
PS (last)
As Markus Schumann said:
H.264 was more designed for natural content rather than screen capture.
He should have mentioned that the problem is particularly related to text capture.
You have just found an encoder limitation. I think no encoder is optimized for text encoding with acceptable stretching, as I mentioned regarding the video player rendering.
You see aliasing on the final video capture because it is fixed information inside the movie. Playing this movie in fullscreen (the same size as the capture) is OK.
On Windows, text is calculated according to the screen resolution, so the display is always good.
This is my last conclusion.
After so much research and effort, the problems are fixed. The color quality problem was due to the software-based color conversion (RGB to YUV in the encoder and back at the decoder), which led to aliasing. Using a hardware-accelerated color converter solved the aliasing and image quality problems.
Setting optimal values for the CODECAPI_AVEncMPVGOPSize, CODECAPI_AVEncMPVDefaultBPictureCount and CODECAPI_AVEncCommonLowLatency properties solved the buffering problems.
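For anyone hitting the same problem: one way to get such a hardware-accelerated conversion (a sketch of one option, not necessarily the exact component used everywhere) is the Video Processor MFT on Windows 8+, once it has been handed a D3D11 device manager:
// Sketch: the Video Processor MFT (Windows 8+) can do RGB32 <-> NV12 on the GPU
// once it is given a DXGI device manager (creation of the D3D11 device is omitted here).
IMFTransform* pVideoProcessor = nullptr;
HRESULT hr = CoCreateInstance(CLSID_VideoProcessorMFT, nullptr, CLSCTX_INPROC_SERVER,
                              IID_PPV_ARGS(&pVideoProcessor));
if (SUCCEEDED(hr))
{
    // pVideoProcessor->ProcessMessage(MFT_MESSAGE_SET_D3D_MANAGER, (ULONG_PTR)pDXGIDeviceManager);
    // Then set an RGB32 input type and an NV12 output type and run samples through it
    // like the software converter, with the scaling/conversion done on the GPU.
}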