I just started using the latest build of ffmpeg into which ffmpeg-mt has been merged.
However, since my application uses TBB (Intel Threading Building Blocks), the ffmpeg-mt implementation, which creates and synchronizes its own threads, does not quite fit: it could block the TBB tasks that execute my decode functions, and it would thrash the cache unnecessarily.
I was looking around in pthread.c which seems to implement the interface which ffmpeg uses to enable multithreading.
My question is whether it would be possible to create a tbb.c which implements the same functions but using tbb tasks instead of explicit threads?
I am not experienced with C, but my guess is that it would not be easy to compile TBB (which is C++) into ffmpeg. So maybe overwriting the ffmpeg function pointers at runtime would be the way to go?
I would appreciate any suggestions or comments regarding integrating TBB with the ffmpeg threading API.
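To make the "overwrite the function pointers at runtime" idea concrete, here is a minimal sketch. It does not use ffmpeg at all: FakeContext, serial_execute, custom_execute and double_job are hypothetical stand-ins, and the signature of the execute member is only modelled on the one used in the answer below. It shows that a context struct holding a function pointer can be redirected to a replacement scheduler without recompiling the code that calls through it.

```cpp
#include <cassert>

struct FakeContext;  // hypothetical stand-in for AVCodecContext

// Job signature modelled on the callbacks passed to execute().
using JobFn = int (*)(FakeContext*, void*);
using ExecuteFn = int (*)(FakeContext*, JobFn, void*, int*, int, int);

struct FakeContext {
    int calls_through_custom = 0;  // bookkeeping for this demo only
    ExecuteFn execute = nullptr;   // the hook we want to replace at runtime
};

// Default behaviour: run every job serially on the calling thread.
int serial_execute(FakeContext* c, JobFn func, void* arg, int* ret, int count, int size)
{
    assert(count >= 0 && size >= 0);
    for (int jobnr = 0; jobnr < count; ++jobnr)
    {
        // Each job receives a pointer to its own element of the argument array.
        int r = func(c, static_cast<char*>(arg) + jobnr * size);
        if (ret)
            ret[jobnr] = r;
    }
    return 0;
}

// Replacement installed at runtime. A real version would hand the jobs to a
// task scheduler; this one only records that it was called and then delegates.
int custom_execute(FakeContext* c, JobFn func, void* arg, int* ret, int count, int size)
{
    ++c->calls_through_custom;
    return serial_execute(c, func, arg, ret, count, size);
}

// Example job: doubles the int it is pointed at.
int double_job(FakeContext*, void* arg)
{
    return 2 * *static_cast<int*>(arg);
}
```

Code that calls ctx.execute(...) never needs to know which implementation is installed, which is why the swap can be done from C++ application code rather than inside the C library.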
So I figured out how to do it by reading through the ffmpeg code.
Basically, all you have to do is include the code below and use tbb_avcodec_open/tbb_avcodec_close instead of ffmpeg's avcodec_open/avcodec_close.
This will use TBB tasks to execute decoding in parallel.
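// Author Robert Nagy
// These macros only take effect if they are defined before <stdint.h> is first
// included, so define them ahead of every other header.
#define __STDC_CONSTANT_MACROS
#define __STDC_LIMIT_MACROS

#include "tbb_avcodec.h"

#include <functional>

#include <tbb/atomic.h>
#include <tbb/parallel_for.h>
#include <tbb/task.h>

extern "C"
{
#include <libavformat/avformat.h>
}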
// Runs jobs 0..count-1 on s->thread_count TBB tasks. Each task repeatedly
// claims the next unprocessed job number from a shared atomic counter.
int task_execute(AVCodecContext* s, std::function<int(void* arg, int arg_size, int jobnr, int threadnr)>&& func, void* arg, int* ret, int count, int size)
{
    tbb::atomic<int> counter;
    counter = 0;

    // Execute s->thread_count number of tasks in parallel.
    tbb::parallel_for(0, s->thread_count, 1, [&](int threadnr)
    {
        while(true)
        {
            int jobnr = counter++; // Post-increment returns the previous value.
            if(jobnr >= count)
                break;

            int r = func(arg, size, jobnr, threadnr);
            if (ret)
                ret[jobnr] = r;
        }
    });

    return 0;
}
// Replacement for AVCodecContext::execute: each job gets its own element of
// the argument array, located at arg + jobnr*size.
int thread_execute(AVCodecContext* s, int (*func)(AVCodecContext *c2, void *arg2), void* arg, int* ret, int count, int size)
{
    return task_execute(s, [&](void* arg, int arg_size, int jobnr, int threadnr) -> int
    {
        return func(s, reinterpret_cast<uint8_t*>(arg) + jobnr*size);
    }, arg, ret, count, size);
}

// Replacement for AVCodecContext::execute2: every job gets the same argument
// plus its job number and thread number.
int thread_execute2(AVCodecContext* s, int (*func)(AVCodecContext* c2, void* arg2, int, int), void* arg, int* ret, int count)
{
    return task_execute(s, [&](void* arg, int arg_size, int jobnr, int threadnr) -> int
    {
        return func(s, arg, jobnr, threadnr);
    }, arg, ret, count, 0);
}
void thread_init(AVCodecContext* s)
{
    static const size_t MAX_THREADS = 16; // See mpegvideo.h
    static int dummy_opaque;

    s->active_thread_type = FF_THREAD_SLICE;
    s->thread_opaque = &dummy_opaque;
    s->execute = thread_execute;
    s->execute2 = thread_execute2;
    s->thread_count = MAX_THREADS; // We are using a task-scheduler, so use as many "threads/tasks" as possible.
}

void thread_free(AVCodecContext* s)
{
    s->thread_opaque = nullptr;
}
int tbb_avcodec_open(AVCodecContext* avctx, AVCodec* codec)
{
    avctx->thread_count = 1;
    if((codec->capabilities & CODEC_CAP_SLICE_THREADS) && (avctx->thread_type & FF_THREAD_SLICE))
        thread_init(avctx);
    // ff_thread_init will not be executed since thread_opaque != nullptr || thread_count == 1.
    return avcodec_open(avctx, codec);
}

int tbb_avcodec_close(AVCodecContext* avctx)
{
    thread_free(avctx);
    // ff_thread_free will not be executed since thread_opaque == nullptr.
    return avcodec_close(avctx);
}
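For readers who want to try the job-claiming loop from task_execute without building ffmpeg or TBB, the following is a minimal sketch of the same pattern using only the standard library. It is not the code above: run_jobs is a hypothetical helper, std::thread stands in for TBB tasks, and std::atomic stands in for tbb::atomic. The point it illustrates is that a shared atomic counter hands out every job number exactly once, no matter how the workers interleave.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs jobs 0..count-1 on `workers` threads. Each worker keeps claiming the
// next free job number until none are left, as in task_execute above.
int run_jobs(int workers, int count,
             const std::function<int(int jobnr, int threadnr)>& job,
             int* ret)
{
    assert(workers > 0 && count >= 0);

    std::atomic<int> next{0};
    std::vector<std::thread> pool;

    for (int threadnr = 0; threadnr < workers; ++threadnr)
    {
        pool.emplace_back([&, threadnr]
        {
            for (;;)
            {
                // fetch_add returns the previous value, so no two workers
                // can ever obtain the same job number.
                int jobnr = next.fetch_add(1);
                if (jobnr >= count)
                    break;

                int r = job(jobnr, threadnr);
                if (ret)
                    ret[jobnr] = r; // Each slot is written by one worker only.
            }
        });
    }

    for (auto& t : pool)
        t.join();

    return 0;
}
```

Because the workers pull jobs instead of receiving a fixed share up front, a slow job on one worker does not leave the others idle, which is the same load-balancing property the TBB version relies on.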
Re-posting here my response to you at the TBB forum, for the sake of anyone at SO who may be interested.
Your code in the answer above looks good to me; a clever way to use TBB in a context that was designed with native threads in mind. I wonder if it can be made even more TBBish, so to say. I have some ideas which you can try if you have time and desire.
If there is a desire or need to control the number of threads, the following can be of interest: create a tbb::task_scheduler_init (TSI) object and initialize it with as many threads as desired (not necessarily MAX_THREADS). Keep the address of this object in s->thread_opaque if possible/allowed; if not, a possible solution is a global map that maps AVCodecContext* to the address of the corresponding task_scheduler_init.
Independently of the above, another potential change is in how tbb::parallel_for is called. Instead of using it merely to create enough threads, could it not be used for its direct purpose, like below?
int task_execute(AVCodecContext* s,
                 std::function<int(void*, int, int, int)>&& func,
                 void* arg, int* ret, int count, int size)
{
    tbb::atomic<int> counter;
    counter = 0;

    // Execute 'count' number of tasks in parallel.
    tbb::parallel_for(tbb::blocked_range<int>(0, count, 2),
        [&](const tbb::blocked_range<int>& range)
    {
        // threadnr is the number of chunks in flight when this chunk started.
        // Note: two concurrently running chunks can end up with the same value.
        int threadnr = counter++;
        for (int jobnr = range.begin(); jobnr != range.end(); ++jobnr)
        {
            int r = func(arg, size, jobnr, threadnr);
            if (ret)
                ret[jobnr] = r;
        }
        --counter;
    });

    return 0;
}
This can perform better if count is significantly greater than thread_count, because a) more parallel slack means TBB works more efficiently (which you apparently know), and b) the overhead of the centralized atomic counter is spread over more iterations. Note that I selected a grain size of 2 for blocked_range; this is because the counter is both incremented and decremented inside the loop body, and so at least two iterations per task (and correspondingly, count>=2*thread_count) are necessary to "match" your variant.
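Returning to the thread-count suggestion earlier in this answer: the "global map" fallback for cases where s->thread_opaque cannot be used could look roughly like the sketch below. Everything in it is an assumption made for illustration: FakeCodecContext stands in for AVCodecContext, FakeSchedulerInit stands in for tbb::task_scheduler_init, and SchedulerRegistry is a hypothetical name. It only shows the bookkeeping, that is, a mutex-protected map from context pointer to an owned per-context scheduler object.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

struct FakeCodecContext {};  // hypothetical stand-in for AVCodecContext

// Hypothetical stand-in for tbb::task_scheduler_init; it only remembers the
// requested thread count.
struct FakeSchedulerInit
{
    explicit FakeSchedulerInit(int n) : threads(n) { assert(n > 0); }
    int threads;
};

class SchedulerRegistry
{
public:
    // Would be called from thread_init: create and remember a scheduler object.
    void attach(const FakeCodecContext* ctx, int threads)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[ctx] = std::make_unique<FakeSchedulerInit>(threads);
    }

    // Returns the thread count registered for ctx, or 0 if there is none.
    int threads_for(const FakeCodecContext* ctx) const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(ctx);
        return it == map_.end() ? 0 : it->second->threads;
    }

    // Would be called from thread_free: destroy the scheduler object.
    // Returns false if nothing was registered for ctx.
    bool detach(const FakeCodecContext* ctx)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return map_.erase(ctx) == 1;
    }

private:
    mutable std::mutex mutex_;
    std::map<const FakeCodecContext*, std::unique_ptr<FakeSchedulerInit>> map_;
};
```

Owning the objects through std::unique_ptr means that detaching a context (or destroying the registry) releases the scheduler object automatically, so thread_free would not need a separate delete.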