 

Jitter buffer implementation in Java

I am looking for an adaptive jitter buffer implementation in Java for my VoIP application. I wrote a fixed jitter buffer for my application, but I run into either buffer underrun or buffer overrun issues because of poor network quality.

Are there any Java-based implementations of an adaptive jitter buffer available to use directly in my application, or to use as a reference?

Any help would be greatly appreciated.

Thanks

asked Feb 29 '12 by over.drive

1 Answer

I've been working on this very problem (in C though) for a while, and just when I think I've got it, the internet gets busy or otherwise changes somewhere and boom! Some choppy audio again. Well. I'm pretty sure I've got it licked now.

Using the algorithm below, I have really, really good sounding audio quality. I've compared it to other softphones I've run under the same network conditions and it performs noticeably better.

The first thing I do is try to determine if the PBX or other SIP proxy to which we're REGISTERing is on a local network with the UA (softphone) or not.

If it is, I define my jitterbuffer as 100ms; if not, I use 200ms. That way I limit latency where I can; even 200ms does not produce any noticeable conversational trouble or over-talk.
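A rough sketch of one way to make that "local or not" decision is to check whether the proxy sits in an RFC 1918 private address range; this particular heuristic and the helper names are just an assumption for illustration, not the exact check used above:

#include <stdint.h>

/* Hypothetical heuristic: treat an RFC 1918 private address as "local". */
static int looks_local(uint32_t ip_host_order)
{
    return ((ip_host_order & 0xFF000000u) == 0x0A000000u)   /* 10.0.0.0/8     */
        || ((ip_host_order & 0xFFF00000u) == 0xAC100000u)   /* 172.16.0.0/12  */
        || ((ip_host_order & 0xFFFF0000u) == 0xC0A80000u);  /* 192.168.0.0/16 */
}

/* Jitter buffer depth in milliseconds, as described above. */
static int choose_jitterbuffer_ms(uint32_t proxy_ip_host_order)
{
    return looks_local(proxy_ip_host_order) ? 100 : 200;
}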

So. Then I use whatever system counter is available, e.g. GetTickCount64() on Windows, to record in a variable the millisecond-precision time at which my first packet came in for playback. Let's call that variable "x".

Then when ( ( GetTickCount64() - x ) > jitterbuffer ) is true, I begin playback on that buffer.
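As a rough sketch of that startup gate (assuming Windows' GetTickCount64() and a jitterbuffer value of 100 or 200 chosen as above; the function and variable names are just for illustration):

#include <windows.h>

static ULONGLONG x;          /* millisecond time the first packet arrived  */
static int jitterbuffer;     /* 100 or 200, chosen as described above      */
static int playing = 0;

void on_first_packet(void)
{
    x = GetTickCount64();    /* start the jitter buffer "fill" timer */
}

void maybe_start_playback(void)
{
    /* Hold playback back until jitterbuffer milliseconds have elapsed,
       letting frames pile up in the buffer first. */
    if (!playing && (GetTickCount64() - x) > (ULONGLONG)jitterbuffer) {
        playing = 1;
        /* begin draining buffered frames from here on */
    }
}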

Straightforward fixed-length jitter buffer implementation. Here's the tricky bit.

While I'm decoding the RTP frame (e.g. from muLaw to PCM) to buffer it for playback, I calculate the AVERAGE ABSOLUTE amplitude of the audio frame and save it along with the frame for playback.

I do this by having a struct like so:

typedef struct tagCCONNECTIONS {
    char binuse;
    struct sockaddr_in client;
    SOCKET socket;
    unsigned short media_index;
    UINT32 media_ts;
    long ssrc;
    unsigned long long lasttimestamp;
    int frames_buffered;
    int buffer_building;
    int starttime;
    int ssctr;
    struct {
            short pcm[160];
    } jb[AUDIO_BUFFER]; /* Buffered Audio frame array */
    char jbstatus[AUDIO_BUFFER]; /* An array containing the status of the data in the CCONNECTIONS::jb array */
    char jbsilence[AUDIO_BUFFER];
    int jbr,jbw; /* jbr = read position in CCONNECTIONS::jb array, jbw = write position */
    short pcms[160];
    char status;
    /* These members are only used to buffer playback */
    PCMS *outraw;
    char *signal;
    WAVEHDR *preparedheaders;
    /**************************************************/
    DIALOGITEM *primary;
    int readptr;
    int writeptr;
} CCONNECTIONS;

Ok, notice the tagCCONNECTIONS::jbsilence[AUDIO_BUFFER] struct member. This way, for every decoded audio frame in tagCCONNECTIONS::jb[x].pcm[], there's corresponding data as to whether that frame is audible or not.

This means that for every audio frame that is about to be played, we have the info as to whether that frame is audible.

Also...

#define READY 1
#define EMPTY 0

The tagCCONNECTIONS::jbstatus[AUDIO_BUFFER] field lets us know whether the particular audio frame we're thinking about playing is READY or EMPTY. In the theoretical case of buffer underflow, it COULD be empty, in which case we would ordinarily wait for it to be READY, then begin playing...

Now, in my routine that plays the audio, I have two main functions: one called pushframe(), and one called popframe().

My thread that opens the network connection and receives the RTP calls pushframe(), which converts the muLaw to PCM, calculates the AVERAGE ABSOLUTE amplitude of the frame, marks it as silent if it's inaudibly quiet, and marks the ::jbstatus[x] as READY.
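A rough sketch of what such a pushframe() might look like, reusing the CCONNECTIONS struct above, 160-sample 20 ms frames, and the standard G.711 muLaw expansion (the decode helper, the omission of locking, and the way frames_buffered is maintained are assumptions for illustration):

/* Standard G.711 muLaw expansion to 16-bit linear PCM. */
static short mulaw_to_pcm(unsigned char u)
{
    u = ~u;
    int t = ((u & 0x0F) << 3) + 0x84;    /* mantissa plus bias           */
    t <<= (u & 0x70) >> 4;               /* apply the segment (exponent) */
    return (short)((u & 0x80) ? (0x84 - t) : (t - 0x84));
}

/* Called by the network thread for each received RTP payload
   (160 muLaw bytes = one 20 ms frame at 8 kHz). */
void pushframe(CCONNECTIONS *cc, const unsigned char *payload)
{
    int i, jbn = cc->jbw;
    long total = 0;

    for (i = 0; i < 160; i++) {
        short s = mulaw_to_pcm(payload[i]);
        cc->jb[jbn].pcm[i] = s;
        total += (s < 0) ? -s : s;       /* accumulate absolute amplitude */
    }

    /* AVERAGE ABSOLUTE amplitude; below 200 counts as "silent" here. */
    cc->jbsilence[jbn] = ((total / 160) < 200) ? 1 : 0;
    cc->jbstatus[jbn]  = READY;

    cc->jbw = (cc->jbw + 1) % AUDIO_BUFFER;   /* advance write position */
    cc->frames_buffered++;
}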

Then in my thread that plays the audio, we first check if the jitterbuffer time has expired, again, by

if ( ( GetTickCount64() - x ) > jitterbuffer ) {...}

Then, we check if the next frame to be played is READY (meaning it has indeed been filled).

Then we check if the frame AFTER THAT FRAME is READY also, and IF IT IS AUDIBLE OR SILENT!

*** IMPORTANT

Basically, we know that a 200ms jitter buffer can hold ten 20ms audio frames.

If at any point after the initial 200ms jitter buffer delay (saving up audio) the number of audio frames we have queued drops below 10 (or jitterbuffer / 20), we go into what I call "buffer_building" mode. In that mode, if the NEXT frame after the one we're scheduled to play is "silent", we go ahead and play the frame we're on now, but we don't play the silent frame that follows; instead we tell the program that the jitter buffer isn't full yet and is still 20 milliseconds away from being full, and we use that period of silence to wait on an inbound frame to refill our buffer.

tagCCONNECTIONS::lasttimestamp = GetTickCount64() - (jitterbuffer-20);

This will have a period of complete silence during what would have been "assumed" silence, but allows the buffer to replenish itself. Then, when I have all 10 frames full again, I come out of "buffer_building" mode and just play the audio.

I enter "buffer_building" mode even when we're short one frame from a full buffer because a long-winded person could be talking and there could not be much silence. That could deplete a buffer quickly even during "buffer_building" mode.

Now..."What is silence?" I hear you ask. In my messing around, I've hard coded silence as any frame with an AVERAGE ABSOLUTE 16 bit PCM amplitude of less than 200. I figure this as follows:

int total_pcm_val = 0;
/* int jbn = whatever frame we're on */
for (i = 0; i < 160; i++) {                      /* 160 samples = one 20 ms frame at 8 kHz */
    total_pcm_val += abs(cc->jb[jbn].pcm[i]);    /* abs() from <stdlib.h> */
}
total_pcm_val /= 160;                            /* AVERAGE ABSOLUTE amplitude */
if (total_pcm_val < 200) {
    cc->jbsilence[jbn] = 1;
} else {
    cc->jbsilence[jbn] = 0;
}

Now, I'm actually planning on keeping an overall average amplitude for that connection and experimenting: if the current audio frame's amplitude is 5% or less of that overall average, we consider the frame silent, or maybe 2%... I don't know the right figure yet, but that way, if there's a lot of wind or background noise, the definition of "silence" can adapt. I still have to play with that, but I believe it's the key to replenishing your jitter buffer.

Do it when there's not important information to listen to, and keep the actual information (their voice) crystal clear.
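That adaptive threshold is only described as a plan above; a minimal sketch of one way to do it, using a per-connection exponential running average and a 5% cutoff (the 1/32 update weight and the helper itself are assumptions for illustration):

/* Hypothetical adaptive silence test: keep a running average of the
   per-frame AVERAGE ABSOLUTE amplitude and call a frame "silent" when it
   drops to 5% or less of that running average. */
int frame_is_silent(long *running_avg, int frame_avg_amplitude)
{
    if (*running_avg == 0)
        *running_avg = frame_avg_amplitude;           /* first frame seeds it */
    else
        *running_avg += (frame_avg_amplitude - *running_avg) / 32;

    return frame_avg_amplitude <= *running_avg / 20;  /* 5% of the average */
}

The running average would live per connection (e.g. alongside the other CCONNECTIONS members), so a noisy line raises its own notion of "silence" while a quiet one lowers it.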

I hope this helps. I'm a bit scatter-brained when it comes to explaining things, but I'm very, very pleased with how my VoIP application sounds.

answered Sep 21 '22 by Justin Jack