Introduction to jVoiceBridge

Introduction

The application jVoiceBridge is a software-only audio mixer that handles Voice over IP (VoIP) audio communication and mixing, for tasks such as conference calls, voice chat, speech detection, and audio for 3D virtual environments. Currently it is most commonly known for its use in the Wonderland project, a 3D virtual environment developed by Sun.

In this article I will give an introduction to the design of the jVoiceBridge software. This is not a user manual, and I will not give detailed operating instructions; if you’re looking for that information, have a look at the jVoiceBridge wiki. Instead, I’ve been reverse engineering the source code, and I will provide some diagrams and explanations which I drew up during my exploration. The article won’t cover every detail, but it should be enough to give you a general idea of how the code is structured.

Many thanks go out to Joe Provino from Sun for patiently answering all my questions!

Important terms and abbreviations

Call: a communication session between jVoiceBridge and a remote party. It is identified by a Call id.
CallParticipant: A CallParticipant is used to hold the parameters for a call.  There’s one CallParticipant, one CallHandler and one ConferenceMember per call. The CallHandler handles the call setup and call status.  The ConferenceMember deals with the call being part of a conference. It has a MemberSender and a MemberReceiver.
Conference: a group of people talking to each other. jVoiceBridge maintains a Conference by receiving data from all ConferenceMembers, mixing it together, and outputting the mix to each member. ConferenceMembers may enhance their conferencing experience by creating private mixes or WhisperGroups.
ConferenceMember: someone participating in a conference. A ConferenceMember has a receiving part (MemberReceiver) and a sending part (MemberSender).
ConferenceSender: The job of the conference sender is to send a voice data packet to each conference member every 20 ms.
ConferenceReceiver: The ConferenceReceiver receives data from everyone who’s talking.
DtmfSuppression: DtmfSuppression means that any keypresses are filtered out of the conference mix. The static variable CallHandler.dtmfDetection defaults to true. If the CallParticipant also has dtmfSuppression set to true (the default), then when a DTMF key is detected the member’s input buffer is flushed so that others won’t hear the key.
Private Mix: Each call has MixDescriptors which describe what audio the call should hear. A descriptor has information about the source of the audio and how loud and in what direction the call should hear the audio.
In a simple conference, there’s a “common mix” to which the data from all calls in the conference are added.  In order to not hear yourself when you talk, your own audio needs to be subtracted out of the common mix when the data is sent to you. So you have a descriptor for the common mix with a volume level of one and another descriptor for your own audio source with a volume level of minus 1.
But let’s say you want to raise the volume of my audio because I speak too softly. This is what we call a private mix.  You would need another descriptor with my audio source and a volume level of say .2.  It’s .2 because my audio is already in the common mix at a volume of 1.  So when everything is added together, the volume you hear from me will be 1.2.
If there are a lot of private mixes, the common mix may not make any sense to maintain. jVoiceBridge supports this, too.  In that case, you only hear audio sources for which you have a private mix (i.e. a mix descriptor).
Treatment: an announcement which is played when a certain event occurs e.g. when you’re the first party in a conference.
WhisperGroup: people in a conference can create a WhisperGroup if they want to talk in private. Their audio will not be sent to the conference.
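
The Private Mix arithmetic described above can be made concrete with a small sketch. This is not the bridge's own code: MixDescriptorSketch and PrivateMixSketch are hypothetical names, samples are plain ints, and I assume a descriptor is nothing more than an audio source plus a volume factor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a mix descriptor: an audio source plus a volume factor. */
final class MixDescriptorSketch {
    final int[] source;   // samples of one audio source (or of the common mix)
    final double volume;  // 1.0 adds the source, -1.0 subtracts it, 0.2 adds 20% extra

    MixDescriptorSketch(int[] source, double volume) {
        this.source = source;
        this.volume = volume;
    }
}

final class PrivateMixSketch {
    /** The common mix is the sample-by-sample sum of every call's audio. */
    static int[] commonMix(List<int[]> calls, int length) {
        int[] sum = new int[length];
        for (int[] call : calls) {
            for (int i = 0; i < length; i++) {
                sum[i] += call[i];
            }
        }
        return sum;
    }

    /** What one call hears: every descriptor scaled by its volume, then added up. */
    static int[] mix(List<MixDescriptorSketch> descriptors, int length) {
        double[] acc = new double[length];
        for (MixDescriptorSketch d : descriptors) {
            for (int i = 0; i < length; i++) {
                acc[i] += d.source[i] * d.volume;
            }
        }
        int[] out = new int[length];
        for (int i = 0; i < length; i++) {
            out[i] = (int) Math.round(acc[i]);
        }
        return out;
    }

    /** The scenario from the text: "you" raise "me" by 0.2 and never hear yourself. */
    static int[] whatYouHear(int[] me, int[] you, int[] third) {
        int[] common = commonMix(Arrays.asList(me, you, third), me.length);
        List<MixDescriptorSketch> yours = new ArrayList<>();
        yours.add(new MixDescriptorSketch(common, 1.0));  // the common mix
        yours.add(new MixDescriptorSketch(you, -1.0));    // subtract your own audio
        yours.add(new MixDescriptorSketch(me, 0.2));      // private mix: me at 1.2 in total
        return mix(yours, me.length);
    }
}
```

With me = {100, 200}, you = {10, 20} and a third party = {1, 2}, whatYouHear returns {121, 242}: my audio at 1.2 times its level plus the third party, and nothing of your own voice. In the mode without a common mix, the first two descriptors would simply be absent and only the explicit private mixes would be added up.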

Overview

The following pictures give an overview of the most important classes and packages related to handling commands which are received over the telnet connection.

jvoicebridge1

The RequestHandler is the main entry point into jVoiceBridge. As soon as the bridge accepts a new telnet connection, it creates a new RequestHandler thread for reading socket input data. The RequestHandler dispatches the data to the RequestParser, until a newline is received. At that point, the RequestHandler creates the appropriate CallHandler for handling the call.
A request is a command followed by its parameters, which are separated by colons. The RequestParser distinguishes different types of requests. Some requests are parsed and handled immediately (parseImmediateRequest), while others are parsed and executed upon reception of a newline (parseCallParameters).
If the request matches none of the immediate commands, the parseImmediateRequest method returns false, and RequestParser will just try to parse call parameters from the supplied input. Call parameters are collected in a CallParticipant object.
If none of the call parameters match the given request, a final attempt is made to parse the request as a two-party call request. If that fails, an exception is thrown (“Illegal request”). Communication towards the end user is handled by the RequestHandler’s writeToSocket method.
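
The fallthrough described above can be sketched roughly as follows. This is my own simplified illustration, not the real RequestParser: the class name, the two small command sets and the use of IllegalArgumentException are assumptions, and the two-party call attempt is only marked by a comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RequestParserSketch {
    // Illustrative subsets only; the real bridge knows many more names.
    private static final Set<String> IMMEDIATE =
            new HashSet<>(Arrays.asList("createConference", "mute"));
    private static final Set<String> CALL_PARAMETERS =
            new HashSet<>(Arrays.asList("phoneNumber", "conferenceId"));

    /** Plays the role of the CallParticipant: call parameters collected so far. */
    final Map<String, String> callParameters = new LinkedHashMap<>();
    /** Immediate requests that were "executed", recorded for inspection. */
    final List<String> executed = new ArrayList<>();

    /** Parses one "command = value" line, trying each request type in turn. */
    void parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Illegal request: " + line);
        }
        String command = line.substring(0, eq).trim();
        String value = line.substring(eq + 1).trim();
        if (parseImmediateRequest(command, value)) {
            return;
        }
        if (parseCallParameters(command, value)) {
            return;
        }
        // The real parser would still try a two-party call request here.
        throw new IllegalArgumentException("Illegal request: " + line);
    }

    private boolean parseImmediateRequest(String command, String value) {
        if (!IMMEDIATE.contains(command)) {
            return false;
        }
        // Parameters of an immediate request are separated by colons.
        executed.add(command + Arrays.asList(value.split(":")));
        return true;
    }

    private boolean parseCallParameters(String command, String value) {
        if (!CALL_PARAMETERS.contains(command)) {
            return false;
        }
        // Stored unsplit: a value such as sip:joe@host.com contains a colon itself.
        callParameters.put(command, value);
        return true;
    }
}
```

Feeding it "createConference = testid:PCM/44100/2" records an immediate request with the parameters testid and PCM/44100/2, while "phoneNumber = sip:joe@host.com" only ends up in the collected call parameters until the request is completed.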

The jVoiceBridge wiki provides an extensive list of all immediate requests and all call parameters. Some requests are related to actual call behaviour, for example the createConference or mute requests. Others are bridge configuration commands, for example defaultSipProxy or rtpTimeout.
To start understanding the design, I recommend focusing on a few important requests. The following requests are dispatched by the RequestParser to static ConferenceManager methods:

  • createConference
  • createWhisperGroup
  • destroyWhisperGroup
  • endConference
  • removeConference
  • recordConference
  • playTreatmentToConference

The following requests are dispatched by the RequestParser to static CallHandler methods:

  • mute
  • muteWhisperGroup
  • muteConference
  • playTreatmentToCall

The following requests are dispatched by the RequestParser to a specific CallHandler instance:

  • privateMix: delegated to the ConferenceMember of the CallHandler identified by the supplied call id.
  • removeCallFromWhisperGroup: delegated to the ConferenceMember of the CallHandler of the supplied call id.
  • addCallToWhisperGroup: delegated to the CallHandler for the supplied call id.
  • whisper: delegated to the ConferenceMember of the CallHandler of the supplied call id.

The following requests are some examples of bridge configuration commands:

  • rtpTimeout
  • sendSipUriToProxy
  • defaultSipProxy
  • voIPGateways

An example: starting a new Conference

I recommend that you have the jVoiceBridge source code in front of you while reading this section. I will use a typical example to walk through the code: the creation of a new conference. We start by imagining that a user has started a telnet connection to the bridge, and entered the following request:

$telnet_prompt> createConference = testid:PCM/44100/2

No conference with this id exists yet, so a new ConferenceManager is created and added to the static list of ConferenceManagers kept by the ConferenceManager class. The media settings and name are set, and a ConferenceSender and a ConferenceReceiver are created. The conference id is stored in the CallParticipant. The following sequence diagram illustrates this behaviour.

jvoicebridge2

In case of synchronous mode, all that is returned to the telnet client is “END — SUCCESS“. If synchronousMode is false, RequestHandler will only respond to the telnet client when one of its CallEventListener methods is called.

$telnet_prompt> (newline)

The newline will cause the RequestHandler to call validateAndAdjustParameters first. That will report to the user (among other things) that no phone number was specified! The user specifies one by typing:

$telnet_prompt> phoneNumber = sip:joe@host.com

…followed by:

$telnet_prompt> (newline)

RequestHandler will determine what to do next, by looking at the CallParticipant. Its cp.remoteMediaInfo is still null, cp.migrateCall is false, secondPartyNumber was not set (only set for a 2-party call request or a call migration), so an OutgoingCallHandler is created and started.

The OutgoingCallHandler


The following picture shows some important classes which are involved during creation of a conference.

jvoicebridge3

OutgoingCallHandler asks the ConferenceManager for this CallParticipant’s conferenceManager, which has just been created. OutgoingCallHandler calls conferenceManager.joinConference(cp), upon which the conferenceManager sets some audio treatments in the CallParticipant and creates a new ConferenceMember object for this CallParticipant. The ConferenceMember is then:

  • added to the conferenceManager’s list of conference members,
  • added to the conferenceManager’s ConferenceReceiver, so that it is registered with a java.nio.channels.DatagramChannel. Notice that the ConferenceReceiver thread starts as soon as it’s created,
  • returned to the OutgoingCallHandler.

The OutgoingCallHandler maintains a static Vector of active calls. It adds itself to this Vector, so the right CallHandler can be found later when a call is hung up.
The described sequence of events is illustrated in the following picture:

jvoicebridge4
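
The idea of a shared list of active calls, used to find the right handler again at hang-up time, could look roughly like this. It is a minimal sketch under my own assumptions: CallHandlerSketch and its methods register, findCall and hangup are hypothetical names, and the real handler of course carries far more state than a call id.

```java
import java.util.Vector;

/** Hypothetical, stripped-down call handler that only knows its call id. */
final class CallHandlerSketch {
    // Shared by all handlers, like the static Vector described above.
    private static final Vector<CallHandlerSketch> ACTIVE_CALLS = new Vector<>();

    final String callId;

    CallHandlerSketch(String callId) {
        this.callId = callId;
    }

    /** Called once the handler has joined its conference. */
    void register() {
        ACTIVE_CALLS.add(this);
    }

    /** Looks up the handler for a call id, or returns null if the call is unknown. */
    static CallHandlerSketch findCall(String callId) {
        synchronized (ACTIVE_CALLS) {  // iterating over a Vector needs explicit locking
            for (CallHandlerSketch handler : ACTIVE_CALLS) {
                if (handler.callId.equals(callId)) {
                    return handler;
                }
            }
        }
        return null;
    }

    /** Hangs up a call: removes its handler from the list; false if it was not found. */
    static boolean hangup(String callId) {
        CallHandlerSketch handler = findCall(callId);
        return handler != null && ACTIVE_CALLS.remove(handler);
    }
}
```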

The OutgoingCallHandler then examines if we’d specified any gateways (not in this example) and calls placeCall(). The following sequence diagram shows what happens then.

jvoicebridge5

The Bridge’s default protocol is used, so a SipTPCCallAgent (Third Party Call) is created. Upon construction, the SipTPCCallAgent retrieves and remembers the MediaInfo from the OutgoingCallHandler’s conferenceManager. It creates a SipUtil, which in turn constructs an SdpManager based on the MediaInfo passed in (see below for more info).

Finally, the OutgoingCallHandler calls SipTPCCallAgent.initiateCall();

The SipTPCCallAgent changes to state CallState.INVITED and asks the OutgoingCallHandler for its memberReceiver’s InetSocketAddress. This address and the CallParticipant are then passed to the SipUtil.sendInvite() method. That method returns a clientTransaction, which the SipTPCCallAgent remembers. The SipTPCCallAgent adds itself as a listener for this call id to the (Jain) SIPServer.

The SipUtil first generates the SDP content. This is partially done by the SdpManager, which in turn gets most of the information it needs from the MediaInfo object.

With this SDP, the SipUtil then populates a bunch of SIP headers (from, to, requestUri, Via headers, etc.), asks its sipProvider for a new call id and a new clientTransaction, and sends the INVITE. The To: field is set to cp.getPhoneNumber().
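
To give an idea of what such SDP content looks like, here is a minimal sketch that builds an offer for a single audio stream from the kind of values a MediaInfo holds. It is not the SdpManager's code: the class name, the origin and session-name values, and the payload number used in the example are my own assumptions. The address and port stand for the memberReceiver's InetSocketAddress mentioned above.

```java
final class SdpSketch {
    /**
     * Builds a minimal SDP offer for one audio stream.
     * payload is the RTP payload type; dynamic types (96 and up) are described by rtpmap.
     */
    static String offer(String host, int rtpPort, int payload,
                        String encoding, int sampleRate, int channels) {
        String eol = "\r\n";  // SDP lines end in CRLF
        return "v=0" + eol
             + "o=jvb 1 1 IN IP4 " + host + eol   // origin; "jvb 1 1" is made up
             + "s=conference" + eol                // session name; made up
             + "c=IN IP4 " + host + eol            // where we want to receive media
             + "t=0 0" + eol                       // session without fixed start or end
             + "m=audio " + rtpPort + " RTP/AVP " + payload + eol
             + "a=rtpmap:" + payload + " " + encoding + "/" + sampleRate + "/" + channels + eol;
    }
}
```

For the conference created earlier, SdpSketch.offer("192.0.2.10", 5004, 96, "PCM", 44100, 2) would yield a media line "m=audio 5004 RTP/AVP 96" followed by "a=rtpmap:96 PCM/44100/2". The payload number 96 is only a placeholder; the bridge's MediaInfo table decides the real value.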

The MediaInfo object

The com.sun.voip.MediaInfo object is basically a collection of supported RTP settings. It offers a static findMediaInfo method which can be used to find RTP settings based on a payload byte, or based on encoding, sampleRate and number of channels.
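
A lookup of this kind can be sketched as a small table with two finder methods. This is only an illustration of the idea: MediaInfoSketch and its find methods are hypothetical, and apart from PCMU/8000/1 (static payload type 0 in the RTP audio profile) the payload numbers below are values I picked for the example, not the bridge's real table.

```java
import java.util.Arrays;
import java.util.List;

final class MediaInfoSketch {
    final int payload;
    final String encoding;
    final int sampleRate;
    final int channels;

    MediaInfoSketch(int payload, String encoding, int sampleRate, int channels) {
        this.payload = payload;
        this.encoding = encoding;
        this.sampleRate = sampleRate;
        this.channels = channels;
    }

    // Payload 0 is the standard value for PCMU/8000/1; 96 and 97 are
    // hypothetical dynamic payload numbers chosen for this sketch.
    private static final List<MediaInfoSketch> SUPPORTED = Arrays.asList(
            new MediaInfoSketch(0, "PCMU", 8000, 1),
            new MediaInfoSketch(96, "PCM", 16000, 2),
            new MediaInfoSketch(97, "PCM", 44100, 2));

    /** Finds the settings for an RTP payload type. */
    static MediaInfoSketch find(int payload) {
        for (MediaInfoSketch info : SUPPORTED) {
            if (info.payload == payload) {
                return info;
            }
        }
        throw new IllegalArgumentException("Unsupported payload " + payload);
    }

    /** Finds the settings for an encoding, sample rate and channel count. */
    static MediaInfoSketch find(String encoding, int sampleRate, int channels) {
        for (MediaInfoSketch info : SUPPORTED) {
            if (info.encoding.equalsIgnoreCase(encoding)
                    && info.sampleRate == sampleRate
                    && info.channels == channels) {
                return info;
            }
        }
        throw new IllegalArgumentException(
                "Unsupported media " + encoding + "/" + sampleRate + "/" + channels);
    }
}
```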

Adding other parties to the conference

To add other parties to the conference, first select the conference:

$telnet_prompt> conferenceId = [some_id]

Then simply type in the phone number of the person you’d like to be added:

$telnet_prompt> phoneNumber = sip:joe@host.com

Complete your request with a newline:

$telnet_prompt> (newline)
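
If you would rather send these lines from a program than type them, the request text can be assembled like this. It is a small sketch of my own: AddPartyRequestSketch is a hypothetical helper, and it only builds the text. Opening the socket to the bridge is left out because host and port depend on your setup.

```java
final class AddPartyRequestSketch {
    /** Returns the exact text to send: two parameter lines, then an empty line. */
    static String build(String conferenceId, String phoneNumber) {
        return "conferenceId = " + conferenceId + "\n"
             + "phoneNumber = " + phoneNumber + "\n"
             + "\n";  // the empty line tells the bridge the request is complete
    }
}
```

The returned string can be written to the output stream of the connection in one go; the final empty line plays the role of the (newline) step above.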

Receiving, mixing and sending audio

The following sequence diagram illustrates the process of sending and receiving audio in a conference.

jvoicebridge6

Each conference has a ConferenceReceiver for receiving socket data, which runs in a separate thread. Attached to each socket is a list of MemberReceivers, corresponding to the parties in the conference. This mechanism uses java.nio: each SelectionKey corresponds to a MemberReceiver. When data arrives on a socket, the ConferenceReceiver dispatches it to the receive() method of each MemberReceiver identified by the selected keys. Each MemberReceiver in turn calls its own receiveMedia method. Eventually the data is stored in the MemberReceiver’s currentContribution variable, waiting to be mixed and sent.
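
The selector mechanism can be tried out in isolation with a few lines of java.nio. The sketch below is my own reduction of the idea, not the ConferenceReceiver itself: it assumes one MemberReceiver per SelectionKey (stored as the key's attachment), uses hypothetical class names, and sends itself a single datagram over the loopback interface so that it runs without a bridge.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

/** Hypothetical stand-in for a MemberReceiver: keeps the last packet it was handed. */
final class MemberReceiverSketch {
    byte[] currentContribution;

    void receive(ByteBuffer packet) {
        currentContribution = new byte[packet.remaining()];
        packet.get(currentContribution);
    }
}

final class ConferenceReceiverSketch {
    /** Sends one datagram to a registered channel, dispatches it, returns what the member stored. */
    static byte[] receiveOnce(byte[] payload) {
        try (Selector selector = Selector.open();
             DatagramChannel memberChannel = DatagramChannel.open();
             DatagramChannel remoteParty = DatagramChannel.open()) {
            // Bind to any free port on the loopback interface.
            memberChannel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            memberChannel.configureBlocking(false);

            // The attachment links the selection key to its member receiver.
            MemberReceiverSketch member = new MemberReceiverSketch();
            memberChannel.register(selector, SelectionKey.OP_READ, member);

            remoteParty.send(ByteBuffer.wrap(payload), memberChannel.getLocalAddress());

            ByteBuffer buffer = ByteBuffer.allocate(1500);
            if (selector.select(2000) > 0) {  // wait up to two seconds for data
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    buffer.clear();
                    ((DatagramChannel) key.channel()).receive(buffer);
                    buffer.flip();
                    ((MemberReceiverSketch) key.attachment()).receive(buffer);
                }
            }
            return member.currentContribution;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real bridge the select loop runs forever in the ConferenceReceiver thread and many channels are registered with the same selector; the dispatch step per selected key is the part this sketch is meant to show.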

The class diagram below shows the aforementioned classes and their relations.

jvoicebridge7

The central object for mixing audio is the MixManager. It contains an ArrayList of mixDescriptors, which can be added and removed by calling addMix and removeMix. A ConferenceMember calls mixManager.addMix() to add itself during its initialize() method. This happens when a call has been established (by the SipTPCCallAgent) and we know the port at which the member (CallParticipant) listens for data. At that moment, both the ConferenceMember’s MemberSender and MemberReceiver are also initialized.

Each conference has a ConferenceSender for sending data (but see my remark further down about the loneSender). The ConferenceSender runs as a separate thread and continuously calls its own sendDataToConferences() method. Once a list of all MemberSenders has been constructed, the ConferenceSender calls each member’s sendData() method. That method makes the essential call memberSender.sendData(mixManager.mix()), which causes this ConferenceMember’s MixManager to gather the currentContributions of all its mixDataSources and mix them, after which the MemberSender outputs the result as an RTP stream to this member’s socket.
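
Since the sender emits one packet per member every 20 ms, it helps to know how much audio such a packet carries. The arithmetic below is a small sketch of my own; the class name is hypothetical, and the byte count assumes 16-bit linear samples, which I have not verified for every encoding the bridge supports.

```java
final class FrameSizeSketch {
    static final int PACKET_MILLIS = 20;  // one packet every 20 ms

    /** Samples (summed over all channels) carried by one 20 ms packet. */
    static int samplesPerPacket(int sampleRate, int channels) {
        return sampleRate * channels * PACKET_MILLIS / 1000;
    }

    /** Audio bytes per packet, assuming 16-bit (2-byte) linear samples. */
    static int bytesPerPacket(int sampleRate, int channels) {
        return samplesPerPacket(sampleRate, channels) * 2;
    }
}
```

For the PCM/44100/2 conference from the example this gives 1764 samples, or 3528 bytes of audio per packet under the 16-bit assumption; a plain 8000 Hz mono call carries 160 samples per packet.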

It appears that a member is always talking in some whisper group.  For a whisper group with a common mix, a member has a descriptor for the common mix and a descriptor to subtract out its own data. If the member wants to adjust the volume of some other member, then it will have a descriptor for that other member.
For a whisper group without a common mix, a member has descriptors for other members it wants to hear. This is used in Project Wonderland, where an avatar should only hear others in range. If you and I are close enough to each other and not near anybody else, we’d have one descriptor for each other and that’s it.

Questions and answers

Q: Who owns a conference?
A: There’s no concept of an owner.  If you don’t explicitly create a conference, it will be created when the first call joins and destroyed when the last call leaves.  When you explicitly create a conference it stays until someone destroys it.
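
To make the difference between implicitly and explicitly created conferences concrete, here is a minimal sketch of such a lifecycle. It is my own illustration, not the ConferenceManager's logic: the class and method names are hypothetical, and I assume that explicitly creating a conference that already exists leaves it untouched.

```java
import java.util.HashMap;
import java.util.Map;

final class ConferenceRegistrySketch {
    private static final class Entry {
        final boolean permanent;  // true when the conference was created explicitly
        int calls;

        Entry(boolean permanent) {
            this.permanent = permanent;
        }
    }

    private final Map<String, Entry> conferences = new HashMap<>();

    /** Explicit creation: the conference stays until end() is called. */
    void create(String id) {
        conferences.putIfAbsent(id, new Entry(true));
    }

    /** A call joins; the conference is created implicitly if needed. */
    void join(String id) {
        Entry entry = conferences.get(id);
        if (entry == null) {
            entry = new Entry(false);
            conferences.put(id, entry);
        }
        entry.calls++;
    }

    /** A call leaves; an implicit conference disappears with its last call. */
    void leave(String id) {
        Entry entry = conferences.get(id);
        if (entry == null) {
            return;
        }
        if (entry.calls > 0) {
            entry.calls--;
        }
        if (entry.calls == 0 && !entry.permanent) {
            conferences.remove(id);
        }
    }

    /** Explicit destruction. */
    void end(String id) {
        conferences.remove(id);
    }

    boolean exists(String id) {
        return conferences.containsKey(id);
    }
}
```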

Q: Suppose Joe and Yvo are in a conference with a bunch of other people, and I want to create a whisper group for the two of us because we have some offline business to discuss. I guess I should use the createWhisperGroup(conferenceId, whisperGroupId) and addCallToWhisperGroup(whisperGroupId, callId) telnet commands. But how do I know the conferenceId and Yvo’s and Joe’s call ids?
A: If you’re using telnet commands, conferenceInfo (or ci) will display this information. If you’re doing this programmatically, you would probably want to specify the callId for each call and you’ll have to pick something for the conferenceId as well.

Q: Is SDP renegotiation supported? Suppose I’m in a conference but I’m turning on my webcam, and I’d like to do a new handshake to renegotiate SDP because I can now handle video. Or my broadband connection suddenly deteriorated, so I’d like a lower sample rate for my audio. Is something like that possible?
A: I don’t think we ever fully supported re-INVITE.  With Project Wonderland we have an Audio Quality setting.  That changes the settings and hangs up the call and places the call again.  So we’re managing that at a higher level than the bridge.

Q: What happens exactly when an announcement is played to a conference? Is a treatment “just another call” ?
A: A treatment is an AudioSource so it can be used as the audio source in a MixDescriptor. To play a treatment to a call, the treatment gets added to the ConferenceMember’s memberTreatments ArrayList. When the next treatment is started, a mix descriptor is added for the member with the treatment as the AudioSource. When the member’s saveContribution method is called, the current treatment’s saveContribution method is called. When it comes time to mix, the mix manager just goes through all the mix descriptors like always. When the treatment finishes, the mix descriptor for the treatment is removed.

Q: How does the integrated STUN server work?
A: We modified the NIST SIP Stack so that it multiplexes the SIP port (usually 5060) with STUN. If data is received on the SIP Port and the first short is 0001 (STUN Binding Request), then the STUN server replies.
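
The check on the first short is easy to picture in code. The following is a minimal sketch of the test only, with a hypothetical class name; it is not the modified NIST stack. It works because a SIP message starts with ASCII text, so its first byte is never zero.

```java
final class StunDemuxSketch {
    /** True if the datagram starts with the 16-bit value 0x0001 (STUN Binding Request). */
    static boolean isStunBindingRequest(byte[] data, int length) {
        if (length < 2) {
            return false;
        }
        // Combine the first two bytes in network (big-endian) order.
        int firstShort = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        return firstShort == 0x0001;
    }
}
```

Anything for which this returns false would be handed to the SIP stack as before.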

Q: What does call migration mean?
A: It’s just a way to transfer a call to another phone without leaving the conference. If I’m at work using my office extension, I can migrate to my cell phone and people in the conference won’t notice except for the change in audio quality.

Points of improvement and other details

The following issues were just some things I noticed while I was investigating the code. Some of them are points for improvement:

  • Some data is actually handled by the RequestHandler and not dispatched to the RequestParser, for example the “CANCEL” request, which is dispatched straight to the OutgoingCallHandler or a “DETACH” request, which just closes the socket.
  • Some CallHandler methods are static while others are non-static. The non-static methods are invoked by e.g. the RequestParser after looking up the correct CallHandler instance, based on a call id which was passed as a command parameter. Some static methods, however, do exactly the same thing, e.g. recordFromMember.
  • There is no decent exception handling in the RequestParser yet. Any exception which occurs while parsing the supplied parameters is caught and rethrown as a ParseException. However, all ParseExceptions are caught on the highest level and ignored, not even logged. This is due to the way the parser works: it is just a fallthrough mechanism that tries the next kind of request when parsing the previous one failed.
  • The initial INVITE that is sent out by the SipUtil contains a hardcoded CSeq value of 1.
  • There is the concept of the loneSender, which means a single ConferenceSender is used to send data to all members of all conferences. Joe: “We tried having a separate sender thread for each conference.  Then we decided it would be better to have one thread set up the work to be done and then fire off as many threads as there are processors to do the actual sending.  This is the way we’ve been running.”
  • SipUtil tries to determine the remoteAddress from the CallParticipant by looking at its phoneNumber. However, the remoteAddress variable is not used!
  • The MixManager uses a static WhisperGroup.mixData() method for doing the actual mixing of audio, even when there aren’t actually any whispergroups. A bit confusing.
  • I think there is always at least one WhisperGroup: the conferenceWhisperGroup, created when the WGManager is constructed. A WGManager is constructed when a ConferenceManager is constructed.
  • The migrateToBridge request seems to offer cool functionality to migrate a call to another jVoiceBridge instance, but it may not actually work…

Points for further study

If this walkthrough was useful, as a next step you could try and answer the following questions. If you have the answers, please let me know 😉

  • Examine threading. How many threads are actually running at any one point in time?
  • Event handling. Which event listeners have been defined, how do they work? Which events can be reported and handled inside the bridge?
