In this blog post I shall discuss how WebRTC works in the browser. Here is the full blog series.
- MediaStream: acquisition of audio and video streams
- RTCPeerConnection: communication of audio and video data
- RTCDataChannel: communication of arbitrary application data
Connections between two peers are created using the RTCPeerConnection interface. Once a connection has been established and opened, media streams (MediaStreams) and/or data channels (RTCDataChannels) can be added to the connection. The above APIs are just the tip of the iceberg: signaling, peer discovery, connection negotiation, and security are just a few of the components required to bring it all together.
Peer-to-peer connection setup
The RTCPeerConnection interface is responsible for managing the full life cycle of each peer-to-peer connection. It:
- Manages the full ICE workflow for NAT traversal
- Sends automatic (STUN) keep-alives between peers
- Keeps track of local and remote streams
- Triggers automatic stream renegotiation as required
- Provides the necessary APIs to:
  - generate the connection offer
  - accept the answer
  - query the connection for its current state, and more
In order to establish a successful peer-to-peer connection, the browser must:
1. Notify the other peer of the intent to open a peer-to-peer connection, such that it knows to start listening for incoming packets.
2. Identify potential routing paths for the peer-to-peer connection on both sides of the connection and relay this information between peers.
3. Exchange the necessary information about the parameters of the different media and data streams, protocols, encodings used, and so on.
The built-in ICE protocol performs the necessary routing and connectivity checks (Step 2). However, the delivery of notifications (signaling) (Step 1) and initial session negotiation (Step 3) are left to the application.
Signaling is not defined by WebRTC.
The thinking behind WebRTC call setup has been to fully specify and control the media plane, but to leave the signaling plane up to the application as much as possible. Why? Different applications may prefer to use different protocols, such as:
- Session Initiation Protocol (SIP): application-level signaling protocol, widely used for voice over IP (VoIP) and videoconferencing over IP networks.
- Jingle: signaling extension for the Extensible Messaging and Presence Protocol (XMPP), used for session control of voice over IP and videoconferencing over IP networks. XMPP is an open XML technology for real-time communication, which powers a wide range of applications including instant messaging, presence, and collaboration.
- ISDN User Part (ISUP): signaling protocol used for setup of telephone calls in many public switched telephone networks around the globe. Integrated Services Digital Network (ISDN) is a set of communication standards for simultaneous digital transmission of voice, video, data, and other network services over the traditional circuits of the public switched telephone network.
What is Signaling?
Before any connectivity checks or session negotiation can occur, we must find out if the other peer is reachable and if it is willing to establish the connection.
Signaling is a process of communication to exchange information before setting up a connection.
The caller extends an offer, and the callee returns an answer.
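As a rough illustration of what might travel over a signaling channel, here is a minimal sketch of the messages two peers could exchange. The message shapes and the names `SignalMessage`, `encodeSignal`, and `decodeSignal` are assumptions made up for this example; WebRTC does not prescribe any of them, and a real application would pick whatever format suits its transport.

```typescript
// Hypothetical message shapes for an application-defined signaling channel.
// WebRTC does not standardize these; only the SDP text they carry is standard.
type SignalMessage =
  | { kind: "offer"; from: string; sdp: string }
  | { kind: "answer"; from: string; sdp: string }
  | { kind: "bye"; from: string };

// Serialize a message for whatever transport the application uses.
function encodeSignal(message: SignalMessage): string {
  return JSON.stringify(message);
}

// Parse and validate an incoming message; returns null for anything malformed.
function decodeSignal(raw: string): SignalMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  const kind = record["kind"];
  const from = record["from"];
  const sdp = record["sdp"];
  if (typeof from !== "string") return null;
  if (kind === "bye") return { kind: "bye", from };
  if (typeof sdp !== "string") return null;
  if (kind === "offer") return { kind: "offer", from, sdp };
  if (kind === "answer") return { kind: "answer", from, sdp };
  return null;
}
```

The caller would send an `offer` message, the callee would reply with an `answer`, and either side could validate what it receives before handing the SDP to the browser.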
SDP – Session Description Protocol
SDP is a standard for describing the multimedia content of the connection, such as resolution, formats, codecs, and encryption, so that both peers can understand each other once data is being transferred.
WebRTC uses SDP to define the media characteristics of a call.
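To make the SDP format a little more concrete, the sketch below pulls the media sections and codec names out of a session description. The function name `summarizeSdp` and the shortened sample description are assumptions for illustration; a real browser-generated SDP contains many more lines, and this is not a complete SDP parser.

```typescript
interface MediaSection {
  kind: string;      // e.g. "audio" or "video", taken from the m= line
  protocol: string;  // transport protocol field of the m= line
  codecs: string[];  // codec descriptions from the a=rtpmap lines
}

// Collect each media section (m= line) and the codecs declared under it.
function summarizeSdp(sdp: string): MediaSection[] {
  const sections: MediaSection[] = [];
  let current: MediaSection | null = null;
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    // m=<media> <port> <proto> <fmt> ...
    const media = /^m=(\S+) \S+ (\S+)/.exec(line);
    if (media !== null) {
      current = { kind: media[1] ?? "", protocol: media[2] ?? "", codecs: [] };
      sections.push(current);
      continue;
    }
    // a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
    const rtpmap = /^a=rtpmap:\d+ (.+)$/.exec(line);
    if (rtpmap !== null && current !== null) {
      current.codecs.push(rtpmap[1] ?? "");
    }
  }
  return sections;
}

// Shortened, hand-written sample; not the output of a real browser.
const sampleSdp = [
  "v=0",
  "o=- 0 0 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

const sampleSummary = summarizeSdp(sampleSdp);
```

For the sample above, `sampleSummary` holds one audio section offering Opus and one video section offering VP8.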
JSEP and SDP
JSEP’s handling of session descriptions is simple and straightforward.
- Whenever an offer/answer exchange is needed, the initiating side creates an offer by calling the createOffer() API.
- The application optionally modifies that offer, and then uses it to set up its local config via the setLocalDescription() API.
- The offer is then sent off to the remote side over its preferred signaling mechanism (e.g., WebSockets).
- Upon receipt of that offer, the remote party installs it using the setRemoteDescription() API.
- When the call is accepted, the callee uses the createAnswer() API to generate an appropriate answer, applies it using setLocalDescription(), and sends the answer back to the initiator over the signaling channel.
- When the offerer gets that answer, it installs it using setRemoteDescription(), and initial setup is complete.
- This process can be repeated for additional offer/answer exchanges.
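The exchange described above can be sketched in code. This is only an outline under a few assumptions: `PeerLike` is a hand-written interface that mirrors the four RTCPeerConnection method names used here (so the sketch can also run outside a browser), and `Send` stands in for whatever signaling transport the application has chosen. The function names `startCall`, `acceptCall`, and `completeCall` are invented for this example.

```typescript
// Minimal stand-in for a session description (type plus SDP text).
interface Description {
  type: string;
  sdp: string;
}

// Mirrors the method names of RTCPeerConnection used in the offer/answer flow.
interface PeerLike {
  createOffer(): Promise<Description>;
  createAnswer(): Promise<Description>;
  setLocalDescription(description: Description): Promise<void>;
  setRemoteDescription(description: Description): Promise<void>;
}

// Hypothetical signaling transport: delivers a description to the other peer.
type Send = (description: Description) => Promise<void>;

// Caller: create the offer, apply it locally, then send it to the callee.
async function startCall(caller: PeerLike, sendToCallee: Send): Promise<void> {
  const offer = await caller.createOffer();
  await caller.setLocalDescription(offer);
  await sendToCallee(offer);
}

// Callee: install the received offer, create an answer, apply it locally,
// then send the answer back to the caller.
async function acceptCall(callee: PeerLike, offer: Description, sendToCaller: Send): Promise<void> {
  await callee.setRemoteDescription(offer);
  const answer = await callee.createAnswer();
  await callee.setLocalDescription(answer);
  await sendToCaller(answer);
}

// Caller: install the received answer; initial setup is then complete.
async function completeCall(caller: PeerLike, answer: Description): Promise<void> {
  await caller.setRemoteDescription(answer);
}
```

Keeping the transport behind the `Send` callback reflects the point made earlier: the offer/answer calls are fixed by the browser, while the delivery of those descriptions is up to the application.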
What is ICE?
Interactive Connectivity Establishment (ICE) is a framework that allows the web browser to connect with peers. There are many reasons why a straightforward connection from Peer A to Peer B simply won’t work. ICE needs to:
- Bypass firewalls that would prevent opening connections
- Give the device a unique address if, as in most situations, it doesn’t have a public IP address
- Relay data through a server if the router doesn’t allow a direct connection with peers
WebRTC’s ICE framework manages most of this complexity:
- Each RTCPeerConnection object contains an “ICE agent.”
- The ICE agent is responsible for gathering local IP, port tuples (candidates).
- The ICE agent is responsible for performing connectivity checks between peers.
- The ICE agent is responsible for sending connection keepalives.
Once a session description (local or remote) is set, the local ICE agent automatically begins the process of discovering all the possible candidate IP, port tuples for the local peer:
- The ICE agent queries the operating system for local IP addresses.
- If configured, the ICE agent queries an external STUN server to retrieve the public IP and port tuple of the peer.
- If configured, the ICE agent appends the TURN server as a last resort candidate. If the peer-to-peer connection fails, the data will be relayed through the specified intermediary.
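Each of the three discovery steps above produces a different kind of candidate, and the candidate line itself says which one it is: `host` for a local address, `srflx` (server reflexive) for an address learned through STUN, and `relay` for an address on a TURN server. The sketch below reads those fields from a candidate line. The names `parseCandidate` and `CANDIDATE_SOURCE` are assumptions for this example, the sample addresses are documentation placeholders, and only the leading fields of the candidate line are handled.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  transport: string;
  address: string;
  port: number;
  type: CandidateType;
}

// candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
const CANDIDATE_PATTERN =
  /^(?:a=)?candidate:\S+ \d+ (\S+) \d+ (\S+) (\d+) typ (host|srflx|prflx|relay)\b/;

// Extract the transport, address, port, and type; returns null if the line does not match.
function parseCandidate(line: string): ParsedCandidate | null {
  const match = CANDIDATE_PATTERN.exec(line.trim());
  if (match === null) return null;
  const [, transport = "", address = "", port = "", candidateType = ""] = match;
  return {
    transport: transport.toLowerCase(),
    address,
    port: Number(port),
    type: candidateType as CandidateType, // safe: the pattern only admits these four values
  };
}

// Where each candidate type comes from, in plain words.
const CANDIDATE_SOURCE: Record<CandidateType, string> = {
  host: "local interface address reported by the operating system",
  srflx: "public address learned from a STUN server",
  prflx: "address discovered during connectivity checks",
  relay: "address allocated on a TURN relay",
};
```

For example, `parseCandidate("candidate:2 1 udp 1686052607 203.0.113.7 46154 typ srflx raddr 192.168.1.5 rport 54321")` yields a `srflx` candidate, which `CANDIDATE_SOURCE` attributes to a STUN lookup.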
ICE and Signaling
ICE is part of WebRTC, but Signaling isn’t
- JSEP decouples the ICE state machine from the overall signaling state machine.
- The ICE state machine must remain in the browser, because only the browser has the necessary knowledge of candidates and other transport info.
- Through its abstraction of signaling, the JSEP approach does require the application to be aware of the signaling process.
What are STUN, NAT & TURN?
Session Traversal Utilities for NAT (STUN) is a protocol to discover your public address and determine any restrictions in your router that would prevent a direct connection with a peer. The client sends a request to a STUN server on the internet, which replies with the client’s public address and whether or not the client is accessible behind the router’s NAT.
Network Address Translation (NAT) lets devices that only have private IP addresses share a public IP address. A router has a public IP address, and every device connected to the router has a private IP address. Requests are translated from the device’s private IP to the router’s public IP with a unique port. That way each device doesn’t need its own public IP but can still be discovered on the internet.
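The translation step can be pictured with a toy model. The class below is a deliberately simplified sketch, not how any particular router works: the name `NatTable`, the starting port of 40000, and the addresses in the usage note are all assumptions for illustration.

```typescript
// Toy model of the translation table a NAT router keeps.
class NatTable {
  private readonly publicIp: string;
  private nextPort: number;
  // private "ip:port" -> public port, and the reverse lookup
  private readonly outbound = new Map<string, number>();
  private readonly inbound = new Map<number, string>();

  constructor(publicIp: string, firstPort: number = 40000) {
    this.publicIp = publicIp;
    this.nextPort = firstPort;
  }

  // Map a private source endpoint to the router's public IP and a unique port.
  // The same private endpoint always gets the same public port.
  translateOutbound(privateIp: string, privatePort: number): { ip: string; port: number } {
    const key = `${privateIp}:${privatePort}`;
    let publicPort = this.outbound.get(key);
    if (publicPort === undefined) {
      publicPort = this.nextPort++;
      this.outbound.set(key, publicPort);
      this.inbound.set(publicPort, key);
    }
    return { ip: this.publicIp, port: publicPort };
  }

  // Find the private endpoint behind a public port, or null if none is mapped.
  translateInbound(publicPort: number): string | null {
    return this.inbound.get(publicPort) ?? null;
  }
}
```

With `new NatTable("203.0.113.7")`, two devices at 192.168.1.5 and 192.168.1.6 that both send from port 5000 would appear on the internet as 203.0.113.7 with two different ports, which is the public address and port a STUN server would report back to each of them.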
Some routers will have restrictions on who can connect to devices on the network. This can mean that even though we have the public IP address found by the STUN server, not just anyone can create a connection. In this situation we need to turn to TURN. Some routers using NAT employ a restriction called ‘Symmetric NAT’. This means the router will only accept connections from peers you’ve previously connected to.
Traversal Using Relays around NAT (TURN) is meant to bypass the Symmetric NAT restriction by opening a connection with a TURN server and relaying all information through that server. You would create a connection with a TURN server and tell all peers to send packets to the server, which will then forward them to you. This obviously comes with some overhead, so it is only used if there are no other alternatives.
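In the browser, STUN and TURN servers are handed to the connection as configuration. Here is a minimal sketch of building that configuration. The interfaces are hand-written copies of the few fields used (`iceServers`, `urls`, `username`, `credential`, `iceTransportPolicy`), the helper `buildPeerConfig` is invented for this example, and the server hostnames and credentials are placeholders, not real services.

```typescript
// Hand-written copies of the configuration fields used in this sketch.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServer[];
  // "all" lets ICE try direct paths first; "relay" forces traffic through TURN.
  iceTransportPolicy: "all" | "relay";
}

interface TurnAccount {
  url: string;
  username: string;
  credential: string;
}

// Build a configuration with a STUN server and, optionally, a TURN fallback.
function buildPeerConfig(stunUrl: string, turn?: TurnAccount, relayOnly: boolean = false): PeerConfig {
  if (relayOnly && turn === undefined) {
    throw new Error("relayOnly requires a TURN server");
  }
  const iceServers: IceServer[] = [{ urls: stunUrl }];
  if (turn !== undefined) {
    iceServers.push({ urls: turn.url, username: turn.username, credential: turn.credential });
  }
  return { iceServers, iceTransportPolicy: relayOnly ? "relay" : "all" };
}

// Placeholder hosts and credentials for illustration only.
const exampleConfig = buildPeerConfig("stun:stun.example.org:3478", {
  url: "turn:turn.example.org:3478",
  username: "demo-user",
  credential: "demo-secret",
});
// In a browser this object would be passed along as: new RTCPeerConnection(exampleConfig)
```

Leaving the policy at "all" matches the text above: the relay is listed as an option, and ICE only falls back to it when no direct path works.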
Below is the summary of the above:
PS: Images used in this post are copied from the internet from one of the above links. I don’t intend to violate any copyright laws; this blog post is a compilation of my notes for my upcoming workshop.