Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S. Wahby, and Keith Winstein. “Salsify: Low-Latency Network Video through Tighter Integration between a Video Codec and a Transport Protocol.” In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’18), pp. 267–282. USENIX Association, 2018.
It's just the HTML template! They all look like this. We promise, this is an academic research project at a university. The code is open-source, and the paper and raw data are open-access. The hope is that these ideas will influence the industry and lead to better real-time video for everybody.
Today's real-time video apps are built out of two separate components: a “video codec” that compresses video, and a “transport protocol” that transmits packets of data and estimates how many can be sent without overloading the network. These components are designed and built separately, often by different companies, then combined into an overall program such as Skype or FaceTime.
Each component has its own control loop. The transport protocol has a “congestion-control” algorithm that tries to figure out how fast the network is, and the codec has its own “rate-control” algorithm that tries to make the compressed video match what the transport protocol is telling it. We found that these dueling control loops—which are found in Skype, FaceTime, Hangouts, and the WebRTC reference implementation—can yield suboptimal results over unpredictable networks.
Salsify’s main contribution is a single control loop that jointly handles the frame-by-frame control of compression and the packet-by-packet control of transmission. This lets the video stream track the network’s varying capacity, avoiding stalls (see the video above).
The results are in the paper, and the raw data are available on GitHub, but bottom line, on average across our network tests, compared with Skype, FaceTime, Hangouts, and Chrome's WebRTC with and without VP9-SVC, Salsify reduced delay (at the 95th percentile) by 4.6×, while also improving SSIM by about 60% (2.1 dB).
You probably don’t want to, at least not yet. The implementation is only for Linux and doesn't have audio. The value is mostly in demonstrating the benefit of Salsify’s design and in allowing other researchers (and industry) to replicate our results, with the intent that Salsify’s ideas will be incorporated by others and improve the quality of real-time video in many applications.
We played the same video through Salsify’s sender, and the sender programs for Skype, FaceTime (on a Mac), Google Hangouts, and WebRTC in Google Chrome. We allowed the sender to transmit over an emulated network to its receiver, captured the receiver’s video, and calculated the delay and quality of each frame. We tested different kinds of emulated networks: LTE, slower cellular links, and long-distance Wi-Fi.
We built a prototype implementation of Salsify, including a video-aware transport protocol, a VP8 video codec with the ability to save/restore internal state, and a unified control loop. We then compared it end-to-end with Skype, FaceTime, Hangouts, and WebRTC (as implemented in Google Chrome, with and without scalable/layered video coding). We emulated a hardware webcam and played a reproducible 720p60 test video through each program, with each frame tagged with a 2D barcode. We timestamped each frame on exit, using the hardware clock on a Blackmagic Decklink card. The unmodified sender transmitted over an emulated time-varying network connection (using cellsim/mahimahi), to an unmodified receiver. We ran the receiver program fullscreen on an HDMI output, which we captured and timestamped using the same Blackmagic hardware clock that timestamped the frame when emitted. We recognized the 2D barcode (which was still recognizable even after lossy video compression), and computed the delay and quality (structural similarity, or SSIM) of each frame. To the best of our knowledge, this is the first dataset giving an end-to-end frame-by-frame measurement of the quality and delay of a diverse array of commercial and research codebases for real-time Internet video.
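The last step of the pipeline—matching each barcoded frame’s exit timestamp to its capture timestamp—amounts to a simple join. Here is a minimal sketch (the function name and dict-based representation are illustrative, not the actual analysis code, which lives in the repository on GitHub):

```python
def per_frame_delay(sent, received):
    """Match frames by barcode ID and compute per-frame delay in seconds.

    sent, received: dicts mapping barcode ID -> hardware-clock timestamp.
    A sent frame that never appears on the receiver's display simply
    contributes no delay sample.
    """
    return {frame_id: received[frame_id] - t_sent
            for frame_id, t_sent in sent.items()
            if frame_id in received}
```

Because both timestamps come from the same Blackmagic hardware clock, the subtraction needs no clock-synchronization step.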
They certainly have sophisticated mechanisms at their disposal; video transmission over a lossy channel is a very well-studied problem. However, when we did the measurement carefully, we found that these mechanisms don't always work that well in practice, and Salsify's (much simpler, functional) approach works as well or better (see the video above, or Figure 6(d) of the paper).
There may not be an academically interesting reason for this—perhaps commercial systems just have a large toolbox of loss-recovery strategies but aren't well-tuned about when to use each one, at least not for the network conditions that we evaluated. Salsify’s main innovation is in jointly controlling codec rate-control with transport congestion-control, not in inventing a better mechanism for loss recovery, so we don't want to overstress this point. However, we do think that our evaluation may be the first end-to-end measurement of frame-by-frame video quality and delay across a variety of current commercial and research codebases, and to the extent the findings show surprising weaknesses in the commercial status quo, the lesson may be that more such measurements are in order.
Current video codecs, including ours, cannot reliably predict the compressed size of an individual video frame in advance. Instead, codecs generally try to control the “bitrate” over several frames (e.g., through a VBV constraint). If the encoder makes a frame that's too big compared with what the transport protocol thinks the network can accommodate, the application will generally send it anyway, and then, knowing it has probably just caused congestion, it might pause input to the video encoder for a little bit to allow that congestion to clear.
This is not a great strategy, because the network still gets congested. A better plan, if the encoder makes a compressed frame that's too big, would be to tell the encoder to try again—either by re-encoding the same frame with lower quality settings, or by plucking a new frame off the camera and encoding that more-recent frame—with quality settings more likely to produce a compressed frame that won't congest the network. In effect, rather than having a fixed frame rate, it would be better to send frames only at times when they won't provoke packet loss and queueing delay.
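One step of the “try again” strategy above can be sketched as follows (a toy illustration with made-up names—`encode`, the quality scale, and the adjustment rule are not Salsify’s actual API):

```python
def choose_and_send(frame, encode, budget, quality):
    """Encode `frame` at `quality`; release it only if it fits `budget`.

    encode(frame, quality) -> compressed bytes  (hypothetical signature)
    budget: bytes the transport believes the network can absorb right now
    Returns (bytes_to_send_or_None, quality_for_next_attempt).
    """
    compressed = encode(frame, quality)
    if len(compressed) <= budget:
        # Fits: send it, and probe slightly higher quality next time.
        return compressed, min(1.0, quality + 0.1)
    # Too big: discard rather than congest the network, and retry
    # later (ideally with a fresher frame) at lower quality.
    return None, max(0.1, quality - 0.2)
```

The key departure from a fixed frame rate is the `None` branch: an oversized frame is simply never sent.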
Salsify uses a purely functional video codec to achieve this. Our video codec is 100% conformant with the Google VP8 format, but has one major trick: the encode and decode functions are purely functional, with no side effects, and represent the inter-frame “state” of the decoder explicitly. This allows Salsify to make the encoder compress a frame relative to an arbitrary decoder state, allowing the application to safely skip frames on output from the encoder, not just on input.
Salsify's codec allows the sender to send frames when it's pretty sure they won't congest the network (discarding already-encoded frames if necessary) rather than sticking to a fixed frame rate. It also allows the codec to produce frames that more closely match the available network capacity, by simply producing two versions of each frame: one with slightly higher quality than the previous successful option, and one with slightly lower quality. The application chooses from among these options (or no option) after seeing the actual compressed size of each option.
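The two-version scheme can be pictured like this (a minimal sketch with hypothetical names; the real implementation is built on our VP8 codec and varies the quantizer, not an abstract “quality” number):

```python
def pick_frame(frame, encode, budget, last_quality):
    """Encode the same frame at two quality levels; choose after the fact.

    encode(frame, quality) -> compressed bytes  (hypothetical signature)
    Returns (compressed_or_None, quality_used_or_None).
    """
    higher = last_quality * 1.1   # slightly better than last success
    lower = last_quality * 0.9    # slightly worse than last success
    options = [(encode(frame, q), q) for q in (higher, lower)]
    # Prefer the best-quality version whose *actual* compressed size fits...
    for compressed, q in options:
        if len(compressed) <= budget:
            return compressed, q
    # ...or send nothing at all if both versions would congest the network.
    return None, None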
Maybe! If mainstream video encoders could allow the application to discard already-encoded frames, and if they could accurately hit a frame-size target, then Salsify's purely functional video codec would probably not be necessary. The first may be achievable with smart use of reference invalidation (although this is not clear—what about ancillary inter-frame state stored in probability tables?). The second may be achievable with more work (and more cycles) put into intra-frame rate control. So, it's possible. But it's a heck of a lot easier when the codec exposes a way to save and restore its internal state.
No, that’s pretty much all we mean by it. If you have that, then the decoder can be described as a function that takes in a state and a compressed frame, and produces a modified state and an image for display. The encoder can be described as a function that takes in a state, an image, and some sort of quality settings, and produces a modified state and a compressed frame.
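In code, those two signatures might look like this (a toy model: the state, “decoding,” and “encoding” here are placeholders, not our VP8 implementation, and the real state holds reference frames and probability tables):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderState:
    """Immutable stand-in for the inter-frame state that real codecs
    keep hidden (reference frames, probability tables, etc.)."""
    frame_count: int = 0

def decode(state, compressed_frame):
    """Pure function: (state, compressed frame) -> (new state, image)."""
    image = compressed_frame.upper()   # toy "decompression"
    return DecoderState(state.frame_count + 1), image

def encode(state, image, quality):
    """Pure function: (state, image, quality) -> (new state, frame).
    `quality` is unused in this toy, but real settings go here."""
    compressed = image.lower()         # toy "compression"
    return DecoderState(state.frame_count + 1), compressed
```

Because states are plain values, the sender can hold on to an old state and encode a new frame relative to it—which is what lets Salsify discard an already-encoded frame without desynchronizing the decoder.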
Not really—it requires a codec that can export and import its internal state, or can otherwise accurately hit a target size for an individual compressed frame and can “cancel” an already-encoded frame (including its effect on references and probability tables). Whether the codec is implemented in software or hardware isn’t important to Salsify.
Because it’s not a fair fight between codecs—Salsify is a different way of putting the pieces (compression and transmission) together, not a new compression format. To the extent these results persuade you of anything, they should make you think that in the domain of Internet videoconferencing, further innovation in video codecs may have reached the point of diminishing returns, but video systems still have lots of low-hanging fruit.
Not exactly, at least not in the way the term is typically used to apply to systems like Apex or SoftCast. These systems reach into the physical layer and jointly handle “source coding” (video compression) and “channel coding” (error correction coding and modulation). By contrast, Salsify is a conventional Internet application that sends UDP datagrams point-to-point over the Internet, just like Skype, FaceTime, and WebRTC (and almost like Hangouts, which sends UDP datagrams through a nearby Google CDN node).
Really just the overall architecture. The compressed video format is Google’s VP8, finalized in 2008 and largely superseded by VP9 and H.265. Salsify’s purely functional VP8 encoder/decoder is a lightly modified version of the one we used last year for ExCamera, when the benefit was in allowing us to subdivide video encoding into tiny threads (smaller than the interval between key frames) and parallelize across thousands of threads on AWS Lambda. This year, the benefit is in allowing Salsify to explore an execution path of the video codec without committing to it. Salsify's congestion-control scheme is based on Sprout-EWMA, which in turn is based on earlier work in packet-pair and packet-train estimation of available bandwidth. Salsify's loss-recovery strategy is related to Mosh’s “p-retransmissions.”
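At its core, a Sprout-EWMA-style capacity signal is just an exponentially weighted moving average of observed throughput. A simplified sketch (the real scheme also reasons about packets queued in the network, which this omits):

```python
def ewma_throughput(samples, alpha=0.1):
    """Fold per-interval throughput samples (bytes/sec) into a smoothed
    estimate: est = alpha * sample + (1 - alpha) * est."""
    est = samples[0]
    for s in samples[1:]:
        est = alpha * s + (1 - alpha) * est
    return est
```

A small `alpha` makes the estimate stable on steady links; a larger one tracks abrupt capacity changes faster.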
What’s new, then, is the way the pieces are put together: the joint control of codec rate-control and transport congestion-control, and the use of a functional video codec to send encoded frames only at moments when the network can accommodate them.
Legally, sure, we'd love that, and we’re eager to help. Technically, it may not be so easy. It would be a lot simpler if Salsify were just a better video codec, or a better transport protocol. Because Salsify is instead a better way of putting the pieces together, we expect it will be harder to retrofit into an existing application without significant refactoring. In our conversations with industry, we’ve found that the burden of proof will be high to demonstrate (1) that Salsify’s gains are real, and (2) that they can’t be achieved with less-intrusive surgery to existing applications.
Standardize an interface to export and import the encoder’s and decoder’s internal state between frames! Even if the format of that state is opaque and unstandardized across codecs, that’s okay. Last year, we demonstrated how doing so can allow fine-grained parallelization of video encoding. This year, it’s letting Salsify explore execution paths of the video codec without committing to them (to match each frame’s compressed size to the network capacity, and skip already-encoded frames without penalty). We can’t prove that a save/restore interface is strictly necessary to achieve this performance, but it makes things a heck of a lot easier and simpler. This should be a requirement for all codecs going forward: exposing the encoder and decoder only as stream operators with inaccessible, mutable, internal state is terribly limiting and inconvenient.
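Concretely, even an otherwise conventional stateful encoder could grow the proposed interface. A toy sketch (the class, method names, and opaque-blob format are all hypothetical, not any real codec’s API):

```python
class StatefulEncoder:
    """Toy stateful encoder standing in for a real codec library."""

    def __init__(self):
        self._frames_seen = 0  # stand-in for hidden inter-frame state

    def encode(self, image):
        self._frames_seen += 1
        return f"frame{self._frames_seen}:{image}"

    # The proposed additions: export/import of internal state.
    # The blob may be entirely opaque to the application.
    def export_state(self):
        return self._frames_seen

    def import_state(self, blob):
        self._frames_seen = blob
```

With `export_state` before encoding and `import_state` after, an application can “cancel” a frame it decides not to send and re-encode from the same point, and the decoder is none the wiser.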
It's not a very interesting reason. Salsify comes from an older project called “Alfalfa,” named for its use of Application Layer Framing (ALF) in video. Alfalfa gave way to Sprout, a congestion-control scheme intended for real-time applications, and now to Salsify, a new design where congestion control (in the transport protocol) and rate control (in the video codec) are jointly managed. Alfalfa, Sprout, and Salsify are all plantish foods.
Salsify is led by Sadjad Fouladi, a doctoral student in computer science at Stanford University, along with fellow Stanford students John Emmons, Emre Orbay, and Riad S. Wahby, as well as Catherine Wu, a junior at Saratoga High School in Saratoga, California. The project is advised by Keith Winstein, an assistant professor of computer science.
Salsify was funded by the National Science Foundation and the Defense Advanced Research Projects Agency (DARPA). Salsify has also received support from Google, Huawei, VMware, Dropbox, Facebook, and the Stanford Platform Lab.