Pantheon

Pantheon: Questions and Answers

Q: What is the purpose of the Pantheon?
A: New Internet congestion-control schemes from the academic community often have to reinvent the wheel in their evaluations. We saw this in Sprout (NSDI 2013), Verus (SIGCOMM 2015), PCC (NSDI 2015), Copa (NSDI 2018), Vivace (NSDI 2018), etc. All of these academic groups had to develop an experimental testbed and cultivate a group of runnable comparator schemes (TCP Cubic, Vegas, etc.) to compare against. Meanwhile, schemes from organizations like Google (e.g. BBR) are evaluated on billions of real-world flows—resources few academic groups can match.

The Pantheon is a community evaluation platform that reduces the need for scheme designers to reinvent this wheel. We package 17 different congestion-control schemes into one repository, all of them continuously verified to compile and run by a continuous integration system. These schemes all use the developers' original implementations (via submodule reference), and each one is wrapped with a simple Python driver that exposes the same interface for each scheme (essentially: start a full-throttle flow, and stop the flow). These schemes can be used as comparators for any congestion-control evaluation. We welcome further contributions from the academic community—just send a pull request pointing to your Git repository and add one Python wrapper.

In addition, we host a testbed of measurement nodes around the world to evaluate the Pantheon schemes. Some are on LTE and other networks in different countries (the USA, Colombia, Brazil, India, China, Mexico), and some are in cloud datacenters belonging to AWS and GCE. Every few days, we run each of the Pantheon's congestion-control schemes in a variety of workloads (single flow, multiple flows) across a variety of network paths between these nodes. The results, including raw packet traces, are publicly archived on this website and can be used by anyone.

Q: Which congestion-control schemes are in the Pantheon?
A: Currently the Pantheon includes the following schemes. Each is included by reference to its original implementation. The submodule references are in the third_party directory, and the corresponding Python wrappers are in the src directory.

TCP Cubic (Linux default)
TCP Vegas
TCP BBR
QUIC
WebRTC
LEDBAT
PCC
Verus
SCReAM
the computer-generated 2014 "100x" Tao RemyCC scheme
Sprout
Copa
Vivace
Indigo
FillP (WIP)

Q: Where are the test nodes?
A: Six nodes have both wired and cellular connections: in Stanford (USA), Guadalajara (Mexico), São Paulo (Brazil), Bogotá (Colombia), New Delhi (India), and Beijing (China). The non-U.S. machines are in commercial colocation facilities in each country. These communicate (over their wired and cellular connections, in both the uplink and downlink directions) with AWS EC2 nodes in the nearest EC2 datacenters. In addition, we have nodes in GCE datacenters in London (UK), Iowa (USA), Tokyo (Japan), and Sydney (Australia). Stanford pays for the cellular and wired connectivity in these locations.

Q: What measurements are done on a regular basis?
A: The Pantheon performs several types of measurements on a roughly weekly basis. All measurements run a particular congestion-control scheme between two endpoints, measuring the departure time of each IP datagram (at the sender) and the arrival time of the same IP datagram (at the receiver), if it arrives. These raw logs are available for each measurement. For each scheme, we also calculate and plot aggregate statistics, e.g., the throughput, one-way delay (95th percentile), loss rate, etc.

For each scheme, we measure a variety of workloads:

Single-flow tests: one flow runs, full-throttle, for 30 seconds
Multiple-flow tests: three flows run, full-throttle, for 30, 20, and 10 seconds respectively

...over a variety of network paths:

Colocated node to nearby AWS region, via cellular network (uplink and downlink), repeated 3 times per scheme in a round-robin fashion
Colocated node to nearby AWS region, via wired network (uplink and downlink), 10 times per scheme, round-robin
Google Cloud Engine node to a GCE node in a different region, via wired network (both directions), 10 times per scheme, round-robin
Network emulators calibrated to match a real network path, 10 times per scheme
Network emulators designed to exhibit a certain pathological behavior, 10 times per scheme

Q: How do I interpret the plots?
A: Each results page includes two plots, summarizing the aggregate statistics of the evaluation. The first is a scatter plot showing the results each individual run (3x or 10x per scheme):

On this plot, the "best" schemes are in the upper-right corner. The best throughput is at the top of the plot, and the best one-way delay is at the right-hand side of the plot. Each result is shown individually, giving an indication of the amount of variation during the run. The schemes are run in round-robin fashion to make sure they are evaluated as fairly as possible in the presence of path variability.

The second plot shows an average of each scheme's performance and is otherwise the same:

Q: How do I interpret the raw logs?
A: Each archive of raw logs contains three classes of files:

pantheon_metadata.json: metadata containing tested schemes, number of flows, etc.
<cc>_stats_run<ID>.log: contains the start and end times of the experiment, and clock offsets.
<cc>_datalink_run<ID>.log/<cc>_acklink_run<ID>.log: packet logs on the datalink/acklink with each line in one of the two types:

(ingress) <packet entry time in milliseconds> + <packet size in bytes> <flow ID>
(egress) <packet exit time in milliseconds> - <packet size in bytes> <one-way delay in milliseconds> <flow ID>

Q: I have a new scheme—will you test it for me?
A: If you are from the academic community (e.g., communities like ACM SIGCOMM/CoNEXT/MobiCom/MobiSys/HotNets, Usenix NSDI, or IETF/IRTF groups like TCPM, RMCAT, or ICCRG), and your scheme behaves reasonably in emulation, yes! Please refer to the README, and especially the "How to add your own congestion control" section. As soon as you submit a pull request, the Travis-CI system will automatically verify that your scheme compiles and runs in emulation. If you have any questions, please get in touch by emailing "pantheon-stanford [at] googlegroups.com".

Q: I'd like to host a node in my location—can I?
A: Please get in touch by emailing "pantheon-stanford [at] googlegroups.com".

Q: I'd like to cite the Pantheon or its results in an upcoming academic paper. How should I cite it?
A: Please feel free to cite as, e.g.,: Francis Y. Yan, Jestin Ma, Greg Hill, Deepti Raghavan, Riad S. Wahby, Philip Levis, and Keith Winstein, Pantheon: the training ground for Internet congestion-control research, measurement at https://pantheon.stanford.edu/result/NNN

Q: What useful things do the measurements show? What can be learned from just 16 nodes and the network paths between them?
A: The Pantheon certainly doesn't have nearly the scale of a commercial website or CDN, but it is larger and more comprehensive than most academic congestion-control evaluations have been able to access (both in its coverage of international networks, especially cellular ones, and its collection of emulators, calibrated-to-real-life as well as pathological). One thing we see is that the performance of congestion-control schemes is quite variable: different schemes perform quite differently on different paths, even when the bottleneck link technology seems to be the same. For example, Winstein's Sprout scheme consistently performs well on a U.S. cellular network (where it was designed), and not as well in India or Colombia. Other schemes also demonstrate surprising (but consistent) variations in performance, captured in the published packet traces.

Q: What's a “calibrated emulator”? What's it calibrated to?
A: The Pantheon's results indicate that simple network emulators (a constant-rate bottleneck with propagation delay, random loss, and DropTail loss) can be calibrated to match the performance of real Internet paths. We define a new metric for end-to-end emulation accuracy (how well a congestion-control protocol matches its throughput and delay when running over the emulator vs. on a real path) and find that, using a Bayesian optimization search procedure, it's possible to find a single emulator that successfully causes 10+ protocols to each get the same throughput and delay (within 20% on average) as they do over the real network path. This somewhat goes against a traditional view in networking, which emphasizes the faithful emulation of mechanisms and possible failure modes (jitter, reordering, explicit entry and departure of cross traffic), and has historically lacked a figure of merit for the end-to-end fidelity of an emulator. These calibrated emulators aid training of new congestion-control schemes, because it's possible to train many variants in parallel over an emulator. Users can find results over the calibrated emulators in the Emulation tab of the “Find results” section.

Q: How accurate are the calibrated emulators at predicting the performance on real network paths?
A: On average, a congestion-control scheme's throughput and delay (when running over the emulated network path) will be within about 17% on average of the same values when running over the real path.

Q: What's a “pathological emulator”?
A: The Pantheon adopted a suggestion from Google's BBR team and regularly tests the various schemes over a series of emulators for pathological network conditions.

We run a single full-throttle flow over:
- Token-bucket based policers with bottleneck link rates of 12 Mbps, 60 Mbps, or 108 Mbps
- Paths with severe ACK aggregation on the return path, either 1 ACK every 100 milliseconds or 10 ACKs every 200 milliseconds (both paths are 12 Mbps in the forward direction)
- Paths with perverse DropTail thresholds, e.g. bottleneck buffer sizes of 1 bandwidth-delay product, BDP/2, BDP/3, or BDP/10
We also run multi-flow tests over the same emulated paths, with one full-throttle flow starting at the beginning, then a second joining after 10 seconds, then a third after a further 10 seconds:
- Token-bucket based policers with bottleneck link rates of 12 Mbps, 60 Mbps, or 108 Mbps
- Paths with severe ACK aggregation on the return path, either 1 ACK every 100 milliseconds or 10 ACKs every 200 milliseconds (both paths are 12 Mbps in the forward direction)
- Paths with perverse DropTail thresholds, e.g. bottleneck buffer sizes of 1 bandwidth-delay product, BDP/2, BDP/3, or BDP/10

Q: The results for scheme x look wrong. What version did you test?
A: The exact Git commit of each scheme is included in the PDF report (accessible via the "Full report" link) for every set of results.

Q: Why do you tunnel all of the traffic within a UDP tunnel? Does this affect the results?
A: Pantheon uses an instrumented tunnel to run and evaluate each scheme. It is essentially a virtual private network (VPN), encapsulating the original packet along with an assigned unique identifier (UID, 8 bytes) in a UDP datagram:

| IP | UDP | UID | original IP datagram |

There are three principal benefits:

Many of our cellular nodes are behind a NAT. The tunnel allows us to evaluate an arbitrary scheme in either direction (uplink or downlink), without regard for which side (sender or receiver) wants to initiate the connection.
The tunnel prepends a unique sequence number to each datagram, allowing Pantheon to measure the one-way delay of each datagram without worrying about disambiguating duplicate packets.
All packets look the same to the network infrastructure (UDP in IP), meaning the Pantheon measures and isolates the difference between different congestion-control schemes (without the possible confounding effect of different encapsulation formats on the wire).

...and two main downsides:

The Pantheon uses a smaller MTU than 1500. Some academic schemes assume an MTU of 1500 and don't perform PMTU discovery; we patch these to reduce the size of their packets.
All packets look the same to the network infrastructure (UDP in IP), meaning that Pantheon cannot evaluate the performance impact of different headers (e.g. DSCP or TOS bits, or UDP vs. TCP IP protocol types). Pantheon only evaluates Internet congestion-control schemes insofar as they decide when and how many datagrams to send.

To verify that Pantheon-tunnel does not substantially alter the performance of transport protocols, we picked three TCP schemes (Cubic, Vegas, and BBR) and ran each scheme 50 times inside and outside the tunnel for 30 seconds each time, from AWS India to India, measuring the mean throughput and 95th-percentile one-way delay of each run. For BBR running outside the tunnel, we were only able to measure the average throughput (not delay) because BBR's native performance appears to rely on TCP segmentation offloading, which prevents a precise measurement of per-packet delay without the tunnel's encapsulation.

We ran a two-sample Kolmogorov-Smirnov test for each pair of statistics (the 50 runs inside vs. outside the tunnel for each scheme’s throughput and delay). No test found a statistically significant difference below p-value < 0.2.

Q: Do you have results with cross traffic?
A: Yes— both in the sense that the real-world tests are conducted over wide-area Internet paths that are exposed to contending cross-traffic that we don't control (and the calibrated emulators are calibrated to match the same conditions and results), but also in that we run our own tests with cross-traffic flows between the same pairs of endpoints. Search for Flow Scenario: “Multiple” to see this latter set of tests.

Q: What about web-like workloads, or measurements of flow completion time?
A: Unfortunately we have implemented a least-common-denominator interface to the 17+ congestion-control schemes in the Pantheon, and the only common interface is to start or stop a full-throttle flow of each type. Most schemes do not support an abstraction like, “Run for exactly n bytes”, which limits the kinds of metrics that the Pantheon can measure.

Q: What about Wi-Fi?
A: We don't currently have any Wi-Fi-including network paths in the Pantheon but would like to add some.

Q: Who funded the Pantheon?
A: This work was supported by NSF grant CNS-1528197, DARPA grant HR0011-15-2-0047, Intel/NSF grant CPS-Security1505728, the Secure Internet of Things Project, and by Huawei (Protocol Research Lab, 2012 Labs), VMware, Google, Dropbox, Facebook, and the Stanford Platform Lab.