Q: What is the purpose of the Pantheon?
A: New Internet congestion-control schemes from the
academic community often have to reinvent the wheel in their
evaluations. We saw this in Sprout (NSDI 2013), Verus
(SIGCOMM 2015), PCC (NSDI 2015), Copa (NSDI 2018), Vivace
(NSDI 2018), etc. All of these academic groups had to
develop an experimental testbed and cultivate a group
of runnable comparator schemes (TCP Cubic, Vegas, etc.) to compare
against. Meanwhile, schemes from organizations like Google
(e.g. BBR) are evaluated on billions of real-world
flows—resources few academic groups can match.
The Pantheon is a community evaluation platform that reduces the need for scheme designers to reinvent this wheel. We package 17 different congestion-control schemes into one repository, all of them continuously verified to compile and run by a continuous-integration system. These schemes all use the developers' original implementations (via submodule reference), and each one is wrapped with a simple Python driver that exposes the same interface (essentially: start a full-throttle flow, and stop the flow). These schemes can be used as comparators for any congestion-control evaluation. We welcome further contributions from the academic community: just send a pull request pointing to your Git repository and add one Python wrapper.
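For a rough sense of what such a wrapper involves, here is a hypothetical, minimal sketch. The subcommand names are illustrative, iperf3 stands in for a real scheme's binaries, and the actual wrappers in the src directory are the authoritative reference:

    #!/usr/bin/env python
    # Hypothetical sketch of a Pantheon-style wrapper (illustrative only; the
    # real wrappers in the src directory are the authoritative interface).
    # One side runs a receiver; the other starts a full-throttle flow that
    # keeps going until the Pantheon driver terminates the wrapper.
    import argparse
    import signal
    import subprocess
    import sys

    def main():
        parser = argparse.ArgumentParser()
        sub = parser.add_subparsers(dest='command')

        recv = sub.add_parser('receiver')
        recv.add_argument('port')

        send = sub.add_parser('sender')
        send.add_argument('ip')
        send.add_argument('port')

        args = parser.parse_args()

        if args.command == 'receiver':
            # iperf3 stands in for the scheme's own receiver binary.
            cmd = ['iperf3', '-s', '-p', args.port]
        elif args.command == 'sender':
            # Full-throttle flow; the long duration is a stand-in for "run
            # until stopped" (the driver kills the wrapper well before then).
            cmd = ['iperf3', '-c', args.ip, '-p', args.port, '-t', '86400']
        else:
            parser.print_help()
            sys.exit(1)

        proc = subprocess.Popen(cmd)
        # Stopping the flow = terminating the wrapper; forward SIGTERM so the
        # underlying process shuts down cleanly.
        signal.signal(signal.SIGTERM, lambda signum, frame: proc.terminate())
        proc.wait()

    if __name__ == '__main__':
        main()

The essential contract is just this start/stop structure; everything else about a scheme stays inside its own implementation.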
In addition, we host a testbed of measurement nodes around the world to evaluate the Pantheon schemes. Some are on LTE and other networks in different countries (the USA, Colombia, Brazil, India, China, Mexico), and some are in cloud datacenters belonging to AWS and GCE. Every few days, we run each of the Pantheon's congestion-control schemes in a variety of workloads (single flow, multiple flows) across a variety of network paths between these nodes. The results, including raw packet traces, are publicly archived on this website and can be used by anyone.
Q: Which congestion-control schemes are in the Pantheon?
A: Currently the Pantheon includes the following schemes. Each is included by reference to its original implementation. The submodule references are in the third_party directory, and the corresponding Python wrappers are in the src directory.
Q: Where are the test nodes?
A: Six nodes have both wired and cellular connections:
in Stanford (USA), Guadalajara (Mexico), São Paulo
(Brazil), Bogotá (Colombia), New Delhi (India), and
Beijing (China). The non-U.S. machines are in
commercial colocation facilities in each country. These
communicate (over their wired and cellular connections,
in both the uplink and downlink directions) with AWS
EC2 nodes in the nearest EC2 datacenters. In addition,
we have nodes in GCE datacenters in London (UK), Iowa
(USA), Tokyo (Japan), and Sydney (Australia). Stanford
pays for the cellular and wired connectivity in these
locations.
Q: What measurements are done on a regular basis?
A: The Pantheon performs several types of measurements on a roughly weekly basis. All
measurements run a particular congestion-control scheme between two endpoints,
measuring the departure time of each IP datagram (at the sender) and the
arrival time of the same IP datagram (at the receiver), if it arrives. These
raw logs are available for each measurement. For each scheme, we also calculate
and plot aggregate statistics, e.g., the throughput, one-way delay (95th
percentile), loss rate, etc.
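As a rough illustration of how such statistics can be derived from per-packet departure and arrival times, here is a simplified sketch; the record layout is an assumption, and this is not the Pantheon's own analysis code:

    # Simplified sketch: aggregate statistics from per-packet records.
    # Each record is (departure_ms, arrival_ms, size_bytes), with arrival_ms
    # set to None for packets that never arrived. One-way delays assume the
    # two endpoints' clocks are synchronized.

    def aggregate_stats(records):
        delivered = [(dep, arr, size) for dep, arr, size in records
                     if arr is not None]
        delays = sorted(arr - dep for dep, arr, _ in delivered)

        duration_s = (max(arr for _, arr, _ in delivered) -
                      min(dep for dep, _, _ in records)) / 1000.0
        delivered_bits = 8 * sum(size for _, _, size in delivered)

        return {
            'throughput_mbps': delivered_bits / duration_s / 1e6,
            # simple nearest-rank 95th percentile
            'p95_one_way_delay_ms': delays[int(0.95 * (len(delays) - 1))],
            'loss_rate': 1.0 - float(len(delivered)) / len(records),
        }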
For each scheme, we measure a variety of workloads:
Q: How do I interpret the plots?
A: Each results page includes two plots, summarizing
the aggregate statistics of the evaluation. The first
is a scatter plot showing the results of each individual run
(3x or 10x per scheme):
On this plot, the "best" schemes are in the upper-right corner. The best throughput is at the top of the plot, and the best one-way delay is at the right-hand side of the plot. Each result is shown individually, giving an indication of the amount of variation during the run. The schemes are run in round-robin fashion to make sure they are evaluated as fairly as possible in the presence of path variability.
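For readers who want to produce a similar scatter plot from their own statistics, a minimal matplotlib sketch of this axis convention might look like the following (the numbers are made up; this is not the site's plotting code):

    # Minimal sketch of the scatter plot's axis convention (made-up data).
    import matplotlib.pyplot as plt

    # scheme -> list of (95th-percentile one-way delay in ms, throughput in
    # Mbit/s), one tuple per run
    runs = {
        'scheme_a': [(45, 8.2), (50, 7.9), (48, 8.0)],
        'scheme_b': [(120, 9.5), (130, 9.1), (125, 9.3)],
    }

    for name, points in runs.items():
        delays, throughputs = zip(*points)
        plt.scatter(delays, throughputs, label=name)

    plt.gca().invert_xaxis()  # lower (better) delay appears toward the right
    plt.xlabel('95th percentile one-way delay (ms)')
    plt.ylabel('Average throughput (Mbit/s)')
    plt.legend()
    plt.show()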
The second plot shows an average of each scheme's performance and is otherwise the same:
Q: How do I interpret the raw logs?
A: Each archive of raw logs contains three classes of files:
(ingress) <packet entry time in milliseconds> + <packet size in bytes> <flow ID>
(egress)  <packet exit time in milliseconds> - <packet size in bytes> <one-way delay in milliseconds> <flow ID>
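For example, a simple parser for these records might look like the following (a hypothetical sketch based only on the formats shown above; the file name is a placeholder):

    # Hypothetical sketch: parse the per-packet log records described above.
    # '+' marks a packet entering the tunnel (ingress); '-' marks it leaving
    # (egress), along with its one-way delay.

    def parse_log_line(line):
        fields = line.split()
        if len(fields) >= 2 and fields[1] == '+':
            # <entry time ms> + <size bytes> <flow ID>
            return {'event': 'ingress', 'time_ms': float(fields[0]),
                    'size_bytes': int(fields[2]), 'flow_id': int(fields[3])}
        if len(fields) >= 2 and fields[1] == '-':
            # <exit time ms> - <size bytes> <one-way delay ms> <flow ID>
            return {'event': 'egress', 'time_ms': float(fields[0]),
                    'size_bytes': int(fields[2]), 'delay_ms': float(fields[3]),
                    'flow_id': int(fields[4])}
        return None  # header, comment, or unrelated line

    with open('tunnel.log') as f:  # placeholder file name
        records = [r for r in (parse_log_line(line) for line in f) if r]

From records like these one can compute, for example, throughput from the egress bytes, loss rate by comparing ingress and egress counts, and delay percentiles from the egress delays.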
Q: I have a new scheme—will you test it for me?
A: If you are from the academic community (e.g.,
communities like ACM SIGCOMM/CoNEXT/MobiCom/MobiSys/HotNets,
Usenix NSDI, or IETF/IRTF groups like TCPM, RMCAT, or
ICCRG), and your scheme behaves reasonably in emulation,
yes! Please refer to
the README,
and especially the "How to add your own congestion control"
section. As soon as you submit a pull request, the Travis-CI
system will automatically verify that your scheme compiles
and runs in emulation. If you have any questions, please get
in touch by emailing "pantheon-stanford [at] googlegroups.com".
Q: I'd like to host a node in my location—can I?
A: Please get in touch by emailing "pantheon-stanford [at] googlegroups.com".
Q: I'd like to cite the Pantheon or its results in an upcoming
academic paper. How should I cite it?
A: Please cite it as, e.g.: Francis Y. Yan, Jestin Ma, Greg Hill, Deepti Raghavan, Riad S. Wahby, Philip Levis, and Keith Winstein, "Pantheon: the training ground for Internet congestion-control research," USENIX Annual Technical Conference (ATC), 2018. Measurement at https://pantheon.stanford.edu/result/NNN
Q: What useful things do the measurements show? What can be learned from just 16 nodes and the network paths between them?
A: The Pantheon certainly doesn't have nearly the scale of a
commercial website or CDN, but it is larger and more comprehensive
than what most academic congestion-control evaluations have been able to
access (both in its coverage of international networks, especially
cellular ones, and its collection of emulators,
calibrated-to-real-life as well as pathological). One thing we see is
that the performance of congestion-control schemes is quite variable:
different schemes perform quite differently on different paths, even
when the bottleneck link technology seems to be the same. For example,
Winstein's Sprout scheme consistently performs well on a U.S. cellular network (where it was designed), and not as well in India or Colombia. Other
schemes also demonstrate surprising (but consistent) variations in
performance, captured in the published packet traces.
Q: What's a “calibrated emulator”? What's it calibrated to?
A: The Pantheon's results indicate that simple network
emulators (a constant-rate bottleneck with propagation delay, random
loss, and a DropTail queue) can be calibrated to match the performance of
real Internet paths. We define a new metric for end-to-end emulation
accuracy (how well a congestion-control protocol matches its
throughput and delay when running over the emulator vs. on a real
path) and find that, using a Bayesian optimization search procedure,
it's possible to find a single emulator that successfully causes 10+
protocols to each get the same throughput and delay (within 20% on
average) as they do over the real network path. This somewhat goes
against a traditional view in networking, which emphasizes the
faithful emulation of mechanisms and possible failure modes
(jitter, reordering, explicit entry and departure of cross traffic),
and has historically lacked a figure of merit for the end-to-end fidelity
of an emulator. These calibrated emulators aid training of new
congestion-control schemes, because it's possible to train many
variants in parallel over an emulator. Users can find results over the calibrated emulators in the Emulation tab of the “Find results” section.
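As a concrete (though simplified) illustration of the accuracy metric described above, here is a sketch of one way to score an emulator against a real path; the exact formulation and the Bayesian-optimization machinery used for calibration are not shown and may differ in detail:

    # Illustrative emulation-accuracy objective: the mean relative difference,
    # across schemes, between each scheme's (throughput, delay) over the
    # emulator and over the real path. Lower is better; calibration searches
    # emulator parameters (bottleneck rate, propagation delay, loss rate,
    # queue size) to minimize it, e.g. with a Bayesian-optimization library.

    def emulation_error(real, emulated):
        """Both arguments map scheme name -> (throughput_mbps, p95_delay_ms)."""
        errors = []
        for scheme, (real_tput, real_delay) in real.items():
            emu_tput, emu_delay = emulated[scheme]
            errors.append(abs(emu_tput - real_tput) / real_tput)
            errors.append(abs(emu_delay - real_delay) / real_delay)
        return sum(errors) / len(errors)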
Q: How accurate are the calibrated emulators at predicting the performance on real network paths?
A: On average, a congestion-control scheme's throughput and delay when running over the emulated network path are within about 17% of the same values when running over the real path.
Q: What's a “pathological emulator”?
A: The Pantheon adopted a suggestion from Google's BBR team and
regularly tests the various schemes over a series of emulators for
pathological network conditions.
Q: The results for scheme x look wrong. What version did you test?
A: The exact Git commit of each scheme is included in the PDF report (accessible via the "Full report" link) for every set of results.
Q: Why do you tunnel all of the traffic within a UDP tunnel? Does this affect the results?
A: The Pantheon uses an instrumented tunnel to run and evaluate each scheme. It is essentially a virtual private network (VPN), encapsulating the original packet along with an assigned unique identifier (UID, 8 bytes) in a UDP datagram:
| IP | UDP | UID | original IP datagram |
There are three principal benefits:
To verify that Pantheon-tunnel does not substantially alter the performance of transport protocols, we picked three TCP schemes (Cubic, Vegas, and BBR) and ran each scheme 50 times inside and outside the tunnel for 30 seconds each time, from AWS India to our node in India, measuring the mean throughput and 95th-percentile one-way delay of each run. For BBR running outside the tunnel, we were only able to measure the average throughput (not the delay), because BBR's native performance appears to rely on TCP segmentation offloading, which prevents a precise measurement of per-packet delay without the tunnel's encapsulation.
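To make the encapsulation concrete, here is a hypothetical sketch of packing and unpacking the UDP payload described above (illustrative only; the real pantheon-tunnel implementation handles the actual packet capture, logging, and more):

    # Hypothetical sketch of the tunnel's encapsulation (illustrative only;
    # see the real pantheon-tunnel code for the actual implementation).
    import struct

    def encapsulate(uid, ip_datagram):
        # UDP payload = | UID (8 bytes, network byte order) | original IP datagram |
        return struct.pack('!Q', uid) + ip_datagram

    def decapsulate(udp_payload):
        (uid,) = struct.unpack('!Q', udp_payload[:8])
        return uid, udp_payload[8:]

Among other things, a per-packet identifier of this kind makes it possible to match the sender-side and receiver-side logs packet by packet, regardless of what the inner protocol does.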
Q: Do you have results with cross traffic?
A: Yes, in two senses. First, the real-world tests are conducted over wide-area Internet paths that are exposed to contending cross traffic we don't control (and the calibrated emulators are calibrated to match the same conditions and results). Second, we run our own tests with cross-traffic flows between the same pairs of endpoints. Search for Flow Scenario: “Multiple” to see this latter set of tests.
Q: What about web-like workloads, or measurements of flow completion time?
A: Unfortunately, the Pantheon implements a least-common-denominator interface to its 17+ congestion-control schemes, and the only interface they all share is starting and stopping a full-throttle flow. Most schemes do not support an abstraction like “run for exactly n bytes”, which limits the kinds of metrics the Pantheon can measure.
Q: What about Wi-Fi?
A: We don't currently have any network paths that include Wi-Fi in the Pantheon, but we would like to add some.
Q: Who funded the Pantheon?
A: This work was supported by NSF grant CNS-1528197, DARPA grant
HR0011-15-2-0047, Intel/NSF grant CPS-Security 1505728, the Secure Internet of
Things Project, and by Huawei (Protocol Research Lab, 2012 Labs),
VMware, Google, Dropbox, Facebook, and the Stanford Platform Lab.