[Photos by John McCormick: Kevin Shinpaugh and Glenda Scales; Jason Lockhart and Patricia Arvin; installing the Ethernet cable; preparing the floor; volunteers, mostly students, assembling the G5s for the supercomputer installation.]
A conversation in February 2003 planted a seed that resulted, nine months later, in the
third-fastest computer in the world.
Last February, Srinidhi
Varadarajan, assistant professor of computer science, and Jason Lockhart,
who directs research computing efforts for the College of Engineering
at Virginia Tech, had a hallway conversation about increasing the capacity
of the 200-node cluster being used by some 80 researchers.
Varadarajan mentioned that he and colleagues in computer science, physics, chemistry, aerospace and ocean engineering, biology, and biochemistry, four of whom are National Science Foundation (NSF) CAREER Award holders, had submitted a proposal
to the NSF for a Major Research Instrumentation grant. "My job is to support faculty and help acquire resources," Lockhart says. So a group from the College of Engineering and the university's information systems division traveled to Austin to visit with Dell, and the roller coaster ride began.
Kevin Shinpaugh, director of research/cluster computing for the university, recalls, "Dell said we could do 1,000 machines with dual processors, and the idea struck me that we could be a large supercomputer player. But I knew we would need six months to prepare the facility." He received an immediate "go" from Erv Blythe, vice president for information technology
(IT). Glenda Scales, College of Engineering assistant dean for computing,
brought together the College of Engineering, Office of the Vice Provost
for Research, IT, and others, and a financial plan was soon completed.
"Dell did not continue as a partner, but they had put our foot on the path," says Patricia Arvin, IT associate vice president. "Intel, HP, IBM, and AMD were all trying to come up with ways to work with us," says Lockhart.
But the prices were out of reach, and IBM's 970 chip would not be available in time to allow the new Virginia Tech cluster to be ranked. "We were ready to do a build-it-yourself computer cluster without a major vendor partner," Lockhart says; it would not have been the first time. Faculty, staff, and students had built the existing 200-node, 400-CPU
AMD cluster. But then, at 1 p.m. on June 23, Lockhart watched Apple announce
its new G5 at the Worldwide Developers Conference. "By 3 p.m., the campus Apple rep had put together a conference call with Pat Arvin and Apple headquarters," he says.
On June 30, Apple
flew Arvin and Varadarajan to California. "They wanted to make sure we had the expertise before they committed the first G5s to us," Lockhart says. Within the week, Apple and Virginia Tech decided the university would build a G5 supercomputer.
"It was not a turnkey solution, but it was a brand-name computer and had the preferred chip, the new IBM 970," says Arvin.
The aimed-for sustained performance of 10 teraflops and peak of 17 teraflops
(17 trillion floating-point operations per second) would significantly
improve computing resources for the university's researchers and
attract federal support as part of the national cyber-infrastructure project.
And, if the Virginia Tech team could build a terascale computer by fall,
it had a shot at being ranked by the Top500 organization as among the
fastest computers in the world.
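The peak figure follows from straightforward arithmetic. Assuming the configuration the machines shipped with, dual 2 GHz processors per node with each processor able to complete four floating-point operations per clock cycle, 1,100 nodes × 2 processors × 2 billion cycles per second × 4 operations per cycle comes to roughly 17.6 trillion operations per second, the advertised 17-teraflop peak; sustained performance on real problems is always a fraction of that.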
With little more
than $5 million to spend and only months to build the machine in time
to be ranked, the group was willing to take risks on new technology. They
ordered 1,100 G5s, which were not only untried but not yet available.
"The next critical component was the primary communications layer," says Arvin. "We chose InfiniBand technology by Mellanox because it is five times faster than the next fastest communication fabric." InfiniBand uses solid-state communication switches and new fast cable technology to link the nodes, providing the ability to transfer packets of information within a cluster at very low latency and the option to operate in parallel when solving a problem. "In fact, with InfiniBand, the entire cluster is as fast as a bus on a motherboard," says Varadarajan.
"Mellanox was new, but it was being backed by companies like Intel, Sun, and Oracle," says Varadarajan. "So I called the CEO, Eyal Waldman, and a few days later he had a team here."
Mellanox flew engineers from Israel to Apple's California headquarters to create the driver software that allows the computers to use the InfiniBand system. Work that should have taken six to nine months was done in weeks.
The second communication layer, the Cisco Gigabit Ethernet switches and cables, manages programs, such as loading software. "The Cisco gigabit allowed the needed software to be loaded in hours instead of days," says Arvin.
It also provides communication outside the cluster, such as when a part
of the cluster is put on the National Lambda Rail (NLR), where it can
be part of national computational grid projects.
"There has been a global effort to develop and refine the software for this project," says Arvin.
"A cluster is a lot of relatively simple computers," says Cal Ribbens, director of the Laboratory for Advanced Scientific Computing and Applications. "The key is how you hook them together to become a unified resource to solve big problems. Everything we can solve on the supercomputer we could solve on a single machine, but it would take months or years instead of minutes or hours."
MPI and BLAS software were loaded onto each machine. "MPI, or Message Passing Interface, is a protocol for communicating between programs that are part of the same parallel computation," Ribbens says. When a problem requires several computers, it is divided among the units. But at critical junctures, the computers share information in order to proceed with the computation. MPI allows a program to give instructions specific to a task, such as how to divide the work and at what points to communicate. MPI code will run on many machines across different platforms.
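As a concrete illustration of that pattern, the sketch below shows the shape of a small MPI program in C: each process sums its own slice of a range of numbers, then all processes combine their partial results at a single communication point. It is illustrative only, not code from the Virginia Tech cluster, and the problem size is a stand-in.

    /* A minimal sketch of the MPI pattern described above: each process works
     * on its own slice of the problem, then all processes combine results at
     * a communication point. Illustrative only; not code from the Virginia
     * Tech cluster. Compile with an MPI C compiler (mpicc) and launch with
     * mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

        /* Divide the work: each process sums its own block of integers. */
        const long n = 1000000;                 /* total problem size (a stand-in) */
        long begin = rank * (n / size);
        long end = (rank == size - 1) ? n : begin + n / size;
        double local = 0.0;
        for (long i = begin; i < end; i++)
            local += (double)i;

        /* Critical juncture: every process contributes its partial sum and
         * receives the combined total. */
        double total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 0..%ld = %.0f\n", n - 1, total);

        MPI_Finalize();
        return 0;
    }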
D.K. Panda's research group at Ohio State University and researchers at Argonne National Laboratory were adapting MPI to run with InfiniBand's new G5 operating system driver. Varadarajan worked with both groups to optimize the MPI. In the end, the Virginia Tech group chose the Ohio State high-performance implementation for the benchmark.
BLAS stands for
basic linear algebra subprograms, which are the building blocks of many
scientific computing applications. The software performs these operations extremely quickly. Varadarajan contacted Kazushige Goto, who works at the Japanese patent office but whose hobby is optimizing mathematical operations on matrices, such as the subprograms in BLAS. "Goto is a wizard at getting the last ounce of performance from a chip using these programs," Ribbens says.
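For readers curious what "calling the BLAS" looks like, the sketch below multiplies two small matrices from C through the common CBLAS interface. DGEMM, the general matrix-matrix multiply it calls, is the routine whose speed dominates both many real applications and the Linpack benchmark. The toy sizes are illustrative, and the header name and link flags depend on which optimized BLAS library, such as Goto's, is installed.

    /* A minimal sketch of calling a BLAS routine from C through the common
     * CBLAS interface. DGEMM computes C = alpha*A*B + beta*C; the sizes here
     * are toy-scale and purely illustrative. */
    #include <cblas.h>
    #include <stdio.h>

    int main(void)
    {
        /* Two 2x2 matrices stored in row-major order. */
        double A[4] = { 1.0, 2.0,
                        3.0, 4.0 };
        double B[4] = { 5.0, 6.0,
                        7.0, 8.0 };
        double C[4] = { 0.0, 0.0,
                        0.0, 0.0 };

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,        /* M, N, K       */
                    1.0, A, 2,      /* alpha, A, lda */
                    B, 2,           /* B, ldb        */
                    0.0, C, 2);     /* beta, C, ldc  */

        printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }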
In July, Varadarajan, Ribbens, and Goto began to tune the BLAS. "There is a huge amount of code detail in the BLAS. Generally, a 1 or 2 percent performance improvement requires weeks of work," says Varadarajan. "But Goto makes binary libraries available that can improve efficiency to 80 percent. That is the equivalent of tens of millions of dollars, because otherwise you would have to buy additional machines to complete the computations that Goto has enabled the software to handle."
Goto sent Ribbens and Varadarajan code to try. "Goto bumped the efficiency to 78 percent (as of the first of September) by writing code specifically for the G5 processor," says Varadarajan. "Every week or so, he gives us a new routine for a particular matrix computation."
"The faster the BLAS, the faster most users' applications will run on the cluster," Ribbens says. Also, the faster the cluster will run Linpack, a benchmark computation that is the supercomputer version of the 100-yard dash.
The Linpack benchmark is one huge computation that will take hours even for 1,100 nodes. It tests the whole machine, the CPUs and the interconnection network, in solving a typical computational science problem. The name comes from the linear algebra software package, LINPACK, on which it is based. "Even though it's a standard computation, it is very tunable," says Ribbens. "For instance, you can make choices regarding data distribution that can make the difference between 10 teraflops and 10.2 teraflops. That is a big deal in the ranking, but won't be noticed much by most researchers. It's like the difference between 10.5 seconds and 10 seconds: a big deal to a sprinter, but not if you're trying to get to the grocery store."
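To give a sense of those tuning choices, the standard High-Performance Linpack (HPL) code commonly used for Top500 runs reads its parameters from a plain-text file, HPL.dat. The abridged sketch below shows the main knobs: the problem size N, the block size NB, and the P x Q process grid that determines how the matrix is distributed across processors. The values are placeholders for illustration, not the settings Virginia Tech actually used.

    HPLinpack benchmark input file
    ...
    1            # of problem sizes (N)
    200000       Ns      (matrix dimension; placeholder value)
    1            # of NBs
    192          NBs     (block size; placeholder value)
    1            # of process grids (P x Q)
    40           Ps
    55           Qs      (a 40 x 55 grid would mean 2,200 MPI processes)
    16.0         threshold
    ...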
Just as critical as the hardware and software to the final success of the terascale project is the facility. "You can't just empty the gym and put in 1,100 nodes," Arvin says. Virginia Tech has 10,000 square feet of raised-floor
air-conditioned computer space that has been used to house several generations
of mainframes and distributed computing facilities since the 1980s.
As it turned out, they would need only 3,000 square feet. But the power had to be upgraded from 1.5 to 3 megawatts (one megawatt is roughly enough electricity to power 500 Virginia homes). An emergency generator system and 260 tons of additional cooling capacity were also added.
The biggest challenge was the air-conditioning design. "The traditional system, which cools from under the floor, would have required nine large units and would have created 60 mph winds under the floor," says Shinpaugh. "Liebert looked at our environment and recommended their new extreme high-density system, with external chillers and copper lines overhead to rack-mounted heat exchangers with R-134A refrigerant, and an overhead chiller unit providing more than 2 million BTUs of cooling capacity." Liebert also provided the custom-built racks for the computers.
Arvin points out that the Liebert system saved a lot of space: Oak Ridge National Lab, for example, uses 40,000 square feet for its eight-megawatt computing facility. "By that rule of thumb, we would have needed 20,000 square feet," she says.
All the pieces were brought together the second week of September by teams of volunteers, most of them students, who unpacked boxes, powered up machines, installed driver cards, transferred software, and put the machines into the racks at a rate of about 50 an hour, testing them several more times along
the way. Another team connected power and communication fiber.
Within a week, the
cluster was being brought up to speed and tested. By the first of October,
Varadarajan was able to tell the IS group that Virginia Tech had a world-class
computer. It was tested by Top500 in October and the announcement was
made in November that Virginia Tech's terascale facility is third in the world — behind the Earth Simulator in Japan and the ASCI-Q (Advanced Simulation and Computing Initiative) supercomputer at Los Alamos National Laboratory.