


Originally published in the Winter 2004 Virginia Tech Research Magazine.

Material appearing in the Virginia Tech Research Magazine may be reprinted provided the endorsement of a commercial product is not stated or implied. Please credit the researchers involved and Virginia Tech.


How Virginia Tech built a supercomputer on a shoestring budget

By Susan Trulove

Related story: Why build a supercomputer?

Photos (by John McCormick): Srinidhi Varadarajan; Kevin Shinpaugh and Glenda Scales; Jason Lockhart and Patricia Arvin; installing the Ethernet cable; preparing the floor; volunteers (mostly students) assembling the G5s.

A conversation in February 2003 planted a seed that resulted, nine months later, in the third-fastest computer in the world.

Last February, Srinidhi Varadarajan, assistant professor of computer science, and Jason Lockhart, who directs research computing efforts for the College of Engineering at Virginia Tech, had a hallway conversation about increasing the capacity of the 200-node cluster being used by some 80 researchers.

Varadarajan mentioned that he and colleagues in computer science, physics, chemistry, aerospace and ocean engineering, biology, and biochemistry, four of whom are National Science Foundation (NSF) Career Award holders, had submitted a proposal to the NSF for a Major Research Instrumentation grant. “My job is to support faculty and help acquire resources,” Lockhart says. So, a group from the College of Engineering and the university’s information systems division traveled to Austin to visit with Dell, “and the roller coaster ride began.”

Kevin Shinpaugh, director of research/cluster computing for the university, recalls, “Dell said we could do 1,000 machines with dual processors and the idea struck me – we could be a large supercomputer player. But I knew we would need six months to prepare the facility.” He received an immediate “go” from Erv Blythe, vice president for information technology (IT). Glenda Scales, College of Engineering assistant dean for computing, brought together the College of Engineering, Office of the Vice Provost for Research, IT, and others, and a financial plan was soon completed.

“Dell did not continue as a partner, but they had put our foot on the path,” says Patricia Arvin, IT associate vice president. “Intel, HP, IBM, and AMD were all trying to come up with ways to work with us,” says Lockhart.

But the prices were out of reach and IBM’s 970 chip would not be available in time to allow the new Virginia Tech cluster to be ranked. “We were ready to do a build-it-yourself computer cluster without a major vendor partner,” says Arvin.

It wouldn’t be the first time. Faculty, staff, and students had built the 200-node, 400-CPU AMD cluster. But then, at 1 p.m. on June 23, Lockhart watched Apple announce its new G5 at the Worldwide Developers Conference.

“By 3 p.m., the campus Apple rep had put together a conference call with Pat Arvin and Apple headquarters,” he says.

On June 30, Apple flew Arvin and Varadarajan to California. “They wanted to make sure we had the expertise before they committed the first G5s to us,” Lockhart says. Within the week, Apple and Virginia Tech decided the university would build a G5 supercomputer.

“It was not a turnkey solution; but it was a brand-name computer and had the preferred chip – the new IBM 970,” says Arvin.

Assembling hardware for a supercomputer

The aimed-for sustained performance of 10 teraflops and peak of 17 teraflops (17 trillion floating-point operations per second) would significantly improve computing resources for the university’s researchers and attract federal support as part of the national cyber-infrastructure project. And, if the Virginia Tech team could build a terascale computer by fall, it had a shot at being ranked by the Top500 organization as among the fastest computers in the world.

With little more than $5 million to spend and only months to build the machine in time to be ranked, the group was willing to take risks on new technology. They ordered 1,100 G5s, which were not only untried but not yet available.
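The peak figure quoted above follows from simple arithmetic. As a sketch (the per-processor details below are assumptions, not stated in the article: two 2.0-GHz PowerPC 970 processors per node, each able to retire four floating-point operations per cycle via two fused multiply-add units):

```python
# Back-of-the-envelope peak performance for a 1,100-node G5 cluster.
# Assumed (not from the article): dual 2.0 GHz processors per node,
# 4 flops per cycle per processor (two fused multiply-adds).
nodes = 1100
cpus_per_node = 2
clock_hz = 2.0e9
flops_per_cycle = 4

peak_flops = nodes * cpus_per_node * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e12:.1f} teraflops")  # 17.6
```

Under those assumptions the theoretical peak comes out to roughly the 17 teraflops the team was aiming for; sustained Linpack performance is always a fraction of that peak.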

“The next critical component was the primary communications layer,” says Arvin. “We chose InfiniBand technology by Mellanox because it is five times faster than the next fastest communication fabric.”

InfiniBand combines solid-state communication switches and new fast cable technology to link the nodes, providing the ability to transfer packets of information within a cluster at very low latency and the option to operate in parallel when solving a problem. “In fact, with InfiniBand, the entire cluster is as fast as a bus on a motherboard,” says Varadarajan.

“InfiniBand was new, but it was being backed by companies like Intel, Sun, and Oracle,” says Varadarajan. “So I called the CEO, Eyal Waldman, and a few days later he had a team here.”

Mellanox flew engineers from Israel to Apple’s California headquarters in order to create the driver software that allows the computers to use the InfiniBand system. “Work that should have taken six to nine months they did in weeks,” says Varadarajan.

The second communication layer – the Cisco Gigabit Ethernet switches and cables – manages programs, such as loading software. “The Cisco gigabit allowed the needed software to be loaded in hours instead of days,” says Arvin. It also provides communication outside the cluster, such as when a part of the cluster is put on the National Lambda Rail (NLR), where it can be part of national computational grid projects.

Adapting software

“There has been a global effort to develop and refine the software for this project,” says Arvin.

“A cluster is a lot of relatively simple computers,” says Cal Ribbens, director of the Laboratory for Advanced Scientific Computing and Applications. “The key is how you hook them together to become a unified resource to solve big problems. Everything we can solve on the supercomputer we could solve on a single machine – but it would take months or years instead of minutes or hours.”

MPI and BLAS software were loaded onto each machine. MPI, or Message Passing Interface, is a protocol for communicating between programs that are part of the same parallel computation, Ribbens says. When a problem requires several computers, it is divided among the units. But at critical junctures, the computers share information in order to proceed with the computation. MPI allows instructions specific to a task, such as how to divide the work and at what points to communicate. MPI code will run on many machines across platform types.
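The divide-then-communicate pattern Ribbens describes can be sketched in a few lines. This is a toy, single-process Python illustration of the pattern, not real MPI code (an actual program would use MPI calls such as a reduce operation to combine the ranks’ partial results across nodes):

```python
# Toy sketch (not MPI itself) of the pattern described above: divide a
# computation among "ranks", let each work independently, then combine
# partial results at a synchronization point, as an MPI reduce would.
def partition(data, nranks):
    """Divide the work: give each rank a contiguous slice."""
    chunk = (len(data) + nranks - 1) // nranks
    return [data[i * chunk:(i + 1) * chunk] for i in range(nranks)]

def local_work(chunk):
    """Each rank computes on its own slice, with no communication."""
    return sum(x * x for x in chunk)

def reduce_all(partials):
    """The 'critical juncture': ranks share results to proceed."""
    return sum(partials)

data = list(range(1000))
partials = [local_work(c) for c in partition(data, nranks=4)]
print(reduce_all(partials))  # same answer as sum(x*x for x in data)
```

The point of the sketch is that the answer is identical to the single-machine computation; MPI’s job is to make the partitioning and the communication steps work across hundreds of separate nodes.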

D.K. Panda’s research group at Ohio State University and researchers at Argonne National Laboratory were adapting MPI to run over the new InfiniBand driver for the G5’s operating system. Varadarajan worked with both groups to optimize the MPI. In the end, the Virginia Tech group chose the Ohio State high-performance implementation for the benchmark.

BLAS stands for basic linear algebra subprograms, which are the building blocks of many scientific computing applications. The software performs such functions extremely quickly. Varadarajan contacted Kazushige Goto, who works at the Japanese patent office but whose hobby is to optimize mathematical operations on matrices – such as the subprograms in BLAS. “Goto is a wizard at getting the last ounce of performance from a chip using these programs,” Ribbens says.
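The workhorse BLAS routine for Linpack-style computations is GEMM, general matrix-matrix multiplication. As a sketch, here is the operation written naively in pure Python; a tuned BLAS such as Goto’s computes exactly this, but reorders and blocks the loops so the data stays in the chip’s caches and registers:

```python
# What a Level-3 BLAS routine (GEMM) computes: C = A * B.
# Naive pure-Python sketch; optimized BLAS libraries compute the same
# thing but block the loops for cache and register reuse.
def gemm(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]  # hoist the A element out of the inner loop
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The arithmetic is fixed; the “last ounce of performance” Ribbens mentions comes entirely from how the memory traffic around that arithmetic is organized.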

In July, Varadarajan, Ribbens, and Goto began to tune the BLAS. “There is a huge amount of code detail in the BLAS. Generally, a 1 or 2 percent performance improvement requires weeks of work,” says Varadarajan. “But Goto makes binary libraries available that can improve efficiency to 80 percent. That is the equivalent of tens of millions of dollars because otherwise you would have to buy additional machines to complete the computations that Goto has enabled the software to handle.”

Goto sent Ribbens and Varadarajan code to try. “Goto bumped the efficiency to 78 percent (as of the first of September) by writing codes specifically for the G5 processor,” says Varadarajan. “Every week or so, he gives us a new routine for a particular matrix computation.”

The faster the BLAS, the faster most users’ applications will run on the cluster, Ribbens says. Also, the faster the cluster will run Linpack, a benchmark computation that is the supercomputer version of the 100-yard dash.

The Linpack benchmark is one huge computation that takes hours even for 1,100 nodes. It tests the whole machine – the CPUs and the interconnection network – in solving a typical computational science problem. The name comes from the Linpack linear algebra software library. “Even though it’s a standard computation, it is very tunable,” says Ribbens. “For instance, you can make choices regarding data distribution that can make the difference between 10 teraflops and 10.2 teraflops. That is a big deal in the ranking, but won’t be noticed much by most researchers. The difference between 10.5 seconds and 10 seconds is a big deal to a sprinter, but not if you’re trying to get to the grocery store.”
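In essence, a Linpack rating is produced by solving a dense n-by-n linear system, timing it, and dividing the nominal operation count (about 2/3 n³ + 2n² floating-point operations) by the elapsed time. A minimal sketch of that procedure, not the official benchmark code, at a toy problem size:

```python
# Sketch of how a Linpack-style rating is computed (not the real HPL
# code): solve a dense n-by-n system Ax = b, time it, and divide the
# nominal flop count (2/3 n^3 + 2 n^2) by the elapsed time.
import random
import time

def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(A)
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(A[r][k]))  # pivot row
        A[k], A[piv] = A[piv], A[k]
        b[k], b[piv] = b[piv], b[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

random.seed(0)
n = 100
A = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]
t0 = time.perf_counter()
x = solve([row[:] for row in A], b[:])  # work on copies; keep A, b for checks
elapsed = time.perf_counter() - t0
flops = 2 / 3 * n ** 3 + 2 * n ** 2
print(f"{flops / elapsed / 1e6:.1f} megaflops (pure Python, tiny n)")
```

The real benchmark runs the same kind of solve at enormous n across all nodes at once, which is why the data-distribution choices Ribbens describes matter so much.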

Preparing the facility

Just as critical as the hardware and software to the final success of the terascale project is the facility. “You can’t just empty the gym and put in 1,100 nodes,” Arvin says. Virginia Tech has 10,000 square feet of raised-floor air-conditioned computer space that has been used to house several generations of mainframes and distributed computing facilities since the 1980s.

As it turned out, they would only need 3,000 square feet. But the power had to be upgraded from 1.5 to 3 megawatts (one megawatt is enough electricity to power about 500 Virginia homes). An emergency generator system and 260 tons of additional cooling capacity were also added.

The biggest challenge was AC design. “The traditional system, which cools from under the floor, would have required nine large units and would have created 60 mph winds under the floor,” says Shinpaugh. “Liebert looked at our environment and recommended their new extreme high-density system, with external chillers and copper lines overhead to rack-mounted heat exchangers with R-134A refrigerant and an overhead chiller unit providing more than 2 million BTUs of cooling capacity. They also provided the custom-built racks for the computers.”

Arvin points out that the Liebert system saved a lot of space. “For example, Oak Ridge National Lab uses 40,000 square feet for their eight-megawatt computing facility. By that rule of thumb, we would have needed 20,000 square feet.”

All the pieces were brought together the second week of September by teams of volunteers – most of them students – who unpacked boxes, powered up machines, installed driver cards, transferred software, and put the machines into the racks at a rate of about 50 an hour – testing them several more times along the way. Another team connected power and communication fiber.

Within a week, the cluster was being brought up to speed and tested. By the first of October, Varadarajan was able to tell the IS group that Virginia Tech had a world-class computer. It was tested for the Top500 list in October, and the announcement came in November that Virginia Tech’s terascale facility ranked third in the world – behind the Earth Simulator in Japan and the ASCI-Q (Advanced Simulation and Computing Initiative) supercomputer at Los Alamos National Laboratory.