Chapter 3. Database Hardware Benchmarking

After all the theory in the last chapter about what makes some systems perform well or poorly, you might be wondering just how your own system measures up. There are several reasons to always do your own hardware benchmarks. The first is simply to be systematic about your performance process. If you always measure things yourself, you'll develop a better feel for what good and bad performance looks like, one that can tip you off to even subtle problems.

Second, in the case of disks in particular, problems here are a very common, underlying cause of database performance issues. If your disks are slow, and there are many ways that can happen, your database will likely be slow too. It's important when this happens to have accurate data on whether the problem is likely at the hardware or software level.

The goal of your basic hardware testing should be to look for large configuration errors, not to analyze every possible parameter. The sorts of problems you need to be most concerned about are really obvious. And detailed optimization is better done at the database application level.

CPU and memory benchmarking

The first thing worth testing on a new system is the speed of its memory, because if this is slow, both the CPU and disks will suffer accordingly. You might wonder why this is so important. It's because database work is intensive in both these areas. PostgreSQL works with database data in 8 KB pages, and it's constantly shuffling those pages in and out of system memory as needed to satisfy queries. And looking through those pages for the specific rows needed is CPU-intensive. Even on benchmarks that focus on disk-heavy work, the speed of the underlying memory can have a surprisingly large impact on results.

memtest86+

One of the most valuable tools in the PC hardware technician's bag is memtest86+, a program whose primary purpose is to find intermittent errors on PC memory. It's a great way to burn-in new hardware and confirm that it works as expected.

You can download memtest86+ directly from its website at http://www.memtest.org/ and create a bootable CD to run the program. Note that it's now included as one of the boot options for many Linux distributions, both on the installer media and when starting the OS normally. Any Ubuntu installer CD, for example, will include the version of memtest86+ that was current when that release was made. Generally, if the program does not fully recognize your CPU hardware information, you probably need a newer version to make sure you're getting accurate results.

Once you've started it from a bootable CD, memtest86+ reports the speed of memory at each of the cache levels it identifies. The most useful one is the MB/s value reported on the Memory: line. This is how fast access to the main system RAM is, and it's good for confirming that performance matches expected baselines. It can also be a quick way to confirm performance and stability if you're adjusting the clocking of your memory or CPU to try and speed things up.

One problem with this program is that it requires physical access to the system to run, which isn't always practical. And since memtest86+ only runs a single process to test memory, its reported speed isn't really representative of the total capabilities of modern multi-core processors. For that you'll need STREAM, which also runs easily from a remote session.

STREAM memory testing

STREAM is a memory bandwidth testing program that came from testing high-performance systems for scientific research. The program is hosted at http://www.cs.virginia.edu/stream/ and that site includes a variety of sample reports from the program too.

The STREAM project provides some binaries for several platforms, even Windows. The available Linux binary hasn't worked out well in my own testing. In addition, STREAM aims only to track the maximum bandwidth the system is capable of. For PostgreSQL use, you should also be interested in how much memory bandwidth a single process can reach. As individual queries in PostgreSQL will only run on one CPU, if that number is low you may be disappointed with how lone queries run, even on an otherwise idle server.

To work on both of these problems on Linux systems, my stream-scaling script, available at http://github.com/gregs1104/stream-scaling, tries to automate a more comprehensive view of your memory performance. It downloads the program, measures the total amount of cache in all the system processors, and automatically compiles STREAM to use a value much larger than that. It then loops from a single active thread (presumably running on a single core) upward until it's using every processor on the server.
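
Here's a minimal sketch of what running it looks like, assuming a Linux system with git and gcc installed and that the script keeps the same name as its repository; check the project's README if the invocation has changed:

git clone https://github.com/gregs1104/stream-scaling.git
cd stream-scaling
# Downloads and compiles STREAM sized larger than the total CPU cache,
# then reports MB/s starting at one thread and scaling up to every core
./stream-scaling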

Even if you're sold on this idea, it will take you a while to get your own library of reference performance data assembled. A starting set from systems whose results I've confirmed myself is included on the site, and a few examples from there are included next.

STREAM and Intel vs. AMD

The following table shows the STREAM speeds reported for a few servers, by processor and RAM type, as the number of cores utilized increases. All except for the one Opteron model (featuring eight sockets of six-core processors) are Intel systems. The RAM included here is all double data rate synchronous dynamic random access memory, more commonly known as DDR RAM. The two main standards in use in current hardware are DDR2 and DDR3, and the RAM DDR column shows which standard is used and what clock speed the memory is running at. The performance numbers shown for the different total core counts are in units of MB/s:

Processor             Cores  Frequency (GHz)  RAM DDR  1 Core  2 Cores  4 Cores  All Cores
T7200                 2      2.00             2/667    2965    3084              3084
Q6600                 4      2.40             2/800    4383    4537     4390     4390
Opteron 8431 (8 X 6)  48     2.40             2/800    4038    7996     13520    27214
Xeon E5506            4      2.13             3/800    7826    9016     9297     9297
i7 860                8      2.80             3/1600   9664    13096    14293    13231

This shows that on all of the recent systems with many cores, the bandwidth available to a single core is clearly limited relative to total system bandwidth. That particular problem has been more restrictive on AMD's systems than Intel's. Even though this large AMD system can achieve 27 GB/s when heavily tasked, a single process barely clears 4 GB/s.

This is exactly why there aren't more AMD processors included in this list. From mid-2008 to mid-2010, AMD lagged considerably behind Intel in terms of memory technology on systems with small numbers of processor sockets. They've fallen out of favor in the purchasing I've been involved in as a direct result. Note the large memory speed jump for single core results starting with the Intel 5500 series processor. That represents the introduction of Intel's Nehalem architecture. It included a shift to faster DDR3 RAM, among other major memory improvements, and was what pushed Intel far ahead for a solid two years.

An excellent graph showing just how wide a gap in memory performance this change introduced appears at http://www.advancedclustering.com/company-blog/stream-benchmarking.html. There you can see extensive results expanding on the preceding table. There's a giant leap starting with the Intel 5500 series, compared to both earlier Intel designs and any AMD designs still using DDR2. As described in the previous chapter, memory speed isn't everything if it comes at a high enough cost, or if you're disk bound. But it's why the servers I have the most data to report on recently have all been Intel based.

But as usual, AMD has recently bounced back. The latest AMD processors released as of this writing are the 8- and 12-core 6100 series processors using their Magny-Cours architecture, none of which are shown here yet. These processors have closed much of the gap with Intel's designs, including a shift to DDR3. Like earlier AMD designs, they are particularly good for systems where a large number of processors and banks of RAM are involved. Intel's fast server hardware is much more focused on systems with one or two processor sockets, while AMD servers with four or more sockets are easy to find. Early tests suggest that memory bandwidth may still be a bit slower than Intel's at low core counts. Check the stream-scaling site for the latest updates on processors tested for this particular aspect of their performance. But you should be collecting your own numbers, too!

CPU benchmarking

It's rather hard to find a CPU benchmark that is more representative of database performance than just using a database to do something processor-intensive. You can easily build some quick, PostgreSQL-oriented CPU tests using the database's psql client and its \timing feature, which shows you how long each statement takes to run. Here's an example that just exercises the CPU and memory, by adding the first million integers together with the always handy generate_series set returning function:

\timing
SELECT sum(generate_series) FROM generate_series(1,1000000);

Here's another, more complicated example that may involve some disk access too, in addition to stressing CPU and memory; whether it does depends on the amount of RAM in your server:

\timing
CREATE TABLE test (id INTEGER PRIMARY KEY);
INSERT INTO test SELECT generate_series(1,100000);
EXPLAIN ANALYZE SELECT COUNT(*) FROM test;

Both the insertion time and how long it takes to count each value are interesting numbers. The latter also includes some CPU/memory-intensive work related to updating the hint bit values PostgreSQL uses to track transaction visibility information; see Chapter 7, Routine Maintenance for more information about it.

Remember, the point of your CPU testing is not to map out a comprehensive view of its performance in every regard. What you should focus on is making sure performance matches that of similar hardware, and, if this is an upgrade, that it exceeds what you expect based on your older systems.

A more useful look at CPU capacity from a PostgreSQL perspective will be provided in Chapter 8, Database Benchmarking. Running a SELECT-only test using pgbench and more clients than your system has cores, when all the data fits in RAM, is a very good way to see if your CPUs are delivering expected performance. That's my preferred way to see how fast a new CPU I'm evaluating will work for real CPU and memory limited workloads. Just note that the pgbench in PostgreSQL versions before 9.0 can easily act as the bottleneck on results; more on this topic in that later chapter.
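
To give a rough preview before that chapter, a minimal run of that sort of test might look like the following on an 8-core server; the database name and scale here are just example choices (scale 100 produces a database of roughly 1.5 GB, small enough to fit in RAM on most current servers), and the -j worker thread option requires the 9.0 pgbench:

createdb pgbench
pgbench -i -s 100 pgbench
# SELECT-only test: 16 clients (more than the 8 cores), 4 pgbench worker
# threads, running for 60 seconds
pgbench -S -c 16 -j 4 -T 60 pgbench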

Sources of slow memory and processors

If your memory doesn't look to be as fast as it should be, or your CPU results look suspicious, there are a few common things to look for to figure out why.

Most memory is now designed to work in a dual-channel configuration, with pairs of modules put only into specific slots. If that's not done correctly, you'll halve memory speed by running in a single-channel setup. memtest86+ can note when this is happening, and your BIOS setup screens may also report the current channel configuration if you look for that information.
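
On a running Linux server, one way to spot-check how the memory slots are populated and what speed each module reports, without rebooting into memtest86+, is the dmidecode utility; the exact labels vary by BIOS vendor, so treat this as a sketch:

# List each physical memory slot, whether it's populated, and its speed;
# empty slots typically show "No Module Installed" for their size
sudo dmidecode --type memory | grep -E 'Locator|Size|Speed'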

Poor quality RAM can introduce a surprisingly large drop in system performance. And just because you have fast memory in your server, that doesn't mean the motherboard is taking advantage of it. The defaults on some systems are quite conservative. Nowadays, your system should be looking up Serial Presence Detect (SPD) information provided by your RAM, to determine how fast it should run. But that doesn't always default to optimal performance, and manual tweaking of the memory timing can be required.

Recent high-performance PC RAM aimed at desktop systems uses a newer standard for the same purpose, the Extreme Memory Profile (XMP) protocol, to communicate the speed it's capable of running at to the system BIOS when you boot. But if your BIOS doesn't default to checking and using XMP, which some don't, your RAM will run at notably non-"extreme" speeds. You should be able to find out how fast your RAM is expected to run as a series of timing values. The Intel i7 860 system mentioned before using DDR3-1600 has timing values of 8-8-8-24 for example. The motherboard did not run the RAM at those speeds until I'd adjusted several settings in it. And just because your hardware vendor should be taking care of all this for you doesn't mean it's safe to ignore this whole issue. It's easy for anyone to miss a step and ship a system with degraded performance.

Another problem that you can run into is using memory that doesn't work well with the clock speed of the processor you're using. Processors are often "locked" to certain multiplier possibilities relative to the speed the main system bus runs at. The motherboard ends up doing a complicated negotiation game between the processor and the RAM to find a common denominator speed to run everything at. For example, one of my older systems supported either DDR2-667 or DDR2-800 RAM, running at a memory clock of 333 or 400 MHz. The system processor ran at 2.4 GHz, and only supported limited multiplier combinations. It turned out that if I used DDR2-667, the common frequency the motherboard settled on ran the memory bus at 300 MHz, with the CPU using an 8X multiplier. So the RAM was essentially 10% under-clocked relative to its capabilities. Upgrading to DDR2-800 instead used a 400 MHz clock and a 6X CPU multiplier. That's a 33% jump in memory speed just from using a better grade of RAM to better match the CPU clock possibilities, and overall system performance improved proportionately.

In addition to getting the memory and multiplier details right, processor power management is an increasing source of issues when benchmarking hardware. Many operating systems now ship with modest or aggressive processor power management active by default. This is a surprisingly common issue on Linux, for example. The normal warning sign is that the processor is only shown as running at 1 GHz in /proc/cpuinfo, with correspondingly slow results on some benchmarks. Normally you will need to adjust the Linux CPU governor setup to optimize for performance, rather than lower power use, at least for the duration of the benchmark. Exact details for how to adjust the governor vary by Linux distribution. You may want to return to optimizing for lower energy use at the expense of some performance afterwards, once you've confirmed performance can be good when needed.
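
Here's a sketch of what checking and adjusting that looks like through the generic cpufreq sysfs interface; the paths and the set of available governors depend on your kernel, and your distribution may prefer its own tool (such as cpufreq-set from cpufrequtils) for the same job:

# See what clock speed each core currently reports
grep MHz /proc/cpuinfo

# Show the governor currently active on the first CPU
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# As root, switch every core to the performance governor for the benchmark
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
do
  echo performance > $g
done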
