Optimizing Your C/C++ Applications


Post by Jane » Thu Sep 03, 2009 4:49 am

  1. Declare as static functions that are not used outside the file where they are defined

    Declaring a function as static gives it internal linkage within that file, which can improve the performance of the code. Functions that are not declared static default to external linkage, which may inhibit certain optimizations, such as aggressive inlining, with some C/C++ compilers.
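As a minimal sketch (the helper name is made up), a file-local function declared static gives the optimizer a complete picture of its callers:

```c
/* Hypothetical file-local helper: the static keyword gives it internal
   linkage, so the compiler knows every call site lives in this file
   and is free to inline it or discard the out-of-line copy entirely. */
static int clamp(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}
```

Without static, the compiler must usually keep a callable copy with external linkage, since another translation unit might reference it.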
  2. Use array notation instead of pointer notation when working with arrays

    You can use either array operators ([]) or pointers to access the elements of an array. However, it's hard for the compiler to optimize pointers because the pointers are, by default, less constrained—the compiler has to assume that the pointer can read/write to any location in memory. Because array notation is less ambiguous, the compiler may be able to optimize it more effectively.

    Because it's hard to know which way is more efficient, you may want to try coding your array reference using both pointers and array notation, and use a code profiler to see which way is fastest for your particular application.
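For example, both versions below compute the same sum, one with array notation and one with pointer notation; which compiles to faster code varies by compiler, so profile both (the function names are just for illustration):

```c
/* Array notation: the index form is less ambiguous to the optimizer. */
static double sum_array(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer notation: equivalent, but the walking pointer can be harder
   for some compilers to analyze. */
static double sum_pointer(const double *a, int n) {
    double s = 0.0;
    for (const double *p = a; p < a + n; p++)
        s += *p;
    return s;
}
```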
  3. Completely unroll loops that have a small fixed loop count and a small loop body

    As you know, many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, because you're avoiding the loop overhead; for a small loop, that overhead can be significant.

    For example, avoid a small loop like this:

    Code:

    // 3D-transform: Multiply vector V by 4x4 transform matrix M.
    for (i = 0; i < 4; i++) {
       r[i] = 0;
       for (j = 0; j < 4; j++) {
          r[i] += m[j][i] * v[j];
       }
    }
    Instead, replace it with its completely unrolled equivalent, as shown here:

    Code:

    r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
    r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
    r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
    r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
  4. Consider expression order when coding a compound branch

    Complex branches—those which consist of many Boolean expressions with logical AND (&&) and logical OR (||) operators—provide great opportunities for optimization, if you know which AND and OR conditions are more likely to occur.

    That's because the C/C++ compiler generates code that stops evaluating a compound condition as soon as the outcome is known. For example, an OR expression is known to be true as soon as one of its operands evaluates to true; similarly, an AND expression is known to be false as soon as one of its operands evaluates to false.

    You might experiment with changing the order of the operands if you know that a true or false outcome is more probable for a specific operand. For example, in an AND expression, place the operand most likely to be false up front. Of course, you'll need to make sure that reordering the operands won't cause any unwanted side effects.
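A small sketch of the idea (both predicates are hypothetical): in an && chain, the cheap test that usually fails goes first, so the expensive call is usually skipped:

```c
#include <string.h>

/* Hypothetical checks: suppose long inputs are rare, so the cheap
   length test is usually false and short-circuits the && before the
   costlier scan ever runs. */
static int is_long(const char *s)    { return strlen(s) > 4096; }
static int has_marker(const char *s) { return strchr(s, '@') != NULL; }

static int needs_review(const char *s) {
    /* Most-likely-false operand first: && bails out early. */
    return is_long(s) && has_marker(s);
}
```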
  5. Leverage the early termination abilities of complex If statements

    As seen above, as soon as a true operand is found in an OR expression, or a false one is found in an AND expression, evaluation of that expression stops because the outcome is known. If you have a good idea of whether an entire expression is likely to be true or false, you can leverage that to short-cut evaluation of the condition.

    For example, say you're testing to see if a and b are going to be within a specified numeric range. If you know that those numbers will generally be within that range, you can use the following because the most likely outcome will evaluate to false and thereby terminate the AND condition:

    Code:

      if (a <= max && a >= min && b <= max && b >= min) 
    If you know that the data will generally not be in a specified range, you can logically reverse the If statement to optimize execution in the same way, with a true terminating the OR condition:

    Code:

      if (a > max || a < min || b > max || b < min) 
  6. Use if-else statements in place of switch statements that have noncontiguous case expressions

    If the case expressions are contiguous or nearly contiguous integer values, most compilers translate the switch statement into a jump table instead of a comparison chain. Jump tables generally improve performance because they reduce the branching to a single indirect jump, and the size of the control-flow code stays the same no matter how many cases there are. The amount of control-flow code that the processor must execute is also the same for all values of the switch expression.

    However, if the case expressions are noncontiguous values, most compilers translate the switch statement as a comparison chain. Comparison chains are undesirable because they use dense sequences of conditional branches, which interfere with the processor's ability to successfully perform branch prediction. Also, the amount of control-flow code increases with the number of cases, and the amount of control-flow code that the processor must execute varies with the value of the switch expression.

    For example, if the case expressions are contiguous (or nearly contiguous) integer values, a switch statement can provide good performance:

    Code:

    switch (grade)
    {
       case 'A':
          ...
          break;
       case 'B':
          ...
          break;
       case 'C':
          ...
          break;
       case 'D':
          ...
          break;
   case 'F':
      ...
      break;
}
    But if the case expressions aren't contiguous, the compiler will likely translate the code into a comparison chain instead of a jump table, and that can be slow:

    Code:

    switch (a)
    {
       case 8:
          // Sequence for a==8
          break;
       case 16:
          // Sequence for a==16
          break;
       ...
       default:
          // Default sequence
          break;
    }
    In cases like this, replace the switch with a series of if-else statements:

    Code:

    if (a==8) {
       // Sequence for a==8
    }
    else if (a==16) {
       // Sequence for a==16
    }
    ...
    else {
       // Default sequence
    }
  7. Sort and pad C and C++ structures to achieve natural alignment

    If your compiler allows it, pad data structures to make their sizes a multiple of a word, doubleword, or quadword.

    The best approach is to sort the structure members according to their type sizes, declaring members with larger type sizes first. Then, pad the structure so the size of the structure is a multiple of the largest member's type size.

    For example, you should avoid structure declarations in which the members are not declared in order of their type sizes and the size of the structure is not a multiple of the size of the largest member's type. This code is suboptimal:

    Code:

    struct {
       char a[5];  // Smallest type size (1 byte * 5)
       long k;     // 4 bytes in this example
       double x;   // Largest type size (8 bytes)
    } baz;
    A better approach is:

    Code:

    struct {
       double x;     // Largest type size (8 bytes)
       long k;       // 4 bytes in this example
       char a[5];    // Smallest type size (1 byte * 5)
       char pad[7];  // Make structure size a multiple of 8.
    } baz;
    Your compiler will use padding to "naturally align" elements by default. Keep in mind that pointer-size members grow when you compile for 64-bit targets, which can change the structure's size and padding and cause unnecessary "data bloat" unless you take care to avoid it.
  8. Declare single-precision float constants using f

    Most C and C++ compilers treat floating-point constants and arguments as double precision unless you specify otherwise. While that obviously gives you more accuracy, single precision occupies half the memory space and can often provide the precision necessary for a given computational problem.

    So, if you need a constant for the conversion factor between millimeters and inches, you might define it as 25.4f instead of as 25.4, to provide a single-precision constant value that may be adequate for your task.
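A small illustration (the function is hypothetical): the f suffix is what keeps the constant, and therefore the whole computation, in single precision:

```c
/* 25.4 is a double constant (8 bytes); 25.4f is a float constant
   (4 bytes). The f suffix keeps this conversion entirely in single
   precision, avoiding a promotion to double and back. */
static float mm_to_inches(float mm) {
    const float MM_PER_INCH = 25.4f;
    return mm / MM_PER_INCH;
}
```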
  9. Restructure floating-point math to reduce the number of operations, if possible

    Floating-point operations have long latencies, even when you're using the x87 unit (in 32-bit mode) or the SSE/SSE2 instruction set extensions (in 32-bit or 64-bit mode), so those operations are a natural target for optimization, particularly given the depth of the AMD Opteron and Athlon 64 processors' instruction pipelines.

    However, you should be careful because even when algebraic rules would permit you to transform one floating-point operation into another, the algorithms within a microprocessor may not yield exactly the same results down to the least significant bits. You should consult a book on numerical analysis or experiment to make sure that it's okay to tinker with the math.

    Here's an optimization example that involves the concept of data dependencies. This example, recommended by AMD, uses four-way unrolling to exploit the four-stage, fully pipelined floating-point adder: each stage of the adder is occupied on every clock cycle, ensuring maximum sustained utilization. Mathematically, it's equivalent; computationally, it's faster.

    The original code:

    Code:

    double a[100], sum;
    int i;
    sum = 0.0;
    for (i = 0; i < 100; i++) {
       sum += a[i];
    }
    This version is faster, because the code implements four separate dependence chains instead of just one, which keeps the pipeline full. The /fp:fast compiler switch in Visual Studio 2005 can help do this sort of optimization automatically.

    Code:

    double a[100], sum1, sum2, sum3, sum4, sum;
    int i;
    sum1 = 0.0;
    sum2 = 0.0;
    sum3 = 0.0;
    sum4 = 0.0;
    for (i = 0; i < 100; i += 4) {
       sum1 += a[i];
       sum2 += a[i+1];
       sum3 += a[i+2];
       sum4 += a[i+3];
    }
    sum = (sum4 + sum3) + (sum1 + sum2);
  10. Manually extract subexpressions from floating-point operations

    Because the C/C++ compiler won't change your math around, you'll have to apply algebraic optimizations yourself. As mentioned earlier, there's a slight chance that changing the way that the math is performed may change the least-significant bits of the results, so always exercise caution when performing this type of optimization. That /fp:fast compiler switch in Visual Studio 2005 can do some of these optimizations automatically, too.

    Here's an example of code that can be optimized by removing a common subexpression. Before:

    Code:

    double a, b, c, d, e, f;
    e = b * c / d;
    f = b / d * a;
    And after:

    Code:

    double a, b, c, d, e, f, t;
    t = b / d;
    e = c * t;
    f = a * t;
    Here's another example. Before:

    Code:

    double a, b, c, e, f;
    e = a / c;
    f = b / c;
    And after:

    Code:

    double a, b, c, e, f, t;
    t = 1.0 / c;
    e = a * t;
    f = b * t;
  11. Use 32-bit integers instead of 8-bit or 16-bit integers

    This is a tip specifically for 32-bit applications running either natively on a 32-bit operating system, or within a 64-bit operating system. You'll see better performance because operations on 8-bit or 16-bit integers are often less efficient. Of course, you'll be using twice or four times the memory, so be aware of that trade-off.

    By the way, 32-bit integers execute at full speed when you're running in 64-bit mode, so unless you need the extra bits for some application-specific reason, you should stick to 32-bit ones.
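For instance (types chosen for illustration), prefer a 32-bit index and accumulator even when the data itself is 8-bit:

```c
#include <stdint.h>

/* The data stays 8-bit, but the index and the accumulator are full
   32-bit integers, avoiding the partial-width operations that are
   often slower on x86 processors. */
static int32_t sum_bytes(const uint8_t *p, int32_t n) {
    int32_t total = 0;
    for (int32_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```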
  12. If you don't need signed data types, note that operations on unsigned types can be faster

    In many cases—such as with counters, array indexes, or quotients/remainders after division—it's almost certain that your integer will be non-negative. In such cases, operations on unsigned integers may be more efficient.

    However, if you're using those integers in mixed-mode arithmetic (that is, integer combined with floating point), AMD advises that integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, because the AMD64 architecture provides native instructions for converting signed integers to floating point but none for converting unsigned integers.
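A sketch of the conversion point AMD is describing (the function name is illustrative): a signed int feeding floating-point math can use the native signed-convert instruction directly:

```c
/* Per AMD's guidance, the signed int -> double conversion here has
   direct hardware support on AMD64 (a single convert instruction);
   an unsigned source of the same width needs extra fix-up code. */
static double scaled_total(int count, double step) {
    return count * step;
}
```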
  13. Replace integer division with multiplication when there are multiple divisions in an expression

    Integer division is the slowest of all the integer arithmetic operations on most microprocessors, including the AMD Opteron and Athlon64 processors. So, multiply instead of dividing! (Be careful to ensure that you won't have an overflow during the computation of the divisors.)

    Restructure code that uses two integer division operations like

    Code:

    int i, j, k, m;
    m = i / j / k;
    so that one of the integer divisions is replaced with the equivalent multiplication, which is faster:

    Code:

    int i, j, k, m;
    m = i / (j * k);
  14. Use SSE inline instructions for performing single-precision division and square root operations

    Division and square root operations have a much longer latency than other 32-bit floating-point operations, advises AMD, and in some application programs, these operations occur so often as to seriously impact performance. It's often faster to drop to SSE inline instructions to perform that operation. (If your compiler can optimize to the SSE instruction set extensions, it may perform this task for you.)

    If your 32-bit code has hot spots that use single-precision arithmetic only, you can set the x87 floating-point unit's precision-control field to single precision, overriding Windows' default of double precision for those operations. Because the processor's latency for single-precision division and square root is lower than for double precision, this saves you some type conversions and cycles.
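As a sketch using standard SSE intrinsics (whether this actually beats your compiler's own code generation is target-dependent, so measure):

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_div_ss, _mm_sqrt_ss */

/* Scalar single-precision divide done directly with the SSE DIVSS
   instruction rather than x87 double-precision division. */
static float sse_div(float a, float b) {
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

/* Scalar single-precision square root via the SSE SQRTSS instruction. */
static float sse_sqrt(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}
```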