Thursday, October 06, 2005

Microsoft Visual Studio 2005 Team System, Episode III (Microsoft Visual C++ 2005 New Features, Part B)

Now it's time for part B of the Visual C++ 2005 new features. So what else is new?!

3- Additions for optimizing the generated applications:


Another big part of Visual C++ 2005 is optimization. In this part you'll learn all about PGO (Profile Guided Optimization).

We all know about the aggressive optimizations already offered in Visual C++ 2002 and 2003. I don't know whether they were modified in this release, but I'd like to talk about PGO now because it's a really nice feature.


My sources for this article are MSDN and a private seminar given by one of the VC++ team members at my old company, Sakhr. You can find the original MSDN article at:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/profileguidedoptimization.asp


Now let's jump into the details. To understand how PGO works, you need a quick idea of how normal compiler optimization works. Look at this sample:





int setArray(int a, int * array)
{
    int x;
    for (x = 0; x < a; x++)
        array[x] = 0;

    return x;
}


From this file the compiler knows nothing about the potential values of "a" (other than that it is an int), nor does it know anything about the typical alignment of array.
This compiler/linker model is not particularly bad, but it misses two major opportunities for optimization. First, it doesn't exploit information it could gain from analyzing all the source files together; second, it doesn't make any optimizations based on the expected or profiled behavior of the application. Those are exactly the gaps that WPO (Whole Program Optimization) and PGO (Profile Guided Optimization) fill.

Before getting to PGO, let's first look at how traditional WPO works in Visual C++.

Since Visual C++ 7.0 (including all recent versions of the Itanium® compiler), Visual C++ has supported a mechanism known as link-time code generation (LTCG).

LTCG is a technology that allows the compiler to effectively compile all the source files as a single translation unit. This is done in a two-step process:

1- The compiler compiles each source file and emits the result as an intermediate language (IL) in the generated .obj file, rather than as standard object code. It's worth noting that this IL is not the same thing as MSIL (which is used by the Microsoft® .NET Framework).

2- When the linker is invoked with the /LTCG switch, it invokes the compiler backend to compile all the code that was compiled with WPO. All of the IL from the WPO .obj files is aggregated, and a call graph of the complete program can be generated. From this, the compiler backend and linker compile the whole program and link it into an executable image.

With WPO, the compiler knows the structure of the whole program, so certain types of optimizations become more effective.
For example, with traditional compiling/linking, the compiler cannot inline a function from source file foo.cpp into source file bar.cpp, because when compiling bar.cpp it has no information about foo.cpp. With WPO the compiler has both bar.cpp and foo.cpp (in IL form) available, and can make optimizations that would ordinarily not be possible (such as cross-translation-unit inlining).

To build your program with LTCG, first compile it with the whole program optimization switch, /GL.
Then link it with the /LTCG switch.
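To make the foo.cpp/bar.cpp example concrete, here is a minimal sketch. The names square, sumOfSquares and sumOfSquaresInlined are hypothetical: imagine square living in foo.cpp and sumOfSquares in bar.cpp (they're in one block here only so it compiles on its own). The third function shows by hand roughly what cross-translation-unit inlining lets the compiler produce once /GL and /LTCG give it both bodies. The build commands in the comments only use the switches described above plus /O2; a real project would pass more.

```cpp
// Build sketch, assuming the two functions live in separate files:
//   cl /c /O2 /GL foo.cpp bar.cpp             (emit IL instead of object code)
//   link /LTCG foo.obj bar.obj /OUT:app.exe   (whole-program code generation)

// --- would live in foo.cpp ---
int square(int v)
{
    return v * v;
}

// --- would live in bar.cpp ---
// Without WPO, bar.cpp only sees a declaration of square(),
// so every iteration pays for a real function call.
int sumOfSquares(const int* values, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += square(values[i]);
    return total;
}

// Hand-written equivalent of what cross-TU inlining can produce:
// the body of square() is substituted directly at the call site.
int sumOfSquaresInlined(const int* values, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += values[i] * values[i];
    return total;
}
```

Both versions compute the same result; the difference is only in the call overhead and in the further optimizations (like vectorizing or unrolling the loop) the compiler can do once it sees the callee's body.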

Your generated executable is now usually faster. Building with these switches takes more memory and time, but the final output is usually worth it.

But what if you want more optimization?!
It's PGO's turn now.

You've gained some performance from WPO through the /LTCG switch, but you can gain significant additional performance by using PGO on top of LTCG.
The idea behind PGO is simple: you run your application several times, and each run produces a profile recording how many times each function was executed and which parts of your code are used most. You then use these profiles to rebuild your application.
The whole process is done in 3 main stages:

1- Compile into instrumented code (first compilation phase).

2- Train instrumented code (profiles generation phase).

3- Re-compile into optimized code (based on generated profiles).
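The three stages can be imitated by hand in a toy program. This is a sketch under obvious assumptions, not what the compiler emits: the Profile map plays the role of the profile data, the hypothetical functions parseRecord and reportError bump their own counters (stage 1), runScenario is one training run (stage 2), and instead of a rebuild, hotFunctions just reports which functions an optimizer would treat as hot (stage 3). The command sequence in the comment uses the switches from this article.

```cpp
#include <map>
#include <string>
#include <vector>

// Real Visual C++ 2005 sequence, for comparison:
//   cl /c /O2 /GL app.cpp
//   link /LTCG:PGI /PGD:app.pgd app.obj /OUT:app.exe   (1: instrument)
//   app.exe   (run once per training scenario)         (2: train)
//   link /LTCG:PGO /PGD:app.pgd app.obj /OUT:app.exe   (3: optimize)

// Toy "profile": function name -> number of times it executed.
typedef std::map<std::string, long> Profile;

// Stage 1 stand-in: "instrumented" functions count their own executions.
int parseRecord(Profile& prof, int x) { ++prof["parseRecord"]; return x * 2; }
int reportError(Profile& prof, int x) { ++prof["reportError"]; return -x; }

// Stage 2 stand-in: one training scenario over some input.
void runScenario(Profile& prof, const std::vector<int>& input)
{
    for (int value : input) {
        if (value >= 0)
            parseRecord(prof, value);   // common path
        else
            reportError(prof, value);   // rare path
    }
}

// Stage 3 stand-in: read the profile and pick the functions that were
// executed at least `threshold` times.
std::vector<std::string> hotFunctions(const Profile& prof, long threshold)
{
    std::vector<std::string> hot;
    for (const auto& entry : prof)
        if (entry.second >= threshold)
            hot.push_back(entry.first);
    return hot;
}
```

With the input {1, 2, -1, 3}, parseRecord runs three times and reportError once, so with a threshold of 2 only parseRecord is reported as hot.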

Here is a graphic representation of what the whole process looks like:


Now let's walk through these 3 stages:

1- Compiling instrumented code:

The first phase is to instrument the code. To do this, first compile the source files with WPO (/GL). Then link all of the application's files with the /LTCG:PGINSTRUMENT switch (which can be abbreviated as /LTCG:PGI). Note that not all files need to be compiled with /GL for PGO to work on the application as a whole: PGO instruments the files compiled with /GL and leaves the others uninstrumented.

Instrumenting your code means inserting probes into it. There are two main types of probes: those that collect flow information and those that collect value information.
The result of linking /LTCG:PGI will be an executable or DLL and a PGO database file (.PGD). By default the PGD file takes the name of the generated executable, but the user can specify the name of the PGD file when linking with the /PGD:filename linker option.
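To picture the two probe types, here is the setArray() sample from above instrumented by hand. This is only an illustration: a /LTCG:PGI build inserts its own probes automatically, and the SetArrayProbes struct and its field names are made up. The flow probes count how often the function and its loop body run; the value probe records which values the argument a actually takes.

```cpp
#include <map>

// Hypothetical probe storage for setArray().
struct SetArrayProbes {
    long entries = 0;              // flow probe: times the function was entered
    long loopBodyRuns = 0;         // flow probe: times the loop body executed
    std::map<int, long> aValues;   // value probe: histogram of the argument a
};

int setArrayInstrumented(int a, int* array, SetArrayProbes& probes)
{
    ++probes.entries;
    ++probes.aValues[a];           // which values does a typically take?
    int x;
    for (x = 0; x < a; x++) {
        ++probes.loopBodyRuns;     // how hot is the loop?
        array[x] = 0;
    }
    return x;
}
```

If the value histogram showed that a is almost always 4, for example, an optimizer could special-case that value; the flow counts tell it how much the loop matters in the first place.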


2- Training of instrumented code:

To train your code and generate profiles that really reflect how your application is used, run it several times with scenarios that mirror real-life usage. After each scenario a PGO count file (.PGC) is generated; it takes the name of the .PGD file plus a number identifying the scenario.
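Because every scenario produces its own count file, the counts have to be combined before the optimizing link. The sketch below models that idea with plain maps: each Counts map stands in for one .PGC file and the combined map for the data in the .PGD file. The summing rule and the names are assumptions for illustration, not the actual file format or merge logic.

```cpp
#include <map>
#include <string>

// Function name -> execution count (stand-in for one scenario's .PGC data).
typedef std::map<std::string, long> Counts;

// Add one scenario's counts into the combined profile.
void mergeScenario(Counts& combined, const Counts& scenario)
{
    for (const auto& entry : scenario)
        combined[entry.first] += entry.second;
}
```

The point of running several scenarios is visible here: a function that is cold in one scenario but hot in another still ends up with a meaningful total.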


3- Recompiling the optimized code:

The last step is to recompile your application using the profiles generated by the different scenarios you ran. This time, link your application with the /LTCG:PGOPTIMIZE (or /LTCG:PGO) switch, which uses the generated profiles to create the optimized application.

Now that we know briefly how to use PGO, the most important question is: how does PGO actually help optimize applications?
That's the next section.

PGO helps optimize applications in several ways:

1- Inlining:

As described earlier, WPO gives the application the ability to find more inlining opportunities. With PGO this is supplemented with additional information to help make this determination. For example, examine the call graph in Figures 2, 3, and 4 below.
In Figure 2 we see that a, foo, and bat all call bar, which in turn calls baz.


Figure 2. The original call graph of a program



Figure 3. The measured call frequencies, obtained with PGO




Figure 4. The optimized call-graph based on the profile obtained in Figure 3
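The kind of decision Figures 3 and 4 illustrate can be sketched as a filter over call sites. This is a hypothetical heuristic for illustration only (Visual C++'s real inlining heuristics weigh many more factors, such as callee size): each CallSite carries the number of times that call executed during training, and only sites above a threshold are chosen for inlining.

```cpp
#include <string>
#include <vector>

struct CallSite {
    std::string caller;
    std::string callee;
    long count;   // how often this call executed during training
};

// Keep only the call sites hot enough to pay for the code growth of inlining.
std::vector<CallSite> sitesToInline(const std::vector<CallSite>& sites, long minCount)
{
    std::vector<CallSite> chosen;
    for (const CallSite& s : sites)
        if (s.count >= minCount)
            chosen.push_back(s);
    return chosen;
}
```

With made-up counts such as a->bar 10, foo->bar 75, bat->bar 15 and bar->baz 100, a threshold of 50 inlines bar only into foo, and baz into bar, instead of copying bar into every caller.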


2- Partial Inlining:

Next is an optimization that is at least partially familiar to most programmers. In many hot functions, there exist paths of code within the function that are not so hot; some are downright cold. In Figure 5 below, we will inline the purple sections of code, but not the blue.



Figure 5. A control flow graph, where the purple nodes get inlined, while the blue node does not
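The compiler does this on its own from the profile, but the shape of the result can be written at source level. In this sketch (lookup and slowPathLookup are hypothetical names, and the loop is just a stand-in for expensive work), the small hot path is what gets inlined into callers, while the rarely taken slow path stays behind as an ordinary out-of-line call, so inlining the hot part doesn't drag the cold code along.

```cpp
// Cold part: stays a real call and is not copied into callers.
int slowPathLookup(int key)
{
    int result = 0;
    for (int i = 0; i < key; ++i)   // stand-in for expensive work
        result += i;
    return result;
}

// Hot part: small enough to inline at every call site.
inline int lookup(int key)
{
    if (key < 4)                    // hot path according to the profile
        return key * 10;
    return slowPathLookup(key);     // cold path: out-of-line call
}
```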



3- Cold Code Separation:

Code blocks that are never executed during profiling (cold code) are moved to the end of the set of sections. Thus, according to the profile information, the pages in the working set usually consist of instructions that will actually be executed.



Figure 6. Control flow graph showing how the optimized layout moves basic blocks together that are used more often, and cold basic blocks further away.
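As a toy model of cold code separation (not the linker's real algorithm), the sketch below takes basic blocks with hypothetical names and profiled execution counts and moves every block that never ran to the end, keeping the original order within each group. std::stable_partition does exactly that split.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Block {
    std::string name;
    long count;   // executions observed during training
};

// Executed blocks first, never-executed (cold) blocks last; relative order
// inside each group is preserved.
std::vector<Block> separateColdCode(std::vector<Block> blocks)
{
    std::stable_partition(blocks.begin(), blocks.end(),
                          [](const Block& b) { return b.count > 0; });
    return blocks;
}
```

For blocks entry, error, loop, cleanup, exit where error and cleanup never ran, the result is entry, loop, exit, error, cleanup.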


4- Size/Speed Optimization:

Functions that are called more often can be optimized for speed while those that are called less frequently get optimized for size. This tends to be the right tradeoff.
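A minimal sketch of such a policy, assuming a simple rule of my own: a function whose share of all profiled calls is at least hotPercent is optimized for speed (think /O2), everything else for size (think /O1). The percentage rule and the names are illustrative assumptions, not Visual C++'s actual heuristic.

```cpp
#include <map>
#include <string>

enum class OptGoal { Speed, Size };

std::map<std::string, OptGoal>
chooseGoals(const std::map<std::string, long>& callCounts, double hotPercent)
{
    long total = 0;
    for (const auto& fc : callCounts)
        total += fc.second;

    std::map<std::string, OptGoal> goals;
    for (const auto& fc : callCounts) {
        // Share of all profiled calls that went to this function.
        double share = total > 0 ? 100.0 * fc.second / total : 0.0;
        goals[fc.first] = share >= hotPercent ? OptGoal::Speed : OptGoal::Size;
    }
    return goals;
}
```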

5- Block Layout:

In this optimization, we form the hottest paths through a function, and lay them out such that hot paths are spatially located closer together. This can increase the utilization of the instruction cache and decrease the working set size and number of pages used.
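Here is a simple greedy version of the idea, offered as a sketch rather than the compiler's actual algorithm: start at the entry block (assumed to be the first block in the list), keep placing the not-yet-placed successor reached by the hottest profiled edge, and append whatever was never reached in its original order. Block names and counts in the test are made up.

```cpp
#include <set>
#include <string>
#include <vector>

struct Edge {
    std::string from, to;
    long count;   // times this edge was taken during training
};

std::vector<std::string> layoutBlocks(const std::vector<std::string>& blocks,
                                      const std::vector<Edge>& edges)
{
    std::vector<std::string> order;
    std::set<std::string> placed;
    if (blocks.empty())
        return order;

    // Follow the hottest outgoing edge from each placed block.
    std::string current = blocks.front();
    while (true) {
        order.push_back(current);
        placed.insert(current);
        const Edge* best = nullptr;
        for (const Edge& e : edges)
            if (e.from == current && !placed.count(e.to) &&
                (!best || e.count > best->count))
                best = &e;
        if (!best)
            break;
        current = best->to;
    }

    // Blocks not on the hot chain keep their source order at the end.
    for (const std::string& b : blocks)
        if (!placed.count(b))
            order.push_back(b);
    return order;
}
```

With entry branching to common (95) or rare (5), both of which flow into exit, the layout becomes entry, common, exit, rare: the hot path is contiguous and the rare block is pushed out of the way.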

6- Virtual Call Speculation:

Virtual calls can be expensive because of the indirect jump through the vtable needed to invoke the method. With PGO, the compiler can speculate at the call site of a virtual call and inline the method of the speculated object into the call site; the data to make this decision is gathered by the instrumented application. In the optimized code, the inlined body is protected by a guard that checks whether the object's actual type matches the speculated derived type.

The following pseudocode shows a base class, two derived classes, and a function invoking a virtual function:


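class Base
{
public:
    virtual ~Base() {}
    // ...
    virtual void call() { /* ... */ }
};

class Foo : public Base
{
public:
    // ...
    void call() { /* ... */ }
};

class Bar : public Base
{
public:
    // ...
    void call() { /* ... */ }
};

// This is the Func function before PGO has optimized it.
void Func(Base *A)
{
    // ...
    while (true)
    {
        // ...
        A->call();   // virtual call through the vtable
        // ...
    }
}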

The code below shows the result of optimizing the above code, given that the dynamic type of "A" is almost always Foo.

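#include <typeinfo>

// This is the Func function after PGO has optimized it.
// (typeid is used only to express the guard in source form.)
void Func(Base *A)
{
    // ...
    while (true)
    {
        // ...
        if (typeid(*A) == typeid(Foo))
        {
            // body of Foo::call() inlined here
        }
        else
            A->call();   // any other type: ordinary virtual call
        // ...
    }
}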
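Since the fragments above elide their bodies and loop forever, here is a complete, self-contained version of the same before/after pair that actually runs. It's a hand-written illustration, not compiler output: each call() returns a marker value so the two versions can be compared, the loop is bounded by a hypothetical iterations parameter, and the guard is written with typeid, while the compiler's own guard lives in the generated code.

```cpp
#include <typeinfo>

class Base {
public:
    virtual ~Base() {}
    virtual int call() const = 0;
};

class Foo : public Base {
public:
    int call() const override { return 1; }
};

class Bar : public Base {
public:
    int call() const override { return 2; }
};

// Before: every iteration goes through the vtable.
int FuncPlain(const Base* A, int iterations)
{
    int sum = 0;
    for (int i = 0; i < iterations; ++i)
        sum += A->call();
    return sum;
}

// After (by hand): guard on the speculated type Foo, use Foo::call's
// inlined body on the fast path, fall back to the virtual call otherwise.
int FuncSpeculated(const Base* A, int iterations)
{
    int sum = 0;
    for (int i = 0; i < iterations; ++i) {
        if (typeid(*A) == typeid(Foo))
            sum += 1;              // inlined body of Foo::call()
        else
            sum += A->call();      // unexpected type: ordinary virtual call
    }
    return sum;
}
```

Both functions return the same result for a Foo or a Bar; speculation only changes how fast the common case runs, never what it computes.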

4- Changes in compiler switches:

You can find them all here:
Last but not least, I forgot to mention that they have added preprocessor directives (the OpenMP #pragma directives, enabled with the /openmp switch) that let your code run in parallel on the fly when the machine running your application has more than one processor :)
Next time I'll be talking about additions to the C# language, so stay tuned.
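Assuming the directives in question are the OpenMP ones that Visual C++ 2005 supports through /openmp, a minimal parallel loop looks like this. The function name is hypothetical. Built with /openmp, the iterations are split across the available processors and each thread's partial sum is added into total by the reduction clause; built without it, the pragma is simply ignored and the loop runs serially with the same result.

```cpp
// Sum of squares, parallelized with OpenMP when compiled with /openmp.
long long sumOfSquaresParallel(const int* values, int n)
{
    long long total = 0;
    // Each thread accumulates a private copy of total; OpenMP combines
    // the copies with + when the loop finishes.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += static_cast<long long>(values[i]) * values[i];
    return total;
}
```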


Catch you later.....

4 comments:

Mohamed Moshrif said...

In the first code block, the loop should look like this:

for (int x = 0; x < a; x++)

but Blogspot won't let me enter < or write it using HTML encoding!!!!!

Anonymous said...

Very exciting
I know C++ concepts, but I've never done a project with it.
Now I want to learn VC++.
How can I do that? How and where can I start?
I really regret not considering it and spending my time only on Oracle, but I really want to learn it.
Please help

Mohamed Moshrif said...

Try:

http://www.cprogramming.com/

Anonymous said...

Thank you