Monday, May 9, 2016

Size cost of C++ exception handling on an embedded platform

Introduction

There is a common belief that C++ exception handling code significantly increases binary image size ("up to 100kb"). In this article, I want to check that claim and find out how much image size the C++ exception handling code takes on its own.


To obtain verifiable results, I will use the following test method:
  1. Build the toolchain (binutils, gcc, newlib) from source, because third-party toolchains may interfere with the results;
  2. Create and compile a small program that uses exceptions;
  3. Execute it on a simulator to verify that the program actually works.
I chose ARMv4T, aka ARM7TDMI, as the embedded target platform and QEMU as the software simulator. I took the ideas for bare metal simulation from here.

Test program

The code repo is here. There are two versions of the program for comparison: one that uses exceptions and one that doesn't.
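
For illustration, the exception-using variant could be as small as the sketch below. This is not the code from the repo: SensorError, read_sensor and run_test are hypothetical names, and the real test program reports its result over the UART instead of returning a value.

```cpp
// Hypothetical error type; the real test program may throw something else.
struct SensorError {
    int code;
};

// Throws when asked to fail, otherwise returns a dummy reading.
int read_sensor(bool fail) {
    if (fail) {
        throw SensorError{42};
    }
    return 7;
}

// Returns the reading, or the negated error code if an exception was caught.
int run_test(bool fail) {
    try {
        return read_sensor(fail);
    } catch (const SensorError& e) {
        return -e.code;
    }
}
```

The no-exceptions variant would report the error through a return code instead, so the difference in image size comes mostly from the exception handling runtime.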

Setting up exception handling

Setting up exception handling on ARM is rather simple. The exception handling ABI defines two tables: the index table in .ARM.exidx and the unwind table in .ARM.extab. The linker script simply has to combine the .ARM.exidx* input sections together and define two symbols, __exidx_start and __exidx_end, around the index.
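
To make the role of the two symbols more concrete, here is a rough sketch of how the index table between them can be interpreted. The entry layout follows my reading of the ARM exception handling ABI (two 32-bit words per entry, with 31-bit place-relative offsets); ExidxEntry, decode_prel31 and exidx_entry_count are names made up for this example. The program itself never has to read the table, since the unwinder shipped with the toolchain does that.

```cpp
#include <cstddef>
#include <cstdint>

// One .ARM.exidx entry: two 32-bit words.
struct ExidxEntry {
    std::uint32_t fn_offset;  // prel31 offset to the start of the function
    std::uint32_t content;    // 1 = cannot unwind, bit 31 set = inline entry,
                              // otherwise prel31 offset into .ARM.extab
};

// Sign-extend a 31-bit place-relative offset to a signed 32-bit value.
std::int32_t decode_prel31(std::uint32_t word) {
    std::uint32_t v = word & 0x7fffffffu;
    if (v & 0x40000000u) {
        v |= 0x80000000u;  // copy bit 30 into bit 31
    }
    return static_cast<std::int32_t>(v);
}

// Number of index entries between the two linker-defined symbols.
std::size_t exidx_entry_count(const ExidxEntry* start, const ExidxEntry* end) {
    return static_cast<std::size_t>(end - start);
}
```

On the target, start and end would be the addresses of __exidx_start and __exidx_end.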

Bare metal environment

The test program runs in a bare metal environment, beginning from the reset interrupt vector. Therefore, it must set up the stack (the linker script defines the sstack symbol as the initial stack value) and zero out the .bss area (the area between the szero and ezero symbols).
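
The .bss clearing step can be sketched as a plain loop. This assumes both symbols are word-aligned; zero_words is a hypothetical helper, and the actual startup code in the repo may well be written in assembly.

```cpp
#include <cstdint>

// Zero every word in [begin, end). On the target this would be called with
// the addresses of the linker-defined szero and ezero symbols.
void zero_words(std::uint32_t* begin, std::uint32_t* end) {
    for (std::uint32_t* p = begin; p != end; ++p) {
        *p = 0;
    }
}
```

This has to run before any C++ code that relies on zero-initialized globals.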

The test program uses newlib as its standard library and defines "standard" do-nothing syscall stubs. The only meaningful syscall here is _sbrk, which is used to obtain more heap memory. Heap space begins at the sheap symbol.
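
What such an _sbrk stub does can be modelled as a bump allocator over a fixed region. The sketch below is a simplified model rather than the stub from the repo: BumpHeap is a hypothetical class, and it signals exhaustion with nullptr, whereas a real sbrk-style call conventionally returns (void*)-1 and sets errno.

```cpp
#include <cstddef>

// Bump allocator over a fixed region; models what an _sbrk stub does.
class BumpHeap {
public:
    BumpHeap(char* begin, char* end) : brk_(begin), end_(end) {}

    // Grow the heap by incr bytes and return the previous break,
    // or nullptr if the region is exhausted (simplified error reporting).
    void* grow(std::size_t incr) {
        if (incr > static_cast<std::size_t>(end_ - brk_)) {
            return nullptr;
        }
        char* prev = brk_;
        brk_ += incr;
        return prev;
    }

private:
    char* brk_;  // current end of the heap
    char* end_;  // hard limit of the region
};
```

On the target, the region would start at the sheap symbol, and the stub would forward its increment argument to grow.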

Also, the ARM architecture offers a choice between wider (4 bytes) and more complex ARM mode instructions and smaller (2 bytes) and simpler Thumb mode instructions. I'll investigate both options in this article.

Simulator

The easiest way I have found to verify that the program works is to run it in QEMU with the versatilepb target. Its UART hardware interface is simple enough to print a few strings to the terminal.
To start the simulation, use this command:
qemu-system-arm -M versatilepb -m 128M -nographic -kernel image file
To end the simulation, press CTRL+a, c to open the QEMU monitor, then type quit.
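
Printing over the UART mentioned above can be sketched as follows. The helper takes the register address as a parameter; on versatilepb, UART0 is a PL011 whose data register is, to my knowledge, mapped at 0x101f1000, but treat that address as an assumption to check against the board documentation. uart_puts is a hypothetical name. QEMU accepts the characters immediately, while real hardware would also need to poll the transmit FIFO status first.

```cpp
#include <cstdint>

// Write a NUL-terminated string to a memory-mapped UART data register,
// one character per store.
void uart_puts(volatile std::uint32_t* data_reg, const char* s) {
    while (*s != '\0') {
        *data_reg = static_cast<std::uint32_t>(static_cast<unsigned char>(*s));
        ++s;
    }
}
```

On the target, the call would look like uart_puts(reinterpret_cast<volatile std::uint32_t*>(0x101f1000), "hello\n"), with the address subject to the assumption above.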

Build toolchain

Straightforward and naive

OK, let's build the toolchain. This script downloads the tarballs and builds the toolchain into the $(pwd)/toolchain directory. It's simple and straightforward. The only thing worth noting is that gcc requires the gmp, mpfr and mpc libraries to be either preinstalled or unpacked into the gcc source tree. In the latter case, the gcc build machinery takes care of compiling the libraries, so I chose this way.

And the benchmark results are:

Image size (bytes)    ARM mode    Thumb mode
No exceptions              429           321
With exceptions         111408         84232
Difference              110979         83911

WOW. Those are really crazy results: enabling C++ exceptions may increase image size by about 110kb!

Optimizations?

Let's think about this again: enabling exceptions increases the image size, so many bytes of code and data get added. Where does that code come from? It comes from gcc's libsupc++ support library. But when was it built and, more importantly, which CFLAGS were used for it?

Well, there are the obscure environment variables CFLAGS_FOR_TARGET / CXXFLAGS_FOR_TARGET, which supply optimization and other compiler flags for the libsupc++ compilation that happens as part of the gcc build process.

Obviously, the toolchain needs to be rebuilt with reasonable optimization flags. The results of the new build are as follows:

Image size (bytes)    ARM mode    Thumb mode
No exceptions              209           189
With exceptions          64556         45484
Difference               64347         45295

The results are almost twice as good now. Also notice that even the code without exceptions benefited from CFLAGS_FOR_TARGET.

However, 64kb is still a big chunk of code. Is it possible to reduce it even more? Let's review the exceptions.syms file, which contains all symbols of the resulting binary. It's easy to notice there that the *printf machinery is pulled into the binary. Function dependency analysis shows the root cause: the std::terminate handler is set to a "verbose terminate handler".

The correct C++ way to change the terminate handler is to use std::set_terminate. Yet this does not help on bare metal targets, because the initial terminate handler would still be pulled in. The only correct method to replace the terminate handler is to override the function pointer at build time. The variable to look for is called __cxxabiv1::__terminate_handler.

Image size (bytes)                     ARM mode    Thumb mode
No exceptions                               209           189
With exceptions                           64556         45484
With exceptions (simple terminate)        17152         12856
Difference                                16943         12667

Much, much better!
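
The build-time override described above can be sketched like this. It relies on a libsupc++ implementation detail rather than a public API, so treat it as an assumption that may not hold for other toolchain versions; simple_terminate is a hypothetical handler, and the one used in the test program may differ.

```cpp
#include <cstdlib>
#include <exception>

// Minimal handler: no message formatting, so no *printf dependency.
[[noreturn]] void simple_terminate() {
    std::abort();
}

// Define libsupc++'s internal handler pointer ourselves, so the library's
// own definition (which points at the verbose handler) is not linked in.
// The name is an implementation detail of libsupc++, not a public API.
namespace __cxxabiv1 {
std::terminate_handler __terminate_handler = simple_terminate;
}
```

Because the program already defines the symbol, the linker has no reason to extract the library object that defines it, and the verbose handler with its printing code stays out of the image.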

Nevertheless, let's review the symbol file again. The C++ exception handling code has also pulled in the heap management functions: symbols like malloc and free are present in the binary. To measure the impact of C++ exceptions alone on binary size, heap allocation for C++ exceptions has to be cut away (do not try this at home!).

Exception allocation and deallocation are controlled by two functions, __cxa_allocate_exception and __cxa_free_exception. Furthermore, __cxa_allocate_exception has somewhat weird semantics: it must allocate more memory than requested via the function parameter and must return a pointer into the middle of the allocated block.
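
The "pointer into the middle of the block" rule can be illustrated with a heap-free allocator over a static pool. This is a model of the idea, not a drop-in replacement: kHeaderSize and kPoolSize are placeholder values (the real header is libsupc++'s internal exception header structure, whose size depends on the toolchain), allocate_exception and free_exception are hypothetical names, and only one exception can be in flight at a time.

```cpp
#include <cstddef>
#include <cstring>

// Placeholder sizes; the real header size comes from libsupc++ internals.
constexpr std::size_t kHeaderSize = 128;
constexpr std::size_t kPoolSize = 512;

// Static pool that replaces the heap for exception objects.
alignas(std::max_align_t) static unsigned char pool[kPoolSize];
static bool pool_in_use = false;

// Reserve header + thrown_size bytes, clear the header, and return a pointer
// just past the header, i.e. into the middle of the block.
// Returns nullptr if the pool is busy or too small.
void* allocate_exception(std::size_t thrown_size) {
    if (pool_in_use || thrown_size > kPoolSize - kHeaderSize) {
        return nullptr;
    }
    pool_in_use = true;
    std::memset(pool, 0, kHeaderSize);
    return pool + kHeaderSize;
}

// Step back over the header to find the block start, then release the pool.
void free_exception(void* thrown_object) {
    unsigned char* block =
        static_cast<unsigned char*>(thrown_object) - kHeaderSize;
    if (block == pool) {
        pool_in_use = false;
    }
}
```

A real replacement would carry the __cxa_allocate_exception and __cxa_free_exception names with C linkage and would have to decide what to do when the pool is exhausted, for example by calling std::terminate.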

Also, operator delete has to be redefined because other functions depend on it. There is no heap anyway, so this isn't a problem. Simulation shows that the program is still operational.

So, the results are:

Image size (bytes)                                 ARM mode    Thumb mode
No exceptions                                           209           189
With exceptions                                       64556         45484
With exceptions (simple terminate)                    17152         12856
With exceptions (simple terminate and no heap)        12032          8816
Difference                                            11823          8627

Perfect! That's nearly a 10x reduction from the first test.

Bugs, cheats and hacks

In this section, I will try unorthodox means to improve the results. They may or may not work in your environment. I will refer to the results of the previous section as the "baseline".

First of all, for some reason the function __cxxabiv1::__is_gxx_exception_class(char*) is duplicated three times in the image file. This (inline) function simply checks the values of 8 consecutive bytes. So, I made a patch that replaces the function (and a similar one) with memcmp:

Image size (bytes)                           ARM mode    Thumb mode
No exceptions                                     209           189
With exceptions (baseline)                      12032          8816
With exceptions (is_gxx_exception patch)        10588          7828
Difference                                      10379          7639

Secondly, there is absolutely no need to support VFP and iWMMXt registers in the unwind code on a softfloat target. Cutting away this dead code is really beneficial:

Image size (bytes)                                         ARM mode    Thumb mode
No exceptions                                                   209           189
With exceptions (baseline)                                    12032          8816
With exceptions (is_gxx_exception patch)                      10588          7828
With exceptions (is_gxx_exception patch, no VFP/iWMMX)         9408          6864
Difference                                                     9199          6675

Lastly, I want to remove support for dynamic exception specifications and std::unexpected, because (a) they are deprecated, (b) they worsen runtime performance and (c) they occupy memory. Yet they are part of the published standard, and therefore their removal may break things.

Also, gcc has the -fno-enforce-eh-specs compiler option for use in such a scenario.

Image size (bytes)                                                  ARM mode    Thumb mode
No exceptions                                                            209           189
With exceptions (baseline)                                             12032          8816
With exceptions (is_gxx_exception patch)                               10588          7828
With exceptions (is_gxx_exception patch, no VFP/iWMMX)                  9408          6864
With exceptions (is_gxx_exception, no VFP/iWMMX, no unexpected)         9308          6760
Difference                                                              9099          6571

Conclusions

  1. It is confirmed that using C++ exceptions may add an extra 100kb on embedded targets. However, a figure that large also means that the toolchain was built badly;
  2. It is easy to reduce the extra footprint to 17kb/13kb (ARM/Thumb, heap management included) by setting proper CFLAGS_FOR_TARGET / CXXFLAGS_FOR_TARGET during the toolchain build and replacing the terminate handler afterwards;
  3. There is still much room for improvement in gcc: roughly 2.7kb/2kb of image size could be removed for some targets.