Device Specific Optimisation

This article discusses what we can do to make the most use of the available computing power from our ARM platform. This entails using the special features provided by the platform.

Unfortunately, going that way also ties the operating system to the platform: it can no longer be moved easily to another platform, because the optimisations will break if the other platform lacks some of the features offered by the previous one.

Thus, device-specific optimisation is always a trade-off between portability and performance.

Generic optimisation

This is called "generic" optimisation because the same method can be applied to every SoC that you have. It is still "device-specific" because every SoC is different - you can't use NEON SIMD instructions on platforms that don't have them.

Kernel optimisation

This is the first step in optimisation. Support for many of the special devices of the SoC is provided at the kernel level, and to make use of them you need to ensure that you enable and compile the required kernel drivers (either as modules or built-in).

Some SoC vendors provide their own forked Linux kernel which contains switches for additional optimisations for their platforms; make sure that you read the documentation (if any) to enable them. Often these optimisations provide a performance boost which can't be obtained in any other way.
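As a rough illustration of checking which drivers a kernel was built with, the sketch below parses a kernel configuration and reports the options that are neither built-in nor modules. It assumes the running kernel exposes /proc/config.gz (which only exists when the kernel was built with CONFIG_IKCONFIG_PROC); the function names are made up for this example.

```python
import gzip


def parse_kernel_config(text):
    """Return a dict mapping CONFIG_* names to their values ('y', 'm', ...)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" are comments: the option is off.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def missing_options(config, wanted):
    """List the wanted options that are neither built-in ('y') nor modules ('m')."""
    return [name for name in wanted if config.get(name) not in ("y", "m")]


def load_running_config(path="/proc/config.gz"):
    """Read the configuration of the running kernel (needs CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, "rt") as fh:
        return parse_kernel_config(fh.read())
```

Calling missing_options(load_running_config(), [...]) with the option names taken from your SoC vendor's documentation gives a quick list of what still needs to be enabled before rebuilding.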

Application build-time optimisation

Compilers can generate code in many different ways. In many cases, a compiler offers optimisation options which only work if a certain hardware feature is available.

For example, a compiler may be instructed to generate instructions that are only available on the ARMv7-A architecture. That way, the compiler can make use of shorter and faster instructions available in ARMv7-A but not in the older ARMv6 or ARMv5 architectures. Another example is the use of a hardware floating-point unit (FPU). If the FPU hardware is available, the compiler can be instructed to generate FPU instructions instead of emulating floating-point operations on the main CPU (which is usually an order of magnitude slower).

Another feature that can speed up operations is the use of Thumb-2 instructions. The Thumb-2 instruction set produces smaller code (it mixes 16-bit and 32-bit instructions) while maintaining the same run-time speed; the smaller code should make the application as a whole run faster because more instructions fit in the cache and less time is spent waiting for instruction fetches - at least in theory.

It pays to know exactly which hardware features your platform provides - the version of the ARM architecture it supports, whether it has an FPU (and exactly which FPU version), and whether it supports additional SIMD instruction sets (NEON, WMMX2, etc) - and to use these as the default settings for all application builds, starting at the very base: glibc and the toolchain itself.
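One quick way to see what a running ARM system reports about itself is /proc/cpuinfo, which on ARM kernels normally carries a "CPU architecture" line and a "Features" line. The sketch below parses that text; the function name is made up for this example, and the exact feature names shown depend on the kernel in use.

```python
def parse_cpuinfo(text):
    """Extract the architecture version and feature flags from /proc/cpuinfo text."""
    info = {"architecture": None, "features": set()}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        if key == "features":
            # Multi-core systems repeat the line once per core; merge them all.
            info["features"].update(value.split())
        elif key == "cpu architecture":
            info["architecture"] = value
    return info


def read_cpuinfo(path="/proc/cpuinfo"):
    """Parse the cpuinfo of the machine this script runs on."""
    with open(path) as fh:
        return parse_cpuinfo(fh.read())
```

On the target board, "neon" in read_cpuinfo()["features"] tells you whether NEON is reported, and likewise for the "vfp" or "thumb" flags.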

FatdogArm is built with the following optimisation:

These are all hardware features that come with the A10 SoC.

In the past, these varied wildly between platforms, but more recently the hardware features have been converging; thus the chance that an application built for one platform will work on another with similar features is getting better all the time. In fact, many modern SoCs support the optimisations that FatdogArm uses above.

Note: the compiler shipped in FatdogArm doesn't generate Thumb-2 instructions by default because Thumb-2 code does not always work, especially when compiling a kernel (the A10 kernel in particular). All is not lost, however: one can easily tell the compiler to generate Thumb-2 instructions on a per-build basis by passing CFLAGS=-mthumb as a build parameter. The Seamonkey 2.20 package, for example, is built this way.

Note2: Thumb-2 instructions are only available from ARMv7 onwards (and on the uncommon ARMv6T2). If you are building for ARMv6 or earlier, using "-mthumb" will make gcc generate Thumb-1 (the original Thumb instruction set) instead, which is *considerably weaker* than the standard ARM instructions and thus needs constant switching between standard and Thumb instructions, slowing down your application. Only use it if you aim to optimise for size, not speed.
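The two notes above boil down to a small decision rule, sketched here as a helper that suggests gcc flags. This is only an illustration: the function name and the mapping from features to flags are assumptions made for this example, not the settings FatdogArm is built with. The flags themselves (-march, -mfpu, -mthumb) are standard gcc ARM options.

```python
def suggest_cflags(arch_version, features, use_thumb=False, optimise_for_size=False):
    """Suggest gcc flags for an ARM target (illustrative mapping only).

    arch_version: ARM architecture version as an integer, e.g. 7.
    features:     set of feature names, e.g. {"vfpv3", "neon"}.
    use_thumb:    request Thumb code; on pre-ARMv7 targets this is only
                  honoured when optimising for size, because gcc would
                  fall back to the weaker Thumb-1 set there.
    """
    flags = []
    if arch_version >= 7:
        flags.append("-march=armv7-a")
    if use_thumb and (arch_version >= 7 or optimise_for_size):
        flags.append("-mthumb")
    # Prefer the most capable floating-point/SIMD unit that is reported.
    if "neon" in features:
        flags.append("-mfpu=neon")
    elif "vfpv3" in features:
        flags.append("-mfpu=vfpv3")
    elif "vfp" in features:
        flags.append("-mfpu=vfp")
    return flags
```

The result can be joined with spaces and passed as CFLAGS to a build. The float ABI (-mfloat-abi) is deliberately left out because it has to match the rest of the system.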

There are many other tips for optimisation of this kind; just use your favourite search engine to find them. Examples:

Platform-specific optimisation

Unlike an ordinary CPU, an SoC (System-on-a-Chip) is called so because, in addition to the CPU, other devices are put together on the same chip that hosts the CPU. These are typically supporting devices needed for proper system operation (e.g. memory controller, bus controller) which, in larger computer systems, are delivered as separate chip packages. This enables an SoC to function as a "system" (= a complete computer) on a chip with minimal or zero additional components.

Modern SoCs contain more than just the minimal devices required for system operation; in a competitive world, SoC manufacturers try to cram as much functionality as possible into the SoC, e.g. cryptographic engines, 3D graphics accelerators, and specialised video processing units (VPU) to accelerate decoding of popular video formats. This is especially useful because ARM is a relatively weak architecture; these extra devices help these SoCs to perform feats which the CPU alone cannot manage.

In our interest to maximise the performance out of the system we have, we need to utilise these additional on-board devices. Unfortunately, our way is full of obstructions:

  1. Many of these devices are under-documented (or not documented at all);

  2. This is compounded by the fact that not many manufacturers provide device drivers for these devices;

  3. Some that do, provide only proprietary binary drivers. No source code means:
    • no bug-fixing is possible (other than at the mercy of the manufacturer)
    • no possible future improvements
    • no usage outside what has been prescribed by the manufacturer (e.g. using GPU to accelerate numerical computation? No way)
    • and lastly, binary drivers tend to become obsolete very quickly because open-source development moves so fast; when they do, we're back to the "no driver" situation;

  4. Some only provide binary drivers for the more popular platform (Android); and the Linux guys are left with the unenviable task of reverse-engineering to get these drivers to work with Linux proper;

  5. Even if the binary libraries are available, applications may need patching to be able to use them (or worse - forks or private copies of the application altogether). Mainstream / upstream applications that can use these special accelerations are still rare (no doubt because of point #1 above).

  6. And lastly - every SoC (even from the same manufacturer) has different devices (or the same devices but different programming APIs), making optimisation highly customised to one SoC only.

All in all, the situation is not good; and even in the best case, using these optimisations means we are then tied to a single SoC. But it is not possible to completely ignore these devices: the ARM CPU is simply not powerful enough to deliver capabilities that we expect it to (decoding and playing 1080p video, playing 3D games, etc).

Thus, using these optimisations is a trade-off. You need to decide which is more important: portability or performance. This is true in many cases but it is especially severe in the ARM world. And if you choose performance, you must make sure it is worth it: check that the target application (media player etc) can really use the accelerated libraries.

In the next section I will highlight some of the more common platform optimisations, using the features available in the A10 SoC as examples. I will not go into depth because they are just examples and highly specific (the knowledge will not transfer to another platform). If you want to know more of the details for the A10 SoC, visit http://linux-sunxi.org/Binary_drivers.


Platform-specific video driver

This is the first thing to do. While the generic framebuffer video driver will work fine, it is never geared towards performance. A platform-specific video driver almost always provides better performance than its generic counterpart: besides the obvious benefit that it can utilise the platform's 2D and 3D functions (if any), it also provides additional X extensions that the generic driver doesn't have, such as the Xvideo extension, Xshm, hardware cursors, and perhaps others.

A10's platform-specific video driver is called xf86-video-sunxifb. This driver, along with the 2D acceleration (see below), enables A10 to play SD video upscaled to full-screen 720p with only 40% of CPU utilisation.

As a comparison, the generic driver will eat 100% CPU time trying to do the same feat and still fails (the video plays like page flipping).
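CPU figures like these can be approximated by sampling the aggregate "cpu" line of /proc/stat before and after a playback run. The sketch below is one way to do that and is not necessarily how the numbers above were obtained; the function names are made up for this example.

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as integers."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilisation(before_text, after_text):
    """Percentage of non-idle CPU time between two /proc/stat snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    # Only user, nice, system, idle, iowait, irq, softirq and steal are summed;
    # the later guest fields are already counted inside user/nice.
    deltas = [b - a for a, b in zip(before, after)][:8]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


def snapshot(path="/proc/stat"):
    """Read the current contents of /proc/stat."""
    with open(path) as fh:
        return fh.read()
```

Take one snapshot() when the video starts, another some seconds later, and pass both to cpu_utilisation() to compare the generic and platform-specific drivers on the same clip.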

The video driver is used by, and affects, all graphical applications (video-related or not) - providing smoother scrolling, more responsive window dragging, etc - so this is the biggest bang for the buck. If you only have time to do one optimisation, this is the one to go for.

FatdogArm ships with the sunxifb video driver enabled as the default driver.

2D accelerator

A 2D accelerator provides functions to speed up common 2D graphics operations (line drawings, the blitting operations, etc). A10 has a simple 2D accelerator called G2D which accelerates blitting operations. To use it, one must do two things:

Once enabled, the acceleration will usually be used by the platform-specific video driver; thus it will also be immediately available to all graphical applications without further modification. This makes it another worthwhile effort to get going.

FatdogArm's kernel is built with the G2D driver built in, and the sunxifb driver is included and used by default.

3D Graphics accelerator

A 3D accelerator provides functions to speed up 3D drawing operations. In modern systems the 3D accelerator is usually provided by a Graphics Processing Unit (GPU).

A10 comes with a Mali-400 GPU. To utilise this GPU, certain things must be done:

This 3D functionality is exposed through standard OpenGL ES and EGL APIs; in theory any application that uses these libraries will immediately benefit from the acceleration without further changes.

However it is important to remember that OpenGL ES ("Embedded Systems") is not the same as the full OpenGL stack normally used on the desktop; and programs configured to exploit OpenGL will not benefit from OpenGL ES accelerations at all.

I have built all the components for 3D acceleration, including the libUMP-aware sunxifb driver; however, the test failed to run due to a glibc issue. It could be fixed by 'upgrading' glibc, but doing so would require (many hours of) recompilation of many of the base packages - glibc, gcc and others; something that I'm not prepared to do at the moment, considering the minimal return on the effort.

Video Processing Unit (VPU)

A VPU is a module which helps to accelerate video processing functions, usually to provide hardware decoding of certain video formats.

A10 has a VPU called CedarX, which, amongst others, enables hardware decoding of H.264, MPEG-4 and MPEG-2 video. Using CedarX, it is possible to play HD videos up to 2160p resolution.

Unfortunately the support for this is patchy; there is a Linux binary library (libve) but it is buggy, forcing people to use the Android version of the library instead (loaded through libhybris).

In addition, this is not a standard library and applications (=mainly media players) need to be patched to work with this. So far the only support I've seen is from a forked version of VLC.

Ref: http://linux-sunxi.org/Category:CedarX

I will attempt to build Cedar-enabled VLC later.

Cryptographic accelerator

A cryptographic accelerator provides hardware assist for generation of various common message digests and encryption/decryption. This is usually used to make decryption faster (especially on DRM-encoded video streams).

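A10 features a cryptographic accelerator for encryption and decryption with AES, DES and 3DES, and for computing SHA-1 and MD5 hashes. (Allwinner calls this block the Security System; TrustZone is a separate ARM security extension, not a crypto engine.)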

Unfortunately there doesn't seem to be any library or code which can be used to accelerate common operations in libraries such as openssl or nss at the moment.


Utilising available peripherals

These are not strictly optimisations, but since they touch the same subject of making the most use of what is offered by the platform / SoC, they are listed here as well.

Network devices

This is usually easy, as many platforms release the Ethernet and wireless network drivers with their kernel source release; so it is a matter of choosing the right driver and building it.

In the Mele, the wireless network device is 8192cu (a special version of rtl8192cu connected through USB with built-in firmware).

Mele IR remote control

The Mele A1000 comes with a remote control. As it turns out, the remote control is handled by the sun4i-ir kernel module. Once the module is loaded, the remote will appear as a standard Linux keyboard evdev device and will generate scancodes and keycodes just like any other keyboard.

From here, it is just a matter of using udev keymap rules to transform the keycodes into something which is more usable.
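Before writing keymap rules it helps to see which keycodes the remote actually sends. Because it shows up as an ordinary evdev device, the raw records can be decoded with a few lines of Python. This is a minimal sketch: the device node /dev/input/event0 is a placeholder (yours will differ), and the function names are made up for this example.

```python
import struct

# struct input_event from <linux/input.h>: a timeval (two native longs),
# followed by a 16-bit type, a 16-bit code and a signed 32-bit value.
EVENT = struct.Struct("llHHi")
EV_KEY = 0x01
KEY_STATES = {0: "release", 1: "press", 2: "repeat"}


def decode_key_events(raw):
    """Yield (keycode, state) for every EV_KEY record in a byte string."""
    usable = len(raw) - len(raw) % EVENT.size
    for offset in range(0, usable, EVENT.size):
        _sec, _usec, ev_type, code, value = EVENT.unpack_from(raw, offset)
        if ev_type == EV_KEY:
            yield code, KEY_STATES.get(value, "unknown")


def watch(device="/dev/input/event0"):  # placeholder device node
    """Print key events from an evdev device until it stops delivering data."""
    with open(device, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(EVENT.size)
            if len(chunk) < EVENT.size:
                break
            for code, state in decode_key_events(chunk):
                print(code, state)
```

Running watch() as root and pressing each button on the remote gives the list of keycodes to remap.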

Tablet touchscreen input

Please refer to the TouchScreenInput article.
The key problem here is to obtain the touchscreen kernel driver - it is not always available.

Tablet hardware buttons

Some tablets come with hardware buttons (volume up/down, power-on, etc). Sometimes these buttons are accessible - again it depends on the availability of the kernel driver. One of my tablets uses sun4i-keyboard and it produces the standard XF86AudioRaiseVolume, XF86AudioLowerVolume and XF86PowerOff keysyms when pressed (the same keys can be seen from the console using showkey too).

Tablet LCD brightness

I have not found any driver for controlling the LCD backlight on my tablet.

The strength of the backlight on many LCD panels is usually controlled by PWM, either directly by the SoC (using the SoC's PWM output) or indirectly (the SoC connects its GPIO pins to a PWM controller which then controls the LCD light source).

Exactly how this is done depends on the board/hardware design; in theory it is easy to write a driver which interfaces with the kernel's video backlight infrastructure, but doing so requires intimate knowledge of the hardware.
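For reference, this is roughly what user space gets once such a driver exists. Drivers that register with the kernel's backlight class expose a directory under /sys/class/backlight containing max_brightness and brightness files. The sketch below assumes such a driver is present (which is not the case on my tablet); the function names are made up for this example.

```python
import os


def percent_to_raw(percent, max_brightness):
    """Map a 0-100 percentage onto the driver's 0..max_brightness range."""
    percent = max(0, min(100, percent))
    return max_brightness * percent // 100


def set_brightness(device_dir, percent):
    """Write a brightness level to one directory under /sys/class/backlight.

    device_dir is the directory of a single backlight device; its name
    depends on the driver, so it has to be looked up on the actual system.
    """
    with open(os.path.join(device_dir, "max_brightness")) as fh:
        maximum = int(fh.read().strip())
    raw = percent_to_raw(percent, maximum)
    with open(os.path.join(device_dir, "brightness"), "w") as fh:
        fh.write(str(raw))
    return raw
```

Writing to the brightness file normally requires root, so a script like this would usually be run from a privileged helper.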

Checking battery status

Unlike in the x86 world, there is no APM or ACPI and thus the battery status isn't available through ACPI interfaces. Different platforms have different strategies for conveying battery status information to applications (mainly determined by which power-management chip they use).

In A10 the main power-management chip is usually an AXP209 and there is a driver for it, which conveys battery power status change through udev events. This can be obtained either by using libudev or by creating udev rules.
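Besides libudev and udev rules, the same properties that travel in the udev events can usually be read directly from sysfs, provided the driver registers with the kernel's power_supply class. The sketch below assumes that is the case; the supply name "battery" is a placeholder that must be checked on the actual system, and the function names are made up for this example.

```python
def parse_uevent(text):
    """Turn KEY=value lines (as found in a sysfs uevent file) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def battery_summary(props):
    """Pick out the charge state and percentage from power_supply properties."""
    capacity = props.get("POWER_SUPPLY_CAPACITY")
    return {
        "status": props.get("POWER_SUPPLY_STATUS", "Unknown"),
        "percent": int(capacity) if capacity is not None else None,
    }


def read_battery(name="battery"):  # placeholder supply name
    """Read the current battery state from /sys/class/power_supply."""
    path = "/sys/class/power_supply/%s/uevent" % name
    with open(path) as fh:
        return battery_summary(parse_uevent(fh.read()))
```

A udev rule that matches the power_supply subsystem can then call a small script like this whenever the status changes, instead of polling.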