Performance Optimizations

This section lists a few issues to consider to achieve optimal performance with On Time RTOS-32 applications.

SMIs

System Management Interrupts (SMIs) are special interrupts used by the BIOS to emulate hardware. SMIs have a higher priority than standard interrupts and can thus increase the overall interrupt latency. It is thus recommended to disable all BIOS services which use SMIs internally. For example, if the target has an AHCI disk controller and the BIOS has an option to operate this controller either in IDE or AHCI mode, AHCI is recommended as the IDE emulation would require SMIs.

USB legacy emulation for USB keyboards, mice, and storage devices is also implemented using SMIs. In this case, the initialization of RTUSB-32 disables the emulation and all associated SMIs. For applications which do not use RTUSB-32, all USB legacy emulation should be disabled in the BIOS.

Keep Alive Packets of the Cross Debugger

When cross debugging over Ethernet, the host debugger must send Keep Alive packets to the Debug Monitor periodically to keep the connection up and running through a firewall. When the Monitor receives a packet, it suspends the application, processes the packet, and then continues the application. This can take up to about 1 millisecond. During this time, the application is stopped and is unable to process interrupts.

Sending Keep Alive packets can be disabled in Rttarget.ini, but it may require adding a firewall rule. Please see the documentation in section Rttarget.ini for details.

Cache Line Aligned Data

Most x86 CPUs have a cache line size of 64 bytes. PCI[e] transfers are most efficient if they address a complete cache line without crossing cache line boundaries. Since many On Time RTOS-32 drivers allocate DMA buffers from the heap, applying RTTarget-32 flag RT_HEAP_MIN_BLOCK_SIZE_64 can improve performance on some systems, though it may require more heap space. For RTIP-32, all CFG_PACKET_SIZEx values should be multiples of 64 (which they are by default).
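
The alignment rule above can be checked with a few lines of C. The following is only a minimal sketch: the names CACHE_LINE_SIZE, RoundUpToCacheLine, and IsCacheLineAligned are hypothetical helpers and not part of the RTOS-32 API, and the 64 byte line size is the assumption stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 byte cache lines, as on most x86 CPUs. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: round a buffer size up to the next multiple
   of the cache line size. */
static size_t RoundUpToCacheLine(size_t Size)
{
   assert(Size <= SIZE_MAX - (CACHE_LINE_SIZE - 1));   /* no overflow */
   return (Size + CACHE_LINE_SIZE - 1) & ~(size_t)(CACHE_LINE_SIZE - 1);
}

/* Hypothetical helper: return 1 if the buffer starts on a cache line
   boundary and covers only complete cache lines, 0 otherwise. */
static int IsCacheLineAligned(const void * Buffer, size_t Size)
{
   return ((uintptr_t) Buffer % CACHE_LINE_SIZE) == 0 &&
          (Size % CACHE_LINE_SIZE) == 0;
}
```

For example, a 1514 byte Ethernet frame buffer would be rounded up to 1536 bytes (24 cache lines) by RoundUpToCacheLine.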

RTKernel-32's Standard and Debug Versions

RTKernel-32's Standard Version is faster and delivers lower interrupt latencies than the Debug Version. However, it is very important that the Debug Version is used during software development. For details, please see section RTKernel-32's Debug Version.

Thread Priorities

Applications should use as few priorities as possible. When several threads in the state Ready have the same priority, the RTKernel-32 scheduler can decide which one will run, and it will select the most efficient one, minimizing the scheduler's overhead.

Avoid Time Slicing

Time Slicing should not be used in real-time applications. On multiprocessor systems in particular, its use adds a lot of overhead (CPU time in the timer interrupt, many additional IPIs, etc.). For further details, please see section Avoid Time Slicing in the RTKernel-32 Programming Manual.

Avoid RTKSetCPUMask and SetThreadAffinityMask

Functions RTKSetCPUMask and SetThreadAffinityMask reduce the number of CPUs available for threads, which reduces performance in most cases. The documentation of function RTKSetCPUMask explains this in more detail.

The recommended way to distribute threads onto different CPUs is by assigning hardware interrupts to different CPUs through function RTMPBalanceINTCPUs. Many threads are activated through interrupts, and by default, an activated thread will be scheduled to run on the same CPU on which the interrupt came in. However, if the activated thread is not allowed to run on the local CPU, an IPI must be sent from the CPU of the ISR to the CPU which is to run the thread, delaying the task activation significantly.

Network Throughput

If the Debug Monitor and the application share the same network interface, network throughput can be degraded by 10-20%. For best performance, separate network interfaces for the Monitor and the application are recommended.

By default, RTIP-32's functions send and sendto used with a socket in blocking mode will wait until all sent data has been acknowledged by the receiver, which slows down sending. For best send performance, operate the socket in non-blocking mode or set config parameter CFG_TCP_SEND_WAIT_ACK to 0.

TCP connections have a window which defines how much data may be transferred before the sender must wait for an acknowledgement. If the network bandwidth is high but the turn-around-time is long, a larger window may be required. Function xn_interface_opt, options IO_OUTPUT_WINDOW and IO_INPUT_WINDOW can be used to adjust the default TCP window sizes. The window sizes can also be set on a per socket basis using function setsockopt, options SO_INPUT_WINDOW and SO_OUTPUT_WINDOW. However, large TCP windows need more buffer space (CFG_NUM_PACKETS3) and thus more heap space.
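
As a rough guide for choosing a window size, the amount of data in flight on a connection is approximately the bandwidth multiplied by the round trip time (the bandwidth-delay product). This is a general rule of thumb and not taken from the RTIP-32 documentation. The sketch below only performs this arithmetic; function TCPWindowBytes is a hypothetical helper, and its result would still have to be passed to xn_interface_opt or setsockopt as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimate the TCP window (in bytes) needed to keep
   a link busy, using the bandwidth-delay product.
   BitsPerSecond:         link bandwidth in bits per second
   RoundTripMicroseconds: turn-around time in microseconds */
static uint64_t TCPWindowBytes(uint64_t BitsPerSecond, uint64_t RoundTripMicroseconds)
{
   /* guard against overflow of the 64 bit product */
   assert(RoundTripMicroseconds == 0 ||
          BitsPerSecond <= UINT64_MAX / RoundTripMicroseconds);

   /* bits * microseconds -> bytes: divide by 8 and by 1000000 */
   return BitsPerSecond * RoundTripMicroseconds / 8u / 1000000u;
}
```

Under these assumptions, a 1 Gbit/s link with a 2 millisecond round trip time would need a window of about 250000 bytes, while a 100 Mbit/s link with a 500 microsecond round trip time would need only about 6250 bytes.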

Jumbo Frames can also help to improve the throughput of network connections. For TCP connections, the window size should be at least 8 times the MSS (which is 54 bytes less than CFG_MAX_PACKETSIZE). Checksum Offloading can also be used to reduce the CPU load, in particular when Jumbo Frames are being used. Function xn_interface_opt, option IO_CHKSUM_OFFLOAD can enable Checksum Offloading.
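
The relation between the packet size, the MSS, and the minimum window size can be worked out as follows. This is a minimal sketch of the arithmetic only: functions MSSFromPacketSize and MinWindowSize are hypothetical helpers, not RTIP-32 functions, and the packet size is passed in as a plain parameter standing in for the configured CFG_MAX_PACKETSIZE.

```c
#include <assert.h>

/* From the text: the MSS is 54 bytes less than CFG_MAX_PACKETSIZE. */
#define HEADER_OVERHEAD 54u

/* Hypothetical helper: MSS for a given maximum packet size. */
static unsigned MSSFromPacketSize(unsigned MaxPacketSize)
{
   assert(MaxPacketSize > HEADER_OVERHEAD);
   return MaxPacketSize - HEADER_OVERHEAD;
}

/* Hypothetical helper: smallest recommended TCP window, 8 times the MSS. */
static unsigned MinWindowSize(unsigned MaxPacketSize)
{
   return 8u * MSSFromPacketSize(MaxPacketSize);
}
```

For example, assuming a maximum packet size of 9014 bytes, the MSS would be 8960 bytes and the window should be at least 71680 bytes. With a standard 1514 byte packet size, the MSS would be 1460 bytes and the minimum window 11680 bytes.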

Web, FTP, and SMB Server Throughput

In addition to all points described in the preceding section Network Throughput, the RTIP-32 I/O Accelerator should be initialized at program startup to speed up the Web, FTP, and SMB servers. For details, please see function xn_ioacc_init.

Energy Saving

When the CPU has nothing to do, it should be placed in the Halt State, which is entered when the HLT instruction is executed. The RTKernel-32 Idle Tasks do this if preemptions are enabled and RTKConfig.DriverFlags DF_IDLE_HALT is set.

However, the CPU Monitor will replace the Idle Tasks with its own if function CPUMonitorStart must use method CPU_COUNTER_TASK. This is the case by default if the RTKernel-32 Standard Version is linked, as its default RTKConfig.Flags does not have flag RF_TCPUTIME set, which the CPU Monitor needs to implement method CPU_IDLE_TASK. It is thus recommended either not to start the CPU Monitor in the release build or to set RF_TCPUTIME in RTKConfig.Flags.

Example:

RTKConfig.Flags       |= RF_PREEMPTIVE;   // needed only with single CPU kernel 
RTKConfig.DriverFlags |= DF_IDLE_HALT;    // ditto 
RTKernelInit(0); 
if (RTKDebugVersion()) 
   if (CPUMonitorStart(CPU_IDLE_TASK) != CPU_IDLE_TASK) 
      Error("CPU Monitor running in wrong mode!\n"); 

// or: 

RTKConfig.Flags       |= RF_PREEMPTIVE | RF_TCPUTIME; 
RTKConfig.DriverFlags |= DF_IDLE_HALT; 
RTKernelInit(0); 
if (CPUMonitorStart(CPU_IDLE_TASK) != CPU_IDLE_TASK) 
   Error("CPU Monitor running in wrong mode!\n");

Please note that setting flag RF_TCPUTIME does have a negative impact on performance as it requires the kernel to call FTReadTime in each task switch. The degree of performance loss depends on the speed of the used High Resolution Timer driver. Demo program RTBenchP measures the performance of function FTReadTime.
