Advances in sensor technologies available now, and predicted over the next few years, are providing engineers with the ability to implement sophisticated image processing systems featuring increased resolutions, bandwidths, functionality and data rates.
Traditionally, these sensor systems have been supported using a variety of conventional processing technologies such as RISC and DSP processors, or expensive and inflexible custom solutions such as ASICs. As the aggregate throughput requirements for image processing systems approach the GOPS range, these types of processing technologies become unsuitable, since performance targets can only be met by concatenating processing blocks in a pipeline architecture. This incremental approach to boosting system performance has limitations, and is often an unacceptable solution, particularly for airborne applications that are constrained by factors such as size, weight, power and environmental conditions. Even assuming that these problems are overcome, or are at least manageable, applications demanding real-time operation cannot function with system latencies of the order of seconds. Compromises are possible through reductions in frame rate and image resolution; however, these sacrifices only degrade overall system performance, and do not address the underlying problem. There is, therefore, a requirement to utilise new types of processing platforms featuring flexible, low-latency, high-performance devices such as field programmable gate arrays (FPGAs) in order to keep pace with the increasing rates at which data is sensed at the front end of the system.
The majority of image processing algorithms involve the use of mathematical functions executing repetitively upon sample data. This data can comprise individual pixels, groups of pixels, or entire image frames supplied to a processing device as part of a data stream.
Until recently, the choice of processing technology for such an application has been limited to a range of microprocessors, with the computational capabilities of each new generation of microprocessor increasing steadily in accordance with Moore’s Law. However, even using the latest generation of microprocessor, the underlying performance limitations of this technology still remain. Take, for example, the implementation of a typical image processing operation such as convolution. This involves multiplying a window of pixels by a matrix of coefficients and summing the products, resulting in many multiplications per image pixel. Using a modest 5x5 window (25 multiplications per output pixel) and a conventional DSP processor, a latency of several clock cycles per pixel is incurred, since the overall performance of the processor is limited by the number of multiplications that can be performed in parallel. On-chip processor memory is usually too small to buffer a full image frame, and so external memory reads and writes are required to complete a single calculation. This performance bottleneck becomes a problem when larger windows are implemented and higher frame rates and resolutions are used to achieve better imaging accuracies.
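To make the cost of the windowed operation concrete, the following is a minimal Python sketch of a direct 2D convolution that also counts the multiplications performed. It is an illustration only, not the implementation used in any system described here; the function name convolve2d_valid and the choice to compute only the region where the window fits entirely inside the image are assumptions made for the example.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution over the region where the kernel fits entirely.

    Returns (output, multiply_count). Hypothetical helper for illustration.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    multiplies = 0
    for r in range(oh):
        for c in range(ow):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # The kernel is flipped in both axes, as convolution requires.
                    acc += image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
                    multiplies += 1
            out[r][c] = acc
    return out, multiplies


# An 8x8 test image and a 5x5 window of ones.
image = [[1] * 8 for _ in range(8)]
kernel = [[1] * 5 for _ in range(5)]
out, multiplies = convolve2d_valid(image, kernel)
per_pixel = multiplies // (len(out) * len(out[0]))
print(per_pixel)  # 25 multiplications per output pixel
```

A processor that can issue only a few multiplications per clock cycle must spend several cycles on each output pixel, whereas a hardware pipeline with 25 multipliers could, in principle, produce one result per clock.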
Take, for example, the TigerSHARC DSP processor from Analog Devices. This device features an I/O bandwidth of 1800 Mbytes/s (including memory) and a peak floating point performance of 1500 MFLOPS. A Xilinx Virtex-II Pro X FPGA, by comparison, is capable of an I/O bandwidth of 37500 Mbytes/s, and a peak floating point performance of 25000 MFLOPS.
Unsurprisingly, processing technologies such as FPGAs are an attractive solution for many of the computing challenges associated with high performance image processing systems. The reconfigurable array of logic blocks, memories and multipliers provided within FPGAs by vendors such as Altera and Xilinx offers a high performance hardware architecture ideal for building processing pipelines operating at hundreds of MHz. Take, for example, the new Virtex-4 range of FPGAs from Xilinx. Manufactured using the latest 90nm processing technology, they provide users with 500 MHz XtremeDSP slices delivering an aggregate DSP performance of 256 GigaMACs per second. High accuracy Digital Clock Managers, reconfigurable synchronous dual-port static BRAM and FIFOs provide the necessary clock management and memory resource required to implement high performance algorithms. Continuing the trend set by the previous generation of Virtex-II Pro FPGAs, the Virtex-4 features 32-bit RISC PowerPC processors delivering in excess of 1300 Dhrystone MIPS.
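The aggregate figure quoted above can be related to the degree of parallelism it implies. The short Python calculation below is a sketch that assumes each slice completes one multiply-accumulate per clock cycle; the function name implied_parallel_macs is hypothetical and the result is an idealised peak, not a measured figure.

```python
PEAK_MACS_PER_SECOND = 256e9  # quoted aggregate DSP performance
SLICE_CLOCK_HZ = 500e6        # quoted XtremeDSP slice clock rate


def implied_parallel_macs(peak_macs_per_second, clock_hz):
    """Multiply-accumulates per clock cycle needed to reach the peak rate.

    Assumes one MAC per slice per cycle (an assumption of this sketch).
    """
    return peak_macs_per_second / clock_hz


print(implied_parallel_macs(PEAK_MACS_PER_SECOND, SLICE_CLOCK_HZ))  # 512.0
```

Under that assumption, the quoted performance corresponds to roughly 512 multiply-accumulate units working in parallel on every clock cycle.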
As a result, FPGAs are now being used as DSP engines. Although today’s DSP processors boast high levels of performance, they can’t compete against FPGAs for specialised computing. FPGAs can be configured with a custom hardware design, implementing control logic in the hardware and saving precious clock cycles per calculation.
Innovation and state-of-the-art silicon processing techniques, such as those used for the new Virtex-4 range of devices, have dramatically improved the functionality and capability of the FPGA over the past six years, allowing them to be used in a wide variety of applications typically dominated by microprocessors or expensive and inflexible ASICs.
One of the difficulties for engineers and scientists wanting to use FPGA technology to help improve the performance of their applications has been the limited availability of flexible, scalable COTS products supporting the latest FPGAs and design tools.
There have been a number of modular standards used over the last few years that have supported new generations of processing technologies, including FPGAs; however, they have limitations when used in real-time processing applications. Firstly, there are those based around specific microprocessors, for example the TIM-40 from Texas Instruments and SHARCPAC from Analog Devices. The main difficulty with this category is that the system engineer has to constrain the capability of supported FPGAs in order to emulate a microprocessor interface, which restricts the superior I/O bandwidth of the FPGA. Secondly, there are the microprocessor-neutral module standards. One of the most popular is the PCI Mezzanine Card (PMC). Unfortunately, this is still principally designed with microprocessor-based systems in mind. It is also, perhaps more seriously, based around a non-deterministic bus communications system with variable latency. This again implies constraining the FPGA to a less than optimum solution. In addition, significant parts of the FPGA real estate must be dedicated to handling the non-determinism. Within a real-time system it is critical that bandwidths and latency can be guaranteed. Using this type of module means, practically, that this cannot be done.
In order to address these problems, and present a processing platform that truly exploits the strengths of FPGA technology, Nallatech has developed a range of COTS plug and play motherboards and modules supporting the latest Virtex-II and Virtex-II Pro FPGAs from Xilinx. The high-performance ‘DIME-II’ architecture is an open standard incorporating system level intelligence features such as temperature, voltage and current monitoring, and guarantees a module to motherboard bandwidth of up to 8 GBytes/sec (over 15 times the theoretical maximum performance of 64-bit/66 MHz PMC).
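The bandwidth comparison can be checked with simple arithmetic. The Python sketch below assumes a parallel bus that moves one full-width word per clock cycle, a nominal 66 MHz clock, and 1 GByte taken as 10^9 bytes; the function name is hypothetical.

```python
def parallel_bus_peak_bytes_per_second(width_bits, clock_hz):
    """Theoretical peak of a parallel bus moving one word per clock cycle."""
    return width_bits // 8 * clock_hz


# 64-bit bus at a nominal 66 MHz (assumed values for the PMC case).
pmc_peak = parallel_bus_peak_bytes_per_second(64, 66_000_000)
dime2_peak = 8_000_000_000  # quoted 8 GBytes/sec, taking 1 GByte as 10**9 bytes

print(pmc_peak)               # 528000000 bytes/s, i.e. 528 Mbytes/s
print(dime2_peak / pmc_peak)  # a ratio slightly above 15
```

The theoretical PMC figure is a ceiling that a shared, non-deterministic bus will not sustain in practice, so the ratio understates the difference for real-time streaming.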
Nallatech recently undertook a project to design a complex, multi-board, real-time image processing system with a mass storage interface using FPGA technology. The application called for the system to be deployed on a commercial aircraft operating at high altitude, with a high-resolution camera being used to capture the effects of atmospheric turbulence. This raw data was to be processed, formatted and stored for later analysis.
The intention was to upgrade the system at a later date and use the high resolution video data to drive a decision engine that would control the aircraft's avionic systems. This would allow for a smoother flight and better fuel efficiency.
The size, weight and power constraints imposed by the operating environment immediately ruled out the use of certain types of form factors and technologies. The computing power required to process the high-resolution data in real time would have translated into multiple server racks of conventional CPUs, an impractical solution in this case.
The decision to use an FPGA-based processing system was taken early in the project. FPGAs were the only available technology offering the performance, speed and density for the task in hand. An ASIC solution was considered too expensive, with lengthy development timescales and no flexibility, should user requirements change. The Nallatech BenNUEY-PC104+ DIME-II carrier card was selected as the main processing platform for the system. The PC104plus form factor satisfied the physical and mechanical constraints of the application, while the scalability and flexibility of the high bandwidth DIME-II architecture allowed the system to be tailored through the support of plug-and-play DIME-II COTS modules.
The optical interface to the aircraft’s high-resolution onboard camera was handled by a DIME-II module called the ‘BenHOTLINK’ that consisted of a Cypress Hotlink transceiver chip closely coupled to a Xilinx Virtex-II 6000 FPGA. The proximity of the FPGA to the front end of the system provided a reconfigurable, low-latency processing block that was able to perform massively parallel DSP calculations. The embedded 18-bit multipliers and dual port BRAM of the Virtex-II device were an ideal resource for the image processing algorithms being used. The same functionality implemented using DSP processors would have resulted in an I/O bottleneck, and latencies preventing real-time processing.
The Xilinx 2v6000 FPGA situated on the BenNUEY-PC104+ carrier card was used to format the processed image data from the BenHOTLINK FPGA. Eight Mbytes of fast access ZBT SRAM memory attached directly to the BenNUEY’s FPGA was used to buffer the data while it was serialised and transmitted over high speed LVDS links to a bank of four SCSI hard drives, providing a total storage capacity of one Terabyte. The data capture and format section of the system operated at 80 MHz, with the serial links running at 200 MHz.
The majority of the design was written in VHDL in order to achieve the high levels of system performance required for real-time operation. Samples of RTL code were simulated with testbenches using Aldec simulation tools before being synthesised and implemented onto the hardware. Xilinx’s ChipScope ILA tool was used to capture and debug the more complicated, timing-sensitive parts of the design.
Using existing COTS hardware to build the front end of the system and the secondary processing/data formatting section helped significantly reduce development timescales; however, a custom module was required to allow the direct interface of the SCSI hard drives.
A Xilinx 2v1000 FPGA was situated at the end of each of the high-speed LVDS links, with a 32-bit Xilinx MicroBlaze embedded processor programmed to deal with the asynchronous data transfer to and from the disk, as well as the packet handling and interpretation. The hardware/software partitioning allowed the SCSI interface to be implemented at low speed using asynchronous data transfers. Once this was working successfully for the specific data read and write packets, the actual writing and reading of the data from the disk was carried out using dedicated FPGA fabric to allow support for much faster synchronous data transfer modes. This is a perfect example of the capability and flexibility of FPGAs in embedded systems. This approach to hardware and software partitioning allowed the system to be implemented and tested on the target hardware far earlier than normal. The flexibility of the FPGAs allowed sections of the design to be optimised without physically altering the hardware, while the availability of the spare DIME-II module slots on the BenNUEY-PC104+ offered the customer the option of scaling the system to support additional SCSI disks.
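The partitioning principle described above can be modelled in a few lines of software. The following Python sketch is purely illustrative and is not the design used in the project: the packet type, the opcode names and the PartitionedInterface class are all hypothetical. It shows only the routing decision, with bulk read and write packets sent to a fast path standing in for dedicated FPGA fabric and everything else handled by a slower general-purpose path standing in for the embedded processor.

```python
from dataclasses import dataclass

# Hypothetical opcodes for the high-volume transfers moved into fabric.
DATA_OPCODES = {"READ_BLOCK", "WRITE_BLOCK"}


@dataclass
class Packet:
    opcode: str
    payload: bytes = b""


class PartitionedInterface:
    """Software model of a hardware/software split (illustration only)."""

    def __init__(self):
        self.software_log = []  # packets interpreted by the processor model
        self.fabric_bytes = 0   # bytes moved by the fast-path model

    def handle_in_software(self, packet):
        # General packets: flexible handling, lower throughput.
        self.software_log.append(packet.opcode)
        return "software"

    def handle_in_fabric(self, packet):
        # Specific data packets: fixed function, high throughput.
        self.fabric_bytes += len(packet.payload)
        return "fabric"

    def dispatch(self, packet):
        if packet.opcode in DATA_OPCODES:
            return self.handle_in_fabric(packet)
        return self.handle_in_software(packet)


iface = PartitionedInterface()
for pkt in (Packet("INQUIRY"), Packet("WRITE_BLOCK", bytes(512)), Packet("STATUS")):
    print(pkt.opcode, "->", iface.dispatch(pkt))
```

Starting with every packet type on the software path and migrating only the proven, high-volume cases to the fast path mirrors the staged approach described in the text, and lets the interface be exercised on target hardware before the fast path exists.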
In the longer term, system performance can be improved by utilising the embedded IBM PowerPC processors and Multi-Gigabit Transceivers featured in the Virtex-II Pro and Virtex-4 FX FPGAs from Xilinx. Instead of using a ‘softcore’ processor such as MicroBlaze (which uses FPGA logic resource such as BRAM), the embedded PowerPCs operating at 300 MHz would provide a higher performance, fixed silicon solution. Furthermore, the LVDS links could be upgraded to SATA connections to each of the SCSI drives. The same principles of partitioning would apply, with general packets handled in software on the PPC core and specific packets handled by dedicated FPGA logic written in an HDL.
Dr Malachy Devlin, Craig Petrie, Craig Sanderson, and Derek Stark are with Nallatech. For more information, visit www.nallatech.com