Energy Efficiency Exploration on the ZYNQ Ultrascale+

Roberto Giorgi¹, Farnam Khalili¹,² and Marco Procaccini¹

¹Department of Information Engineering and Mathematics - University of Siena
²Department of Information Engineering - University of Florence
{giorgi,khalili,procaccini}@dii.unisi.it

Abstract—In the context of Cyber-Physical Systems (CPSs), Single Board Computers (SBCs) can provide adaptivity for various present and future applications, and permit scalability through clusters of SBCs while possibly reducing energy consumption. In this paper, we explore the energy efficiency of a Zynq Ultrascale+ based board developed in the context of the AXIOM project. While an entire framework based on the Zynq Ultrascale+ is still in progress, the board is already available and capable of running a full Linux OS, and its energy consumption can be measured. We demonstrate a possible architecture based on DataFlow-Threads (DF-Threads), a novel execution model, on the Zynq Ultrascale+ platform, in order to assess the energy efficiency of DF-Threads. We measured the power consumption while RAW and RDMA messages were exchanged over the board-to-board interconnects.

Index Terms—Cyber-Physical Systems, Reconfigurable Systems, FPGA Programming, Thread Level Parallelization, Energy Evaluation

I. INTRODUCTION

Embedded processing touches nearly every application in our lives and spans a wide assortment of hardware and software platforms, from very basic to extremely complex ones, depending on the application [1]. Because the market's demand for performance is steadily increasing, it is necessary to explore high-performance architectures for embedded computing.

In the context of the AXIOM project [2]–[5] it was realized that there is presently an extreme fragmentation of both devices and tools for embedded processing. Specifically, when more complicated functionalities are required, the entire system must be revised and a new tool-chain must be adopted. Thus, our goal was to permit the programmers to simply deploy the device with a possibly standard and open-source tool-chain based on a full Linux OS software distribution.

An important contribution of the AXIOM project was the fabrication of an SBC ("the AXIOM board") based on an FPGA with embedded processors, specifically the Zynq Ultrascale+ [6]. Its features are: i) a high-speed reconfigurable interconnect for board-to-board communication; ii) a user-friendly programming environment, which allows us both to off-load parts of an algorithm to accelerators (on the programmable logic) and, at the same time, to distribute the computation workload across boards via DataFlow-Threads, a novel execution model [7]–[9]; and iii) the possibility of deploying an open-source tool-chain based on easy-to-program concepts like OmpSs [10], an OpenMP extension [11], [12]. However, scaling the performance of a computing system while retaining easy programmability is still an open challenge [13]. Additionally, tool-chains need to be integrated with suitable high-level synthesis tools in order to gain greater control over the programmable logic [14], [15].

In addition, multi-processor systems-on-chip (MPSoCs) are currently widely adopted, but the handling of many threads is still a source of considerable inefficiency. Thread management must consider not only the order of execution and the time quantum per thread, but also the implications of allocating a thread to a given core. This aspect becomes critical as the system grows in complexity, with deeper memory hierarchies, more interconnects and distributed resources. In this paper, we propose to reduce such inefficiencies by using an efficient execution model named DF-Threads [7]–[9]. We explore the energy efficiency of the AXIOM board when DF-Threads are distributed among AXIOM boards through well-defined, high-speed board-to-board message types.

The contribution of this work is to extract the power metrics of the AXIOM board (based on the Zynq U+ FPGA) with respect to the distribution of DF-Threads through different types of board-to-board messages.

The remainder of this paper is structured as follows: in Section II we discuss some related work; in Section III we present the general architecture of the AXIOM board and its soft IPs, such as the DF-Threads Scheduler (DFS) [8], on the Zynq Ultrascale+ platform; in Sections IV and V we describe the methodology and show some experiments regarding the power consumption when exchanging high-speed board-to-board messages; finally, in Section VI, we conclude the paper.

II. RELATED WORK

Data-flow execution models have been reviewed recently [7], [16], [17], as they promise an elegant way to effectively move data from one computational thread to another [18]–[20]. Importantly, in such models, the computations can be mostly performed in a producer-consumer manner, while for mutable shared data the memory model offered by Data-flow Threads (DF-Threads) [7] encompasses Transactional Memory [21]. This is a concurrency control mechanism, analogous to database transactions, that controls access to shared memory by replacing locks with atomic execution units, so that the user can focus on where atomicity is required.

In this context, the TERAFLUX project [22]–[25] realized such a data-flow model and extended it to multiple nodes that execute seamlessly through an appropriate memory semantics [7], [26]. In this semantics, a blend of consumer-producer patterns [27], [28] and transactional memory [21], [29] allows a novel combination of the data-flow paradigm and transactions in order to solve the consistency issues across nodes, where each node is supposed to be cache-coherent, as in a classical multi-core. Additionally, such distributed systems could support fault tolerance [30], [31]: a data-flow thread may be re-executed without harming the running program, since the thread inputs are preserved before the corresponding thread is scheduled.

Nowadays, FPGAs are widely used for prototyping embedded computers and have more recently become a significant component as accelerators in the HPC and CPS fields, since they carry out tasks with higher reliability, reconfigurability and energy efficiency [32]. Reconfigurable logic such as FPGAs offers outstanding ways to accelerate specific functions, but needs adequate tools to mitigate the complexity of programming [32]–[34].

Considering reconfigurable computing platforms based on FPGAs, many works have offered solutions to the dynamic allocation of tasks to general-purpose multi-core processors [35], [36] or to reconfigurable logic [37]. Nevertheless, these approaches have been effectively investigated only on single-core and multi-core super-scalar architectures.

Recent papers discuss in further detail the DF-Threads hardware framework [3], [5], [38]–[40], the software layers [41], [42] and application use-cases [4], [43], [44].

III. DF-THREAD MANAGEMENT FOR THE ZYNQ ULTRASCALE+

Recently, there has been a great effort to advance general programming models with thread management, such as P-threads, Cilk and OpenMP. In most of these models, however, synchronization and distribution of data between cores of different nodes must be managed manually by programmers, which imposes extra effort [10].

Instead, the DF-Threads execution model offers better scalability by managing the distribution of threads according to the data-flow paradigm [7], [45]. For this purpose, a Distributed Thread Scheduler (DTS) is proposed in [8], which is a hardware implementation of the DF-Threads model.

Figure 1 shows the block design used to implement and map the DF-Threads management onto the PL part of the Zynq UltraScale+ platform as an individual soft IP. The DF-Threads Scheduler (DFS) is tightly coupled (i.e., based on the AXI Stream protocol and proper buffering) to the NIC [46] module, so that it can send and receive the messages needed to distribute the workload across the network.

Since the DFS is offloaded to the PL, the overheads of thread management are reduced. This leads to better energy efficiency and also accelerates the distribution of tasks through the PL.
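The data-flow firing rule that the scheduler builds on can be illustrated in software. The following is a minimal sketch, assuming a simple model in which each thread carries a count of missing inputs and becomes ready when that count reaches zero. The names `DataflowThread`, `Scheduler`, `write` and `run` are hypothetical and chosen for illustration; they are not the DF-Threads API nor the DFS hardware interface.

```python
from collections import deque


class DataflowThread:
    """A thread that may run only after all of its inputs have been written."""

    def __init__(self, name, num_inputs, body):
        self.name = name
        self.pending = num_inputs  # number of inputs still missing
        self.inputs = {}           # input slot -> value
        self.body = body           # callable(scheduler, inputs)


class Scheduler:
    """Hypothetical software scheduler; illustrates the firing rule only."""

    def __init__(self):
        self.ready = deque()  # threads whose inputs are complete
        self.log = []         # execution order, kept for inspection

    def add(self, thread):
        # A thread without inputs is ready immediately.
        if thread.pending == 0:
            self.ready.append(thread)
        return thread

    def write(self, thread, slot, value):
        # A producer stores one input of a consumer thread.
        if slot in thread.inputs:
            raise ValueError(f"input {slot!r} of {thread.name} already written")
        thread.inputs[slot] = value
        thread.pending -= 1
        if thread.pending == 0:
            self.ready.append(thread)

    def run(self):
        # Execute ready threads until none is left.
        while self.ready:
            thread = self.ready.popleft()
            self.log.append(thread.name)
            thread.body(self, thread.inputs)


# Two producers feed one consumer that adds their values.
results = {}
sched = Scheduler()
adder = sched.add(
    DataflowThread("add", 2, lambda s, i: results.update(total=i["a"] + i["b"]))
)
sched.add(DataflowThread("produce_a", 0, lambda s, i: s.write(adder, "a", 2)))
sched.add(DataflowThread("produce_b", 0, lambda s, i: s.write(adder, "b", 3)))
sched.run()
print(results["total"], sched.log)
```

In this sketch the consumer never polls or locks: it is enqueued by the last producer that completes its inputs, which is the property that makes the bookkeeping a natural candidate for offloading.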

IV. METHODOLOGY

A reliable and precise method to measure and monitor the power consumption of the system is necessary in order to enable optimization towards energy efficiency. Additionally, the ability to estimate the power consumption of a design is mandatory for efficient part selection and system reliability. On the AXIOM board, eight dedicated INA219 power monitor integrated circuits monitor the crucial power rails of the board, which are reported in Table I. These INA219 ICs communicate with the FPGA through an I2C bus connected to the PS; more detailed information on the INA219 can be found in [47].

TABLE I: AXIOM board’s power supply rails equipped with dedicated power monitors

<table>
<thead>
<tr>
<th>Power supply rail</th>
<th>Nominal voltage [V]</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCC_INTFP</td>
<td>0.85</td>
<td>PS full-power domain supply voltage</td>
</tr>
<tr>
<td>VCCINT</td>
<td>0.85</td>
<td>PS internal power supply</td>
</tr>
<tr>
<td>INTFP_DDR</td>
<td>0.85</td>
<td>PS DDR controller and PHY supply voltage</td>
</tr>
<tr>
<td>1V2_DDR_PS</td>
<td>1.2</td>
<td>PS DDR supply</td>
</tr>
<tr>
<td>1V2_DDR_PL</td>
<td>1.2</td>
<td>PL DDR supply</td>
</tr>
<tr>
<td>MGTAVCC</td>
<td>0.9</td>
<td>Analog supply voltage for GTH transceiver</td>
</tr>
<tr>
<td>MGTAVTT</td>
<td>1.2</td>
<td>Analog supply voltage for GTH transceiver</td>
</tr>
</tbody>
</table>

In order to distribute threads across the network, two types of messages are exchanged: 1) RAW and 2) Remote Direct Memory Access (RDMA). The data of RAW messages are sourced by the DFS, while RDMA messages carry a portion of memory to be moved between the source and destination nodes. We therefore measure the power consumption of the AXIOM board for each of these message types while the DFS is running and using the NIC. For this purpose, two boards have been interconnected: one board is configured in server mode (the DFS is the sender) and the other in client mode (the DFS is the receiver). A tool was designed to acquire power data by making IOCTL calls to the INA219 driver, which return the current drawn from each monitored rail.
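Since the acquisition tool obtains a current value per rail, a per-rail power figure follows from multiplying that current by the rail voltage. The snippet below is a minimal sketch of this arithmetic, assuming the nominal voltages of Table I are used as the rail voltages (the measured bus voltage could be used instead). The function names and the sample currents are hypothetical; the actual tool and its IOCTL interface are not reproduced here.

```python
# Nominal rail voltages in volts, taken from Table I.
NOMINAL_VOLTAGE = {
    "VCC_INTFP": 0.85,
    "VCCINT": 0.85,
    "INTFP_DDR": 0.85,
    "1V2_DDR_PS": 1.2,
    "1V2_DDR_PL": 1.2,
    "MGTAVCC": 0.9,
    "MGTAVTT": 1.2,
}


def rail_power_w(rail, current_a):
    """Power of one rail in watts: nominal voltage times measured current."""
    return NOMINAL_VOLTAGE[rail] * current_a


def total_power_w(currents_a):
    """Sum of the per-rail power over all rails present in the reading."""
    return sum(rail_power_w(rail, amps) for rail, amps in currents_a.items())


# Hypothetical current readings in amperes, for illustration only.
reading = {"MGTAVTT": 0.5, "VCCINT": 0.2}
for rail, amps in reading.items():
    print(rail, rail_power_w(rail, amps))
print("total", total_power_w(reading))
```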

V. EXPERIMENTS

In order to extract power values for the crucial rails of the board, we performed the experiments while the DFS issues RAW and RDMA messages of 1000M length in 10 cycles. The test lasted 240 s with a sampling period of 200 ms. The total power consumption of the board (sum of the seven crucial power rails) remained between 1 W and 1.6 W. Figure 2 illustrates the maximum power variations for the crucial voltage rails during RAW and RDMA transactions.
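The test parameters above fix the number of samples per run, and the sampled power trace can be integrated to obtain energy. The following is a minimal sketch, assuming evenly spaced samples and a simple rectangle-rule integration; the helper names are hypothetical, and the constant traces only bound the energy using the reported 1 W to 1.6 W range rather than reproduce measured data.

```python
TEST_DURATION_S = 240.0    # duration of one test, from the text
SAMPLING_PERIOD_S = 0.2    # 200 ms sampling period, from the text


def expected_sample_count(duration_s=TEST_DURATION_S, period_s=SAMPLING_PERIOD_S):
    """Number of power samples collected during one test."""
    return round(duration_s / period_s)


def energy_joules(power_samples_w, period_s=SAMPLING_PERIOD_S):
    """Rectangle-rule integration of evenly spaced power samples (W) over time."""
    return sum(power_samples_w) * period_s


n = expected_sample_count()
# Energy bounds if the board stayed at the lowest or highest reported total power.
low = energy_joules([1.0] * n)
high = energy_joules([1.6] * n)
print(n, low, high)
```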

As can be seen from Figure 2, the MGTAVTT voltage rail has the highest power consumption, since the termination circuits of the gigabit transceivers, supplied at 1.2 V, sink a larger amount of current. The average power consumption in client mode is 10.15% higher than in server mode, due to the extra processing effort needed to compose the acknowledgment message and send it back to the server. Moreover, since our DFS implementation does not access the PL DDR (it accesses the PS DDR), the average power consumption of the 1V2_DDR_PL voltage rail remained below 5.5 mW.

Finally, comparing the power consumption of RAW and RDMA messages, the RDMA message type consumes on average 9.7% less than the RAW message type. This arises from the extra dedicated logic needed to handle the data of RAW messages, whereas for RDMA messages the data are moved efficiently to the PS DDR by the Xilinx Data Mover soft IP.
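The percentages quoted in this section are relative differences between average power values. The snippet below is a minimal sketch of that computation, assuming the baseline is the mode or message type named as the reference in the text (server mode, and the RAW message type). The function names and the sample values are hypothetical and are not measured data.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of power samples."""
    return sum(values) / len(values)


def relative_difference_percent(value, reference):
    """Signed difference of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


# Hypothetical average power traces in watts, for illustration only.
server_w = [1.00, 1.02, 0.98]
client_w = [1.10, 1.12, 1.08]
print(relative_difference_percent(mean(client_w), mean(server_w)))
```

A positive result means the first argument draws more power than the reference; a negative result, as for RDMA with respect to RAW, means it draws less.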

**Fig. 2**: Maximum variations in power consumption of the crucial voltage rails of the Zynq Ultrascale+ while RAW and RDMA messages are issued.

VI. CONCLUSION

In this paper, we presented the deployment of scalable embedded computers using DF-Threads to distribute computations across multiple Zynq Ultrascale+ based boards (the AXIOM board). We discussed several solutions from the High-Performance Computing domain as well as from the embedded world. We proposed a DF-Threads Scheduler that permits efficient scalability across boards, and we presented some power measurements in the context of the AXIOM project. In order to optimize the efficiency of the board, we explored the energy consumption of the important voltage rails while the DFS exchanges messages across the nodes.

Future work involves further exploration of the power consumption for a larger number of nodes (boards) and with more benchmarks.

VII. ACKNOWLEDGMENT

The authors would like to thank Davide Catani of SECO (s.r.l) [48] for his support in developing the experiments. This work has been partially supported by the European Commission under the AXIOM H2020 project (id. 645496) and HiPEAC (id. 779656).

REFERENCES


Xilinx Inc., “Xilinx UltraScale Architecture.”


