Thesis abstract:
The ever-increasing complexity of embedded systems is driving design methodologies towards abstractions higher than the Register Transfer Level (RTL). In this scenario, High Level Synthesis (HLS) plays a significant role by enabling the automatic generation of custom hardware accelerators from high-level descriptions (e.g., C code). Typical HLS tools exploit parallelism mostly at the Instruction Level (ILP): they statically schedule the input specification and build centralized Finite State Machine (FSM) controllers. However, most applications have limited ILP, and centralized approaches rarely exploit coarser granularities efficiently, because FSMs are inherently serial. Novel HLS approaches are therefore looking at coarser parallelism, such as Task Level Parallelism (TLP). Early works in this direction adopted specialized specification formalisms, such as Petri nets or process networks, which reduced their applicability and effectiveness in HLS.

This work presents novel HLS methodologies for the efficient synthesis of C-based parallel specifications. To overcome the limitations of the FSM model, a parallel controller design is proposed, which allows multiple control flows to run concurrently and offers natural support for variable-latency operations, such as memory accesses. The adaptive controller is composed of a set of interacting modules, each independently managing the execution of an operation or a task. These modules check dependencies and resource constraints at runtime, enabling as-soon-as-possible execution without the need for a static schedule. The absence of a statically determined execution order required the definition of novel synthesis algorithms, since most common HLS techniques rely on an operation schedule. The proposed algorithms enabled the design and implementation of a complete HLS framework. The flow features automatic identification and exploitation of parallelism at different granularities: an analysis step, interfacing with a software compiler, processes the input specification and identifies concurrent operations or tasks, whose parallel execution is then enabled by the parallel controller architecture. Experimental results confirm the potential of the approach, reporting encouraging performance improvements over conventional techniques on a set of common HLS benchmarks.

Nevertheless, the interaction with software compilers, while beneficial for optimizing the input code, can limit parallelism exploitation: compiler analyses are often overly conservative and, in the presence of memory operations accessing shared resources, force serialization. To overcome this issue, this work considers the adoption of parallel programming paradigms based on the insertion of pragma annotations in the source code. Annotations such as OpenMP pragmas directly expose TLP and enable a more accurate dependence analysis. However, even in these settings, concurrent access to memories shared among tasks, common in parallel applications, still represents a bottleneck for performance improvement; in addition, it requires concurrency and synchronization management to ensure correct execution. This work addresses these challenges through the definition of efficient memory controllers, which support distributed and multi-ported memories and allow concurrent access to the memory resources while managing concurrency and synchronization.
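As a purely illustrative example (not drawn from the thesis itself), the following OpenMP-annotated C kernel sketches the kind of pragma-based parallel specification such a flow could accept; all identifiers are hypothetical placeholders:

    #include <stddef.h>

    /* Hypothetical input kernel: the pragma exposes task-level
       parallelism explicitly, so groups of independent iterations
       can be mapped to concurrently executing hardware tasks. */
    void vec_scale(float *dst, const float *src, float k, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];
    }

Because each iteration here touches a disjoint memory location, a dependence analysis can prove the tasks independent and no serialization is required; the harder case, shared locations, is what the memory controllers address.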
Concurrency is managed by preventing, at runtime, multiple operations from targeting the same memory location. Synchronization is handled by providing support for atomic memory operations, which are commonly used in parallel programming; an illustrative sketch follows below. These techniques have been evaluated on several parallel applications instrumented with OpenMP pragmas, demonstrating their effectiveness: experimental results show significant speed-ups, often close to linear with respect to the degree of available parallelism.
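As a final, purely illustrative sketch (again not taken from the thesis, with placeholder identifiers), an atomic update such as the one below is the kind of synchronization primitive the memory controllers are described as supporting:

    #include <stddef.h>

    /* Hypothetical kernel: several tasks may increment the same bin,
       so the update must execute as an indivisible read-modify-write,
       here requested through the standard OpenMP atomic pragma. */
    void histogram(unsigned *bins, const unsigned char *data, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            #pragma omp atomic
            bins[data[i]]++;
        }
    }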