Embedded, Networked and Distributed Systems
This group systematically explores architectures, models, algorithms, engineering, measurement, and control of embedded, networked, and distributed systems. The group is especially interested in large-scale systems, based on both wireless and wireline interconnection technologies, as well as high-performance system interconnects. Current group interests include:
- Environmental sensor monitoring systems
- Home network monitoring and management systems
- Complex event processing systems
- Routing and broadcasting in wireless, mobile, vehicular, ad hoc networks
- Networked multimedia - transport protocols and standards
- Scalability of the Internet routing architecture
- Content-centric networks
- Novel approaches to network protocol implementation
- Network instrumentation and traffic modelling
- Novel analysis of program code
- Runtime memory management
- Acceleration of computations on parallel platforms
- Programming models for many-core systems
- FPGA and GPU programming
- Functional hardware description languages
- large scale(typically distributed) information/data systems (IDEAS)
Many academic members are also involved in a cross-cutting The Glasgow Parallelism Group (GPG) theme due to the shared interests in multi- and many-core machines, Clouds, High-Performance Computers (HPC). ENDS group also has strong connections with SICSA in the Networking and Systems research theme.
- Home networks: the human-centred approach, a.k.a. the Homework project
- Adaptive Error Measurement, Concealment, and Repair for IP Streaming Video
- Architectures and Protocols for Massively Scalable IP Media Streaming
- Explicit Congestion Notification for RTP over UDP
- The Glasgow Raspberry Pi Cloud
- RELEASE: A High-Level Paradigm for Reliable Large-Scale Server Software
- Adaptive JIT-based Parallelism (AJITPar)
- HPC-GAP: High Performance Computational Algebra and Discrete Mathematics
- WebRTC: Media Transport Protocols and Adaption
- AnyScale Applications
- Glasgow Network Functions
- Prof Peter Triantafillou
- Prof Phil Trinder
- Dr Lewis M Mackenzie
- Dr John T O'Donnell
- Dr Colin Perkins
- Dr Dimitrios Pezaros
- Dr Jeremy Singer
- Dr Tim Storer
- Dr Wim Vanderbauwhede
Research Fellows and Associates
- Dr Christos Anagnostopoulos
- Dr Natalia Chechina
- Dr Paul Harvey
- Dr Patrick Maier
- Dr Yashar Moshfeghi
- Dr Waqar Nabi
- Dr Nikos Ntarmos
- Dr David R. White
Research Assistants and Research Students
- Mr Abyd Adhami
- Mr Malak Al Jabri
- Mr Gubran Alkubati
- Mr Khaled Alnowaiser
- Mr Dhahi Sulaybikh D Alshammari
- Mr Blair Archibald
- Mr Atoshum (Samuel) Cahsai
- Mrs Mozhgan Kabiri Chimeh
- Mr Niaz Chowdhury
- Mr Michael Comerford
- Mr Richard Cziva
- Mr Joe Davidson
- Mr Simon Jouet
- Ms Foteini Katsarou
- Mr Aamir Khan
- Miss Sharifa Al Khanjari
- Mr Wing Li
- Mr Georgios Maniatis
- Mr Stephen McQuistin
- Mr Stuart Monro
- Mr Magnus Morton
- Mr Georgios Sfakianakis
- Mr Ashkan Tousimojarad
- Ms Jing Wang
- Mr Kyle White
- Performance measurement and analysis
- network measurement
- policy-based network management
- network modelling
- next generation internet
- complex systems engineering
- networked multimedia
- parallel programming
- embedded systems
- sensor networks
- digital circuit design
- task parallelism
- data parallelism
This Week’s Events
There are no events scheduled for this week
There are no upcoming events
Speaker: Prof Neil Bergmann, The University of Queensland , Australia
Date: 17 August, 2016
Time: 17:00 - 18:00
Location: Sir Alwyn Williams Building, 423 Seminar Room
Wireless sensor networks are becoming the eyes and ears (and other senses) of the Internet, allowing high temporal and spatial sampling of data from both the natural and the built environment. The benefits of wireless operation often mean that such sensor nodes are battery powered, perhaps with some energy harvesting. Usually, such sensors are limited in their temporal resolution by their limited energy. "Dumb" sensors simply record and transmit raw transducer data streams for subsequent data analysis by powerful processors. The majority of the energy used by such sensors is in the radio transmission of the raw data. Communications energy can be saved, if the data can be compressed or otherwise on a "smart" sensor node, and only compressed or summary information sent, but this requires energy-efficient on-node processing. This seminar summarises results from a past project using machine-learning techniques for on-sensor processing, and discusses proposals for how this on-sensor processing can be done in a more energy efficient fashion using reconfigurable hardware (FPGAs).
Biography. Prof. Neil Bergmann has been the Chair of Embedded Systems in the School of Information Technology and Electrical Engineering at the University of Queensland, Brisbane, Australia since 2001. He has Bachelors degrees in Electrical Engineering, Computer Science, and Arts from the University of Queensland, and a PhD in Computer Science from the University of Edinburgh in 1984. His research interests and in computer systems, especially reconfigurable computing and Wireless Sensor Networks. He is on sabbatical leave, visiting University of Edinburgh during August 2016.
Manycore processors mark time. They barely reach a hundred of cores after ten years of existence. According to Moore's law, we should have more than a thousand. The GPU themselves have more than 5000 SP+DP cores. In the talk I will show that it mainly comes from a useless complexity of the memory (tens of MB when a GPU uses only a few MB) and the interconnect (a NoC or a ring, when in a GPU, cores are simply abutted). I will inventoriate the needed hardware to compute in parallel. I will insist on the importance of determinism, the uselessness of memory and I will point out a favoured communication direction, from the cause to the effect of a causality. I will describe the design of a parallelizing core, build to be combined with itself to form a 3000 core processor. The core design is simple because it embarks quite no memory and it has connections only with two neighbours.
On the software side, I will present a new parallel programming model, not based on an OS thread parallelization but on a hardware parallelization relying on a new "fork" machine instruction added to the ISA. I will present various patterns to parallelize imperative language programming structures: functions, for and while loops, reductions. I will show how such patterns can be used to parallelize classical C functions and how the created threads populate the available hardware thread slots in the processor cores.
The hardware does not use memory. Instead, each core uses a set of registers and functional units which are enough to compute scalars from scalars. In our parallel programming model, we avoid data structures: no arrays, no structures, no pointers, no lists. A parallel computation gets the elements of structured data from parallel inputs and puts the computed elements of structured data from parallel outputs. Inside the computation, only scalars are handled.
The proposed parallel programming model is deterministic. The semantic of a parallel execution is given by a referential sequential order. Hence, running the code sequentially or in parallel produces the same result. Testing and debugging parallel programs is as easy as testing and debugging a sequential run.
Our parallel programming model has strong connections with the functional programming paradigm from the composition of no side effect functions.
The increasing importance of parallelism has motivated the creation of
better abstractions for writing parallel software, such as structured
parallelism using nested algorithmic skeletons. However, statically
choosing a combination of algorithmic skeletons that yield good
speedups when compared with a manually optimised solution remains a
difficult task. In order to do so, it is crucial to be able to
simultaneously reason about both the cost of, and semantic
equivalences between different parallel structures. In this talk, I
will present a new type-based mechanism for reasoning about these
properties, focusing on the introduction of parallelism to a
specification of the functional behaviour of a program. This mechanism
exploits well-known properties of a very general recursion pattern,
hylomorphisms, and a denotational semantics for structured parallel
processes described in terms of hylomorphisms. Using this approach, it
is possible to determine formally whether it is possible to introduce
a desired parallel structure to a program without altering its
functional behaviour, and choose a structure that minimises some
parameter cost model.
GPG: Inferring Program Transformations from Type Transformations for Partitioning of Ordered Sets into Overlapping Sections
In the distributed computation of finite difference grids (e.g. weather simulations), partitioning of arrays into overlapping sets is an essential step. The overlapping regions are commonly known as "halos". I present a formalism for order-preserving transformations of such halo-vector types into overlapping sections. I will show that this formalism allows to automatically derive instances of dataflow-style programs consisting of opaque element-processing functions combined using higher-order functions.
Imagine developing an app that can run on any device scale, from a tiny wireless mote to a massive cloud datacenter. "Write once, scale anywhere" is the vision of our Anyscale Apps project funded by EPSRC. We are now half-way through the project, so this is a good time to take stock. Key concepts that have emerged so far include:
- Task variants - interchangeable components with different non-functional characteristics but the same API - e.g. quick-and-imprecise face recognition versus more complex image processing.
- Economic utility theory to select which variants will execute at a given time
- The need for realistic benchmarks and testbeds - currently we are evaluating a multi-scale Robot platform.
Until recently the programming of heterogeneous accelerators, such as GPUs, has largely revolved around low-level programming models that were designed to match the capabilities on the hardware. As heterogeneous programming has become more mainstream the need for higher-level models, based on the requirements of the programmer, has become apparent. Many approaches have appeared, but those extending the C++ programming model have seen the most interest as they manage to provide both high-level abstractions and low-level control if required. This has led to a standardisation of key approaches and proposed attempts to provide a unified hardware description.
This talk will describe some of the standardised approaches for parallel and heterogeneous programming in C++ that are appearing today, such as SYCL for OpenCL and the C++17 Parallel STL. Then there will be a more speculative look at how C++ could look when programming forthcoming hardware, such as the Heterogeneous System Architecture (HSA), where the CPU and accelerators are more tightly connected.
Using static analysis techniques compilers for lazy functional languages can identify parts of a program that can be legitimately evaluated in parallel with the main thread of execution. These techniques can produce improvements in the runtime performance of a program, but are limited by the static analyses’ poor prediction of runtime performance. This talk outlines the development of a system that uses iterative compilation in addition to well-studied static analysis techniques. Our representation of the parallel programs allows us to use traditional 'blind' search techniques or profile-directed improvement. We compare the results of different search strategies and discuss the pitfalls and limitations of our technique. Overall, the use of iterative feedback allows us to achieve higher performance than through static analysis alone.
Speaker: Ms Konstantina Panagiotopoulou, Heriot-Watt University
Date: 09 December, 2015
Time: 11:00 - 12:00
Location: University of Glasgow
The rapidly increasing number of components in modern High Performance Computing (HPC) systems provides a challenge on their resilience; predictions of time between failures on ExaScale systems range from hours to minutes. Yet, the prevalent HPC programming model today does not tolerate faults. This talk features the design and initial implementation of transparent resilience for Chapel, a parallel HPC language following the Partitioned Global Address Space (PGAS) programming model.
We address cases of hardware failure on one or multiple nodes during program execution in a distributed setup, using detection and recovery mechanisms. We focus on the runtime system, particularly on the communication (GASNet) and tasking layers to address task parallelism and extend the work on library level to handle data parallelism. Ongoing work addresses integration of distributed task adoption strategies with Chapel's default data distributions. This talk summarises results and experiences from a 2-month internship with the Chapel developer's group at Cray.
Bio: I'm a third year PhD student at Heriot-Watt University and member of the Dependable Systems Group. My research interests are in the design and implementation of Partitioned Global Address Space (PGAS) languages, in the context of High Performance Computing, with focus on resilience. PGAS languages implement the concept of a shared global address space, and use language constructs to distribute data structures over machines, providing the programmer with opportunities to tune data locality and enhance performance. Resilience, is one of the main challenging topics at ExaScale, and an area of gaining popularity among researchers.
GPU and multicore hardware architectures are commonly used in many different application areas to accelerate problem solutions relative to single CPU architectures. The typical approach to accessing these hardware architectures requires embedding logic into the programming language used to construct the application; the two primary forms of embedding are: calls to API routines to access the concurrent functionality, or pragmas providing concurrency hints to a language compiler such that particular blocks of code are targeted to the concurrent functionality. The former approach is verbose and semantically bankrupt, while the success of the latter approach is restricted to simple, static uses of the functionality.
This talk is about combining actor-based programming and OpenCL to simplify programming multicore CPUs and GPUs.
For years people have been designing electronic and computing systems focusing on improving performance but only "keeping power and energy consumption in mind". This is a way to design energy-aware or power-efficient systems where energy is considered as a resource whose utilization must be optimized in the realm of performance constraints.
Increasingly, energy and power turn from optimization criteria into constraints, sometimes as critical as, for example, reliability and timing. Furthermore, quanta of energy or specific levels of power can shape the system's action. In other words, the system's behaviour, i.e. the way how computation and communication is carried out, can be determined or modulated by the flow of energy into the system. This view becomes dominant when energy is harvested from the environment or strictly rationed if it comes from internal sources. This view is also analogous to what happens in biological systems.
In this talk we look at the energy-modulated computing paradigm and illustrate its manifestations in system design, such as:
- Converting electric charge into causality and self-timed operation
- Using concurrency for best energy utilisation
- Models for designing energy-proportional computers (resources, modes, order graphs, partial orders)
The talk will hopefully be motivating to a wide range of audience, including electronic and computer engineers interested in physical and mathematical aspects of (concurrent) computations.
Biography. Alexandre (Alex) Yakovlev was born in 1956 in Russia. He received D.Sc. from Newcastle University in 2006, and M.Sc. and Ph.D. from St. Petersburg Electrical Engineering Institute in 1979 and 1982 respectively, where he worked in the area of asynchronous and concurrent systems since 1980, and in the period between 1982 and 1990 held positions of assistant and associate professor at the Computing Science department. Since 1991 he has been at the Newcastle University, where he worked as a lecturer, reader and professor at the Computing Science department until 2002, and is now heading the MicroSystems research group (http://async.org.uk) at the School of Electrical and Electronic Engineering. His interests and publications are in the field of modelling and design of asynchronous, concurrent, real-time and real-power circuits and systems. He has published six monographs and more than 350 papers in academic journals and conferences, has managed over 30 research contracts and supervised over 40 PhD students. He has been a general chair and PC chair of several international conferences, including the IEEE Int. Symposium on Asynchronous Circuits and Systems (ASYNC), Petri nets (ICATPN), Application of Concurrency to Systems Design (ACSD), Network on Chip Symposium (NOCS), and has been a chairman of the Steering committee of the ACSD conference for the last 15 years. In 2011-2013 he was a Dream Fellow of EPSRC, UK, to investigate different aspects of energy-modulated computing.
Speaker: Prof Joe Armstrong, KTH Royal Institute of Technology, Ericsson AB
Date: 19 November, 2015
Time: 11:00 - 12:00
Location: University of Glasgow
Changing a popular program language is very difficult, instead of changing Erlang I define a language erl2 while compiles to erlang.
This talk discusses the following:
- Why changing a language is difficult
- New language, modified language or code generator?
- What's new in erl2
- Mutable value chains
- "black box" crash recorders
- Global "global" processes
This is work in progress
The Erlang language provides features which make it very easy to implement applications distributed over large networks. The problem then arises of how one should deploy such applications, particularly in networks where the individual nodes may have varying characteristics and where communication times may be non-uniform.
In this talk, I'll describe some work from the RELEASE project at Glasgow where we designed and implemented libraries providing methods for "semi-explicit placement", where the programmer selects nodes for spawning remote processes based on properties of nodes and communication latencies. We claim that these methods will help programmers to achieve good performance in a portable way, without requiring detailed knowledge of the network in advance.
I'll give an introduction to Erlang and the issues that arise in deploying distributed applications, and then describe our libraries, including some topological and statistical methods used in their design and validation. The São Tomé shorttail and other birds will also put in an appearance.
GPG: Obliterating Obstructions: Detecting Dependencies Disruptive to Parallelisation in Recursive Functions
To take advantage of increasingly parallel hardware, a simple, safe, and effective method to introduce parallelism is needed. Current approaches can be divided into two broad categories: automatic, and abstraction. Whilst fully automatic solutions mean the programmer need not lift a finger, they tend to target highly specific constructs, rendering them virtually useless in all but a few situations. Conversely, the development of better abstractions and interfaces presents a more general solution, but still requires a level of expertise from the programmer to be effective. This is especially pronounced when a program must be transformed to enable the introduction of parallelism. To reduce the burden this transformation phase places on the proverbial programmer, we propose a method that uses static analysis techniques to identify operations within tail-recursive functions that are obstructive to the introduction of parallelism, and use refactoring techniques to extract and expose potential parallelism in spite of those obstructions.
Bio: Studying under Prof Kevin Hammond and Dr Christopher Brown, Adam is a PhD student at the University of St Andrews. He is currently interested in dependency analysis and program transformation techniques, principally to enable the introduction of parallelism to sequential programs. Having cut his teeth on Prolog and Miranda back at UCL, he currently has an odd fixation with Erlang, but is not above tinkering in Haskell or perpetually intending to play with Idris.
Scaling distributed computing environments to extend from 100's to 1000's of nodes is a challenge for programming environments such as Erlang/OTP. Traditional large scale compute environments use MPI or offer simple Grid computing using Distributed Resource Managers. Functional languages such as Erlang provide greater flexibility and programmer productivity than MPI or Grid based programs but don't scale as well. This session will look at the scaling constraints of ErlangOTP and discuss opportunities for utilising some of the techniques used by MPI, such as RDMA and Multicast to improve its scalability.
Widely adumbrated as patterns of parallel computation and communication, algorithmic skeletons introduce a viable solution for efficiently programming modern heterogeneous multi-core architectures equipped not only with traditional multi-core CPUs, but also with one or more programmable Graphics Processing Units (GPUs). By systematically applying algorithmic skeletons to address complex programming tasks, it is arguably possible to separate the coordination from the computation in a parallel program, and therefore subdivide a complex program into building blocks (modules, skids, or components) that can be independently created and then used in different systems to drive multiple functionalities. By exploiting such systematic division, it is feasible to automate coordination by ad- dressing extra-functional and non-functional features such as application performance, portability, and resource utilisations from the component level in heterogeneous multi-core architecture. In this paper, we introduce a novel approach to exploit the inherent features of skeleton-based applications in order to automatically coordinate them over heterogeneous (CPU/GPU) multi-core architectures and improve their performance. Our systematic evaluation demonstrates up to one order of magnitude speed-up on heterogeneous multi-core architectures.
Bio. Mehdi Goli is a software engineer at Codeplay Ltd. He has got his PhD on "Autonomic Behavioural Framework for Structural Parallelism over Heterogeneous Multi-Core Systems" at IDEAS Research Institute, Robert Gordon University. His current research interests include high-performance computing, scientific GPU computing, parallel computing. He is one of the main designers and developers of heterogeneous back-end for the FastFLow programming framework.
The compiler implements a parallel extension of C using a data parallel notation similar to the Intel Cylk compiler. It is undergoing tests at the moment but already supports auto-vectorisation and map reduce. Within the next week or two it should have automatic multi-core parallelisation. It is implemented in Java and uses the gcc preprocessor and linker.
Tracing JIT compilation generates units of compilation that are easy to analyse and are known to execute frequently. The AJITPar project aims to investigate whether the information in JIT traces can be used to make better scheduling decisions or perform code transformations to adapt the code for a specific parallel architecture. To achieve this goal, a cost model must be developed to estimate the execution time of an individual trace.
This paper presents the design and implementation of a system for ex- tracting JIT trace information from the Pycket JIT compiler. We define four increasingly parametric cost models for Pycket traces. We test the accuracy of these cost models for predicting the cost of individual traces on a set of loop-based micro-benchmarks. We also compare the accuracy of the cost models for predicting whole program execution time over the Pycket benchmark suite. Our preliminary results show the two simplest models provide only limited accuracy. The accuracy of the more com- plex models depends on the choice of parameters; in ongoing work we are systematically exploring the parameter space.
Bio: Dave Britton is a professor of physics at the University of Glasgow and Project Leader of the GridPP project that provides Grid computing for particle physics throughout the UK. He is a member of the ATLAS collaboration, one of the experiments at the Large Hadron Collider at CERN with an interest in Higgs decaying to a pair of tau-leptons. Previously he worked on CMS, another of the LHC experiments, qualifying the crystals that make up the end-caps of the electromagnetic calorimeter. He has also worked at the Stanford Linear Accelerator (the BaBar experiment); Cornell (the CLEO experiment); and at DESY in Hamburg (the ARGUS experiment) with an emphasis on tracking detectors. Earlier work at TRIUMF in Vancouver established the most stringent limits on lepton universality through rare pion decays. He has been involved with the GridPP project since conception in 2000 and was one of the lead authors of the proposals for all three phases. Initially appointed as Project Manager, he took over as the GridPP Project leader in 2008. GridPP is a collaboration of Particle Physicists and Computing Scientists from 19 UK Universities together with the Rutherford-Appleton Laboratory and CERN, who have built a Grid for Particle Physics.
Profiling tools are essential for understanding and tuning the performance of both programs and parallel language implementations.
Assessing the performance of a program in a language with high-level parallel coordination is often complicated by the layers of abstraction present in the language and its implementation. We investigate whether it is possible to profile parallel Domain Specific Languages (DSLs) using existing host language profiling tools. The key challenge is that the host language tools report the performance of the DSL runtime system (RTS) executing the application rather than the performance of the DSL application. The key questions are whether a correct, effective and efficient profiler can be constructed using host language profiling tools; is it possible to effectively profile the DSL implementation, and what capabilities are required of the host language profiling tools? We develop a profiler for the parallel DSL, Haskell Distributed Parallel Haskell (HdpH) using the host language profiling tools. We show that it is possible to construct a profiler (HdpHProf) to support performance analysis of both the DSL applications and the DSL implementation. The implementation uses several new GHC features, including the Ghc-Events Library and ThreadScope, and for the first time two performance analysis tools for DSL HdpH internals, i.e. Spark Pool Contention Analysis, and Registry Contention Analysis.
The subgraph isomorphism problem is to find a little pattern graph inside a big target graph. Most algorithms for the problem are based upon inference and backtracking search. I'll look at one of these algorithms, and discuss how to parallelise it. The main complication is backjumping: when a conflict is reached, this algorithm can sometimes prove that it is safe to backtrack several steps immediately. I'll discuss how we can refactor backjumping as a special kind of fold, and then explain why the standard fold skeleton is no good: to avoid getting an absolute slowdown, we need both controlled work-stealing, and work cancellation, neither of which have been given the attention they deserve in the literature.
SLAM - simultaneous location and mapping - is a key platform for understanding 3D environments for a huge range of applications, spanning robotics and augmented reality and beyond. Building a really usable map of the environment requires "dense" methods currently feasible in realtime only on powerful hardware. This talk will introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives. This work is part of a larger research agenda aiming to push the limits of compiler technology up the "food chain", to explore higher-level algorithmic aspects of the design space and low-level implementation choices together, and I will present some preliminary results showing some of the potential of this idea.
Biography: Paul Kelly is Professor of Software Technology at Imperial College London, where he heads the Software Performance Optimisation research group and serves as co-Director of Imperial's Centre for Computational Methods in Science and Engineering (http://www.imperial.ac.uk/people/p.kelly).
Idris (http://idris-lang.org/) is a general-purpose programming language with an expressive type system which allows a programmer to state properties of a program precisely in its type. Type checking is equivalent to formally and mechanically checking a program's correctness. Introductory examples of programs verified in this way typically involve length preserving operations on lists, or ordering invariants on sorting.
Realistically, though, programming is not so simple: programs interact with users, communicate over networks, manipulate state, deal with erroneous input, and so on. In this talk I will give an introduction to programming in Idris, with demonstrations, and show how its advanced type systems allows us to express such interactions precisely. I will show how it supports verification of stateful and communicating systems, in particular giving an example showing how to verify properties of concurrent communicating systems.
The University is expanding its portfolio of Massive Open Online Courses (MOOCs), hosted on the FutureLearn platform. Wim and I are designing a course entitled 'Introduction to Functional Programming in Haskell'. It's going to be available to distance learners. It will also form the first part of our Functional Programming 4 course.
In this GPG meeting, we'll give some background about MOOCs. We'd be interested to hear your experiences of online learning. Then we will discuss the concepts we hope to cover in the new Haskell MOOC, as well as our strategy for course scalability. Please come along and help to shape this course!
Speaker: Craig Mclaughlin, Dimitar Petrov, Gordon Reid, Glasgow University
Date: 25 March, 2015
Time: 13:00 - 14:00
Location: University of Glasgow
1. Static Verification for Modern Software Systems
Presenter: Craig Mclaughlin
Summary: The current proposal addresses the task of verifying properties of software systems which may be concurrent, distributed and/or operating on heterogeneous architectures. Several projects have extended techniques for sequential programs to the concurrent setting to handle concurrency within GPGPU programming (principally OpenCL/CUDA). Others have taken ideas from type theory to improve static guarantees about communication systems (such as MPI). Recent work has explored combining separation logic and session types to support more powerful reasoning for distributed programs. The aim of the current proposal is to apply this hybrid logic in the context of concurrent and distributed programming models in an effort to enhance the properties one can verify statically about these systems.
2. MapReduce with CUDA to achieve inter- and intra-node parallelism
Presenter: Dimitar Petrov
Summary: As the scale of high-performance computing grows, new needs arise for increased parallelism, reduced complexity and improved programmability. The work proposed here combines MapReduce with CUDA to achieve both inter- and intra-node parallelism. Map and reduce tasks are combined into chunks and executed on the GPU with the aim of increased performance. Although similar solutions have been proposed, all of these rely on the developer's understanding of GPU programming, either CUDA or OpenCL. We propose using Rootbeer to alleviate the issue and make the system accessible to a wider developer audience. Rootbeer allows programmers to write code in Java and have the (de)serialization, kernel code generation and kernel launch done automatically. Rootbeer also provides additional abstractions so that it can run complex Java objects on a GPU.
3. Generating optimal performance portable OpenCL code from existing OpenCL code.
Presenter: Gordon Reid
Summary: OpenCL was developed as a platform-independent way of writing code for execution on a number of different device types. OpenCL guarantees functional portability, but not performance portability. There is a body of existing work in obtaining performance portability using a number of different methods. Some authors have opted for acceptable performance portability between some CPUs and GPUs, doing minimal changes to OpenCL code with some of the changes being manually written by the developer and selected using ifdefs or different source files. Other work takes a more dramatic approach, either requiring the program to be rewritten in another high-level language which is then compiled down to OpenCL code, or are attempting performance portability via a completely new compiler implementation. My plan is in two parts. The first involves using existing OpenCL code like in some previous work but going further and automatically tuning and changing more aspects of the device and kernel code for many device types, including the Intel Xeon Phi. The second involves a look into analysis and optimisations at the SPIR level to explore finer grained optimisations made possible by this new intermediate format.
Speaker: Prof Chris Pearce & Dr Lukasz Kaczmarczyk, University of Glasgow
Date: 18 March, 2015
Time: 13:00 - 14:00
Location: University of Glasgow
Our research is focussed on the computational modelling of materials and structures, with particular focus on multi-scale mechanics and multi-physics problems, applied to problems ranging from safety critical structures to biomechanics, supported by EPSRC, EU, TSB and industry. The Finite Element Method (FEM) is an extremely powerful numerical technique for finding approximate solutions to a broad range of science and engineering processes that are governed by Partial Differential Equations (PDEs). It has revolutionised simulation and predictive modelling in science and engineering and has had a pervasive impact on industrial engineering analysis.
Despite the undoubted success of FEM, there is a continuous drive to push finite element technology beyond current capabilities, to solve increasingly complex real-world problems as efficiently as possible. Established commercial FE software can be relatively slow to adopt new technologies due to the dominance of out-of-date software architecture. Perhaps the greatest part of FE code development is expended in dealing with technical problems related to software implementation, rather than resolving the underlying physics. The biggest challenge is to create a computationally tractable problem, which can be solved efficiently while simultaneously delivering an accurate and robust solution by controlling the numerical error.
The presentation will first present the context of our research, briefly describing some examples from our projects, before looking at more details of our software development platform (MoFEM). This is a flexible and adaptable framework, while tackling the conflicting requirements of accuracy and computational efficiency. The catalyst for the creation of MoFEM was the need for a flexible and numerically accurate modelling environment for multi-physics problems, driven by the need of our industrial partners.
Parallel programming models for many-core systems, such as the OpenCL programming model, claim to allow the construction of portable many-core software. Though performance portability may be an elusive goal, functional portability should not be. Functional portability depends on reliable compilers for many-core programming languages. This presents a real challenge for industry because many-core devices, such as GPUs, are evolving rapidly, as are the associated many-core languages (e.g., a revision of the OpenCL specification appears approximately once every 18 months). Compiler-writers are thus continually playing catch-up.
I will present recent ideas on how to apply random testing (fuzzing) to many-core compilers, in the context of OpenCL. The aim of this work is to help vendors to improve the quality of their compilers by discovering bugs quickly. Following up on successful prior works on sequential compiler testing, we have designed two techniques for generating random OpenCL kernels for purposes of compiler testing. The first approach builds on the Csmith project from the University of Utah (PLDI'11).
Here, we generate random OpenCL kernels that are guaranteed to be free from undefined behaviour and to behave deterministically. For such a kernel, differing results between two OpenCL implementations indicates that one of the implementations has compiled the kernel erroneously.
The second approach builds on the "equivalence modulo inputs" idea from researchers at UC Davis (PLDI'14). Here we start with an OpenCL kernel and generate mutations from the kernel such that, for a given input, each mutant is guaranteed by construction to compute the same result as the original kernel. In this case, observable differences between mutants for the given input indicate compiler bugs.
I will report on a large testing campaign with respect to 19 OpenCL (device, compiler) configurations. We found bugs in every configuration that we tested, including in compilers from AMD, Nvidia, Intel and Altera. Many of the bugs we reported have now been fixed by the associated vendors. In the talk I will show some examples of the bugs the technique has uncovered.
This is joint work with Christopher Lidbury, Andrei Lascu and Nathan Chong, and is due to appear at PLDI'15.
Bio: Alastair Donaldson is a Senior Lecturer in the Department of Computing, Imperial College London, where he leads the Multicore Programming Group and is Coordinator of the FP7 project CARP: Correct and Efficient Accelerator Programming. He has published more than 50 peer-reviewed papers in formal verification and multicore programming, and leads the GPUVerify project on automatic verification of GPU kernels, which is a collaboration with Microsoft Research. Before joining Imperial, Alastair was a Visiting Researcher at Microsoft Research Redmond, a Research Fellow at the University of Oxford and a Research Engineer at Codeplay Software Ltd. He holds a PhD from the University of Glasgow.
Modern multicore systems offer huge computing potential. Exploiting large parallel systems is still a very challenging task, however, especially as many software developers still use overly-sequential programming models. In this talk, I will present a radical and novel approach to introducing and tuning parallelism for heterogeneous shared-memory systems (comprising a mixture of CPUs and GPUs), that combines algorithmic skeletons, machine-learning, and refactoring tool support. Specifically, I will show how to use skeletons to model the parallelism, machine learning to predict the optimal configuration and mapping and refactoring to introduce the parallelism into the application. Finally, I will demonstrate our tools on a number of applications, showing that we can easily obtain comparable results to hand-tuned optimised versions.
GPG: Towards Performance Portability for Heterogeneous Systems (a Unified View of Algorithmic Choices and Hardware Optimisations)
Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability / ease of programming.
In this talk I will present a novel approach that offers high-level programming, code portability and high-performance. It is based on algorithmic pattern composition coupled with a powerful, yet simple, set of rewrite rules. This enables systematic transformation and optimization of a high-level program into a low-level hardware specific representation which leads to high performance code. I will show how a subset of the OpenCL programming model can be mapped to low-level patterns and how to automatically generate high performance OpenCL code on par with highly tuned implementations for multicore CPUs and GPUs.
A technical report describing this work can be found on arXiv: http://arxiv.org/abs/1502.02389
Bio: Christophe Dubach received his Ph.D in Informatics from the University of Edinburgh in 2009 and holds a M.Sc. degree in Computer Science from EPFL (Switzerland). He is a Lecturer (Assistant Professor) in the Institute for Computing Systems Architecture at the University of Edinburgh (UK). In 2010 he spent one year as a visiting researcher at the IBM Watson Research Center (USA) working on the LiquidMetal project. His current research interests includes high-level programming models for heterogeneous systems, co-design of both computer architecture and optimising compiler technology, adaptive microprocessor, and the application of machine learning in these areas.
Algorithms expressed using a small collection of combinators, including map, fold, scan, and sweep, can be implemented efficiently on circuits, FPGAs, and GPGPUs. Equational reasoning is effective for deriving such algorithms and proving them correct. I will describe its use in recent and current work on a family of algorithms related to functional arrays.
This talk covers the work I did over the summer at Kyoto University in Japan. It involved porting a Large Eddy Simulator to OpenCL and creating the Glasgow Model Coupling Framework (GMCF), a novel model coupling framework, aimed at creating systems of communicating simulators. I will discuss the porting approach and performance of the OpenCL LES and explain the rationale and architecture of GMCF, and discuss our current work on it.
For performance, modern CPUs (and GPUs) admit relaxed behaviour: violations of sequential consistency. To enable efficient implementation above relaxed hardware without fencing simple memory accesses, programming languages like C11 and C++11/14 also admit relaxed behaviour, introducing substantial complexity to the programming model. My work provides a formal semantics for C11 and C++11/14 concurrency that was developed in close communication with the ISO committees that define the languages. The precise formal model has been used for several positive results: to identify and fix problems in the ISO specification, for proofs of the soundness of compilation mappings, in the development of formal reasoning principles, and in the proofs of basic tenets of the design. At the same time, the formal model allows us to criticise the design, and pinpoint its flaws.
In this talk I will review some of the positive results built on the formal C++11 memory model, including the proof of one of the key design goals of the language: that simple data-race-free programs are provided with sequentially consistent semantics (DRF-SC). I will then show that the C++11 memory model is fatally flawed when applied to C-like languages. I will discuss what might be done about this open problem, and I will talk about the extension of formal relaxed memory models to cover GPU computing.
Speaker: Dr Josef Svenningsson, Chalmers University, Sweden
Date: 10 December, 2014
Time: 13:00 - 14:00
Location: University of Glasgow
We present two different representations for functional parallel arrays, Pull- and Push arrays. Functions written using these representations can easily be fused, which means that there is no performance penalty in writing in a high-level compositional style. We show how to use these representations in conjunction and how allows for a very expressive programming style for writing efficient array programs.
GPG: Modelling atmospheric aerosol: why should we care about complexity and how do we find out associated impacts?
Uncertainties associated with impacts of aerosol particles on climate are larger than those of any other atmospheric components. In addition, fine particulate material is widely acknowledged as one of the most important pollutant impacting on air quality and human health. Comprised of both inorganic and organic material, inorganic material is restricted to a few well-understood compounds. However, organic material can comprise many thousands, as yet largely unidentified, compounds with a vast range of properties. Owing to this complexity and diversity of atmospheric aerosol components, quantification of the properties that determine their highly uncertain climatic and human health impacts requires the development of novel technological applications. Encompassing multiple disciplines, this includes developments in the field of physics, chemistry, mathematics, engineering and computing.
Regional coupled chemistry-climate models attempt to carry our 'best knowledge' of how aerosols form and evolve on a UK and EU wide scale. Unfortunately, the process of embedding our 'best knowledge' is far from ideal. Chemical complexity comes at a premium and the associated impacts remain unknown. Mitigating this problem requires solutions from the fields of mathematics and computing. In this talk I will cover research carried out at the University of Manchester in this area along with developments from external collaborators.
Codes for computational science and downstream analysis (visualization and/or statistical modelling) have historically been dominated by imperative thinking. While this situation is evolving, e.g. through adoption of functional ideas in toolkits for high-performance visualization, we are still a long way from seeing a functional language such as Haskell used routinely in live application, certainly for those involving peta-scale data and above.
This talk describes recent work on multifield data analysis that has led to new questions in nuclear physics, and the expanding role of Haskell in this research programme. Following an introduction to visualization and topological analysis, I will describe the recent analysis of data from HPC simulation of nuclear fission undertaken by US collaborators that has led to new insight into the process of fission. The talk will cover ongoing work using parallel functional programming on both shared and distributed memory architecture, and conclude with questions about the utility and future of functional programming in large-scale computational science.
In this talk we'll discuss the Python language - and scripting languages in general - and its suitability for parallel programming. I'll present my 5th year project that is parallelising compiler for Python.
Since Turing established the equivalence of Turing machines and the lambda calculus in 1936/7, proof of the Turing computability of Curry’s combinatory calculus seems to have rested indirectly on its equivalence to the lambda calculus rather than direct construction. In this seminar, a Turing machine that reduces combinator expressions is presented. The TM is over 1000 quintuplets long and further illustrates the utter unsuitability for software engineering of Turing’s elegant and succinct model of computability.
At the tail end of the HPC-GAP project I spent some months trying to parallelise a computer algebra problem with the aim to scale up to supercomputers.
In this talk I'll gently introduce the problem, namely finding invariant bilinear forms in the representation theory of Hecke algebras. I'll present a schematic overview of the sequential algorithm (originally written in the computer algebra system GAP) and discuss how to parallelise it (using the SymGridPar2 framework). I'll present some performance results, including runtime and (estimated) speedup figures on up to 1024 cores, and an analysis why this problem is difficult to parallelise. There is a paper with details: http://link.springer.com/chapter/10.1007%2F978-3-319-09873-9_35
I am not an expert on the algebraic side of this case study; apologies in advance for all the half-truths I'll be telling. (The real purpose of this talk is to help me remember the little algebra I've learned during the project.)
Session types provide a static guarantee that concurrent programs respect communication protocols. Recently Caires, Pfenning, and Toninho, and Wadler, have developed a correspondence between the propositions of linear logic and session typed pi-calculus processes.
In this talk, I'll attempt to relate the cut-elimination semantics of this approach to a more typical operational semantics for session-typed concurrency in a functional language. I'll present a minimal concurrent functional language, GV, with a type system based on Wadler's interpretation of session types. I'll give a small-step operational semantics for GV. I'll develop a suitable notion of deadlock for our functional setting, based on existing approaches for capturing deadlock in pi-calculus, and show that well-typed GV programs are deadlock-free, deterministic and terminating. I'll also define two extensions of GV and show that they preserve deadlock freedom.
GPG Seminar - Hardware Support for Shared-memory Concurrency: Reconciling Programmability with Performance
With regard to hardware support for shared-memory concurrency, an inherent tradeoff between programmability and performance is presumed. For instance, the most intuitive memory consistency model, sequential consistency (SC), is presumed to be too expensive to support; likewise primitive synchronization operations like memory fences and atomic read-modify-writes (RMWs) (which are used as the building blocks of higher level synchronization constructs) are costly in current processors; finally, there are question marks about whether cache coherence will scale with increasing number of cores.
In this talk, I will argue that it is indeed possible to provide hardware support that enhances programmability without sacrificing performance. First, I will show how SC can be enforced efficiently using conflict ordering, a novel technique for achieving memory ordering. Second, I will show how RMWs can be implemented efficiently in x86 architectures. I will conclude the talk with a scalable approach to cache coherence called consistency-directed coherence.
The analysis of human activity from video sequences has witnessed a major growth in applications including surveillance, vehicle autonomy and intelligent domiciles. In many application domains, there is a strong need for intelligent video processing and more context-aware acquisition devices at source, to considerably reduce the amount of data to be transferred between sensors to perform real-time processing.
I will report ongoing work in the EPSRC Rathlin project between Heriot-Watt Edinburgh and Queens Belfasts Universities, where we are addressing such issues with FPGA-based processors for image processing. The focus of this talk will be on the programmability of FPGAs using dataflow programming languages. I will briefly explore existing dataflow architectures and languages, then describe our multicore CPU and FPGA optimisations for a person tracking case study, and conclude with ongoing work on program transformation for heterogeneous architectures involving FPGAs.
The development of parallel hardware has seen radical changes over the last two decades. These changes have made parallel programming main-stream technology, with the advent of multi-cores, and now pose new challenges in exploiting hierarchical, heterogeneous hardware, in the form of clusters of accelerated multi-cores.
In this talk I will examine design principles in a range of high-level parallel programming languages aiming to tackle these challenges and match their development with that of the underlying hardware. As a concrete instance I will discuss abstractions explored in the context of Glasgow parallel Haskell (GpH) and I will give recent performance results indicating how a very high level language approach can manage to adapt to underlying hardware. No previous knowledge of Klingon will be required for this talk.
The AJITPar (Adaptive Just-In-Time Parallelisation) project aims to achieve portable parallel performance by combining dynamic trace-based just-in-time compilation of a high-level parallel functional language with dynamic demand-driven scheduling of parallelism. This will involve estimating the granularity of parallel tasks by online profiling and static analysis, and (in later project stages) adapting granularity by online code transformations.
The starting point of AJITPar is lambdachine, a recently developed sequential Just-In-Time (JIT) compiler for Haskell. In this talk, I'll present our ideas for how to make lambdachine parallel. To this end, we design a low-level domain-specific language (DSL) for task-parallel computations. Specifically, this DSL should deal with task creation and scheduling, communication between and synchronisation of tasks, and serialisation of data (including tasks).
The design goals for this DSL are as follows:
* It should be expressive enough to enable building higher-level abstractions, like algorithmic skeletons.
* It should be flexible enough to express a range of benchmarks, from regular matrix bashing to highly irregular symbolic computations.
* It should support an equational theory of program transformations, to support online transformation in later stages of AJITPar.
* Most importantly, it should be easy to bolt on top of the single-threaded lambdachine runtime system.
[Apologies to those who have heard this talk 4 weeks ago at Heriot-Watt. I'll say exactly the same, just 30% slower to make up for the longer slot.]
Continuing a line of work by Abramsky (1994), by Bellin and Scott (1994), and by Caires and Pfenning (2010), among others, this paper presents CP, a calculus in which propositions of classical linear logic correspond to session types. Continuing a line of work by Honda (1993), by Honda, Kubo, and Vasconcelos (1998), and by Gay and Vasconcelos (2010), among others, this paper presents GV, a linear functional language with session types, and presents a translation from GV into CP. The translation formalises for the first time a connection between a standard presentation of session types and linear logic, and shows how a modification to the standard presentation yield a language free from deadlock, where deadlock freedom follows from the correspondence to linear logic.
ICFP, September 2012; Journal of Functional Programming, Best Papers of ICFP 2012.
Speaker: Dr Cordelia Hall and Dr John O'Donnell, The University of Glasgow
Date: 14 May, 2014
Time: 14:00 - 15:00
Location: University of Glasgow
John gave a talk on ESFAs last year. This is a fine-grain data-parallel algorithm suited to GPUs. We have a direct implementation written in CUDA but this is a large and low-level piece of code. This talk explores techniques used to optimise the code. We've also developed a much more readable language for GPU programming called AL, and have implemented ESFAs using it.
In this talk, we take a look at how suitable FPGAs are for High-Performance Computing (HPC), and outline our proposed approach to achieving optimal performance-per-Watt.
My work tackles an impending software crisis. Developments in hardware technology are set to render programmers unable to write efficient new applications, and to make existing applications run slower on new processors than on old ones. I propose a technique enabling programmers to generate applications that automatically adapt to future processors.
In my talk, I will briefly discuss the roots of the problem and the key insight to address it: computer programs exhibit recurring patterns. Detecting and exploiting these patterns is already a very hard problem. Once done, these patterns can be used to automate the process of transferring programs to future processors. In the remainder of this talk, I will present several activities that follow along this insight.
First, I will present Partans – an autotuning framework for stencil computation. In particular, I will discuss the performance impact of the PCIe interconnect in a seemingly homogeneous system. Second, I present results from a collaborative project with Samsung investigating how to design pattern based programming hierarchies. Finally, I will show some preliminary results on expressing commonly used benchmarks in a patternised way and conclude with an outlook onto my future research activities.
Last year I spent the summer at the University of Aizu in Japan, working on porting a particle dispersion simulation from CUDA to OpenCL. This talk will discuss the results from this, including a number of comparisons on different hardware platforms between the two technologies, including the Xeon Phi.
We present a profiling-based characterisation of 8 small and medium-sized semi-explicitly parallel functional applications from several domains with respect to thread granularity, communication, and memory management profile, which are highly relevant for dynamic and adaptive parallelism control. The results confirm that the parallel Haskell implementation copes well with large numbers of potential threads, quantify the impact of the communication rate on parallel performance, and identify memory management overhead as a major limiting factor to scalability on shared-memory machines. This characterisation helps to improve our understanding of the behaviour of parallel functional programs and discover program characteristics that can be exploited to optimise application performance by improving the effectiveness and efficiency of parallelism management.
Speaker: Susanne Oehler and Paul Cockshott, The University of Glasgow
Date: 19 March, 2014
Time: 13:00 - 14:00
Location: University of Glasgow
This will cover what we have done to enable the Glasgow Pascal Compiler to target the Xeon Phi, what we have learned from doing this, and what open questions there are for the further development of the system. We are particularly interested in soliciting advice from others on what we should do next.
GPG Seminar: Bridging the Divide: A New Tool-Supported Methodology for Programming Heterogeneous Multicore Machines
In this talk, we present a new programming methodology for introducing and tuning parallelism for heterogeneous shared-methodology systems (comprising a mixture of CPUs and GPUs), that combines algorithmic skeletons, Monte-Carlo Tree Search, and refactoring tool support. Using your approach, we demonstrate easily obtainable, significant and scalable speedups of up to 41 over the sequential code on a 24-core heterogeneous multiprocessor, and comparable to the best possible speedups that could be obtained.
As well as offering a simple API, pattern (or skeleton) based parallel programming models define implementation templates which can be tuned automatically. We will review the motivations for such an approach, and present four case-studies of its application.
This is a roadmap talk about general compiler optimisations. I will start by giving a brief overview of how compiler optimisations can be classified, discussing ahead-of-time optimisation, feedback-directed optimisation and runtime optimisation. I will discuss how information sharing enables more effective optimisation. I will argue that this trend is taking us to a Big Brother optimisation utopia.
Speaker: Dr Alexander Konovalov, Centre for Interdisciplinary Research in Computational Algebra, University of St Andrews
Date: 05 February, 2014
Time: 13:00 - 14:00
Location: University of Glasgow
I will give an overview of the project "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics" aimed at parallelising the GAP system (http://www.gap-system.org) to support both shared and distributed memory programming models. In particular, I will describe the memory model used by the multithreaded version of the GAP system, corresponding GAP language extensions, and challenges that we are meeting while making the GAP code thread-safe.
Given the wide scale adoption of multi-cores in main stream computing, parallel programs rarely execute in isolation and have to share the platform with other applications that compete for resources. If the external workload is not considered when mapping a program, it leads to a significant drop in performance. This talk describes an automatic approach that combines compile-time knowledge of the program with dynamic runtime workload information to determine the best adaptive mapping of programs to available resources. This approach delivers increased performance for the target application without penalizing the existing workload. This approach is evaluated on NAS and SpecOMP parallel benchmark programs across a wide range of workload scenarios.
On average, our approach achieves performance gain of 1.5x over a state-of-art schemes.
I'll talk about some work that I did in the recently-ended ADVANCE project (http://www.project-advance.eu/). We were looking at multicore applications built from components, and the goal was to make static predictions of statistical properties (throughput, latency) of the overall application based on the behaviour of the components.
We had some success in this, but also met with some very strange behaviour. I'm hoping that the audience may be able to offer explanations for some of the odd behaviour we observed.
LuvvieScript ( https://github.com/hypernumbers/LuvvieScript )
The old polyglot world of large software development teams supported by diverse operational teams no longer works. The current software environment demands teams that are capable of dripping software releases to the public by continuously deploying features. Modern teams are increasingly eschew normal operations support for DevOps - where the systems maintenance and monitoring is done by the software developers. These ways of working are only possible where features can be delivered 'meat to metal' by the software devs - from human factors and design (usually in the browser) through to the server and load distribution infrastructure - down through persistent storage to the disk and back up to end user.
The resulting deployment environment will include an in-page client-side 'run-time' and a set of server-side Erlang libraries that will encapsulate the browser-server comms enabling the front and back-end processes to send each other messages transparently (and front-end processes to send each other ones too). The developer will no longer directly script the Dom - but operate on a client-side model which will be rendered to the Dom. In this world user actions (mouse clicks, key presses, etc, etc) will present to the developer as messages and have to be actively subscribed to - making for a recognizably 'gen server' style of client-side programming.
About the speaker
Gordon Guthrie has been programming since the (late) 1970s, was Chief Technical Architect at if.com - the Service Architect at the City of Edinburgh and more recently CEO/CTO of Vixo. He has only been programming Erlang for a single decade unfortunately.
Speaker: Dr Paul Cockshott, Dr Paul Siebert, The University of Glasgow
Date: 04 December, 2013
Time: 13:00 - 14:00
Location: University of Glasgow
The prospect of GPU kernel fusion is often described in research papers as a standalone command-line tool. Such a tool adopts a usage pattern wherein a user isolates, or annotates, an ordered set of kernels. Given such OpenCL C kernels as input, the tool would output a single kernel, which performs similar calculations, hence minimising costly runtime intermediate load and store operations. Such a mode of operation is, however, a departure from normality for many developers, and is mainly of academic interest.
Automatic compiler-based kernel fusion could provide a vast improvement to the end-user's development experience. The OpenCL Host API, however, does not provide a means to specify opportunities for kernel fusion to the compiler. Ongoing and rapidly maturing compiler and runtime research, led by Codeplay within the LPGPU EU FP7 project, aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. Along with LPGPU's AES group from TU Berlin, we have now also investigated opportunities for kernel fusion within this new framework; utilising features from C++11 including lambda functions; variadic templates; and lazy evaluation using std::bind expressions.
While pixel-to-pixel transformations are interesting in this context, insomuch as they demonstrate the expressivity of this new single-source C++ framework, we also consider fusing transformations which utilise synchronisation within workgroups. Hence convolutions, utilising halos; and the use of the GPU's local shared memory are also explored.
A perennial problem has therefore been restructured to accommodate a modern C++-based expression of kernel fusion. Kernel fusion thus becomes an integrated component of an extended C++ compiler and runtime.