Getting Started

Requirements

Installation

Install scripts are provided for Linux and MacOS. Installation is optional.

Linux

On Linux, the script ‘install.sh’ can be run as root to install goopax in the default system directories.

MacOS

On MacOS, the folder ‘goopax.framework’ can be copied to ‘/Library/Frameworks/’, or to ‘$HOME/Library/Frameworks/’.

Windows, iOS, Android

No installation scripts are provided for these operating systems.

License File

The license file ‘goopax_license.h’ can be placed in the ‘share/goopax/licenses’ subfolder.

Building Programs

Building Programs with cmake

CMake is the recommended way to build goopax programs. Simply find the goopax package:

find_package(goopax)

and link the resulting target ‘goopax::goopax’ to your program:

target_link_libraries(my_program goopax::goopax)

If goopax is installed in a non-standard location, the path to <goopax>/share needs to be specified via the CMAKE_PREFIX_PATH variable, or via the goopax_DIR environment variable.

If goopax_license.h is stored in a different location than ‘share/goopax/licenses’, the path to the folder where it is stored should be set in the ‘GOOPAX_LICENSE_PATH’ environment variable.
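As a minimal sketch (the project name, target name, and source file are illustrative, not part of goopax), a complete CMakeLists.txt could look like this:

cmake_minimum_required(VERSION 3.10)
project(goopax_example CXX)

find_package(goopax REQUIRED)

add_executable(my_program main.cpp)
target_link_libraries(my_program goopax::goopax)

For a non-standard installation, the configure step could then be invoked as, e.g., cmake -DCMAKE_PREFIX_PATH=<goopax>/share ..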

Building Programs without cmake

When using other build systems, it is up to the user to set the paths correctly. The compiler needs to find the goopax header files, as well as the license file ‘goopax_license.h’ in its search path. The linker needs to link to the goopax library.

Building the Example Programs

The example programs are a good place to start. The source code is located in folder ‘examples’. The examples can be built with cmake in the usual way, e.g.,

cd examples
mkdir build
cd build
cmake ..
cmake --build .

Programming with Goopax

Program Structure

Including Header Files

To use Goopax, include the header file in your source code:

#include <goopax>

For OpenGL interoperability, include

#include <goopax_gl>

For OpenCL interoperability, include

#include <goopax_cl>

For Metal interoperability, include

#include <goopax_metal>

Namespaces

The basic Goopax functionality is found in namespace ’goopax’. For simplicity, we will assume that you import this namespace:

using namespace goopax;   // Basic GOOPAX types, such as buffer and gpu_float.

Optionally, one of the following namespaces can be used, for debug and release mode data types, respectively:

using namespace goopax::debug::types;    // Debugging data types Tint, Tfloat, ...
using namespace goopax::release::types;  // Release data types Tint, Tfloat, ...

For some functions from the C++ standard library, overloads for GPU types are provided in the namespace “std”.

Memory

In the GPU kernels, there are four types of memory:

Private Memory

Private memory is allocated as follows:

private_mem<float> A(16);

This will allocate an array of 16 floats for each thread.

Local Memory

Local memory is only available during the execution of a kernel. Each work group has its own local memory. Local memory is useful for communication within a work group. Between work groups no sharing is possible.

Local memory is declared as local_mem<type>, which has the constructor

local_mem(size_t size)

For example:

local_mem<double> mem(256);

will allocate 256 doubles in local memory for each group.

Global Memory

Global memory must be declared on the CPU side as a buffer and on the GPU side as a resource.

Buffer

On the CPU side, the memory is declared as buffer<type>, where the element type is specified as a template parameter. The constructor takes the arguments

buffer(goopax_device device, size_t size)

For example:

goopax_device device = default_device();
buffer<float> buf(device, 10);

will allocate a global buffer of type float with 10 elements on the video card.

Resource

The resource<type> is declared from within a GPU kernel and has to match a corresponding buffer. Resources can either be declared as parameters, e.g.,

kernel my_kernel(device, [](resource<int>& A)
{
    <GPU code...>
});

or a resource can be declared within the kernel body and linked to a specific memory buffer:

buffer<float> B(device, 1000);

kernel my_kernel(device, [&B]()
{
    resource<float> R(B);
...
});

With the first form, the corresponding memory buffer has to be passed as an argument when calling the kernel. With the second form, the resource is already linked to a specific buffer, so that the buffer does not have to be supplied when executing the kernel.
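A minimal sketch contrasting the two forms (kernel and buffer names are illustrative; the buffers are sized with default_global_size() so that indexing by global_id() is safe):

goopax_device device = default_device();
buffer<float> A(device, device.default_global_size());
buffer<float> B(device, device.default_global_size());

// Form 1: the resource is a kernel parameter; a matching buffer is passed at call time.
kernel k1(device, [](resource<float>& R)
{
    R[global_id()] = 1.0f;
});
k1(A);   // can be called with any buffer of matching type
k1(B);

// Form 2: the resource is bound to a specific buffer inside the kernel body.
kernel k2(device, [&B]()
{
    resource<float> R(B);
    R[global_id()] = 2.0f;
});
k2();    // no buffer argument needed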

SVM Memory

SVM memory is declared in a similar way to global memory, e.g.,

goopax_device device = default_device();
svm_buffer<float> buf(device, 10);

Instead of declaring a resource in the kernel, SVM memory is accessed via a pointer. The pointer can be provided to the kernel as a normal parameter, e.g.,

kernel prog(device, [](gpu_type<double*> ptr)
{
  ...
});

Not all devices support SVM memory. Whether SVM memory is supported can be checked with the function device::support_svm().

Memory Access

All memory types can be accessed within a kernel by the usual [] operator or by iterators.

Data Transfer from Host to GPU and from GPU to Host

Data transfer between CPU and GPU can be done with the copy_to_host and copy_from_host functions, or with buffer_map objects. For details, see refman.pdf.
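A minimal sketch, assuming the pointer-based overloads of copy_from_host and copy_to_host (see refman.pdf for the exact signatures):

goopax_device device = default_device();
buffer<float> buf(device, 10);

std::vector<float> host(10, 1.0f);
buf.copy_from_host(host.data());   // upload: host memory -> GPU buffer
// ... run kernels that use buf ...
buf.copy_to_host(host.data());     // download: GPU buffer -> host memory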

Barriers

To avoid race conditions, it is sometimes necessary to place barriers between memory accesses, especially when local threads are communicating with each other. Race conditions can occur when two or more threads access the same memory address and their access is not properly synchronized.

Local barriers

Threads within a workgroup can be synchronized by calling the local_barrier() function, for example:

local_mem<float> a(local_size());
a[local_id()] = 2;
local_barrier();
gpu_float b = a[(local_id() + 5) % local_size()];

Without the barrier, a race condition would occur. Note: Local barriers only synchronize memory access within a work group. Memory between different groups is not synchronized.

Global Barriers

Global barriers can be placed to synchronize threads across workgroups:

resource<float> A;
resource<float> B;

A[global_id()] = 5;
B[2*global_id()] = 7;

global_barrier();   // This will place a barrier on all threads.

gpu_float x = A[0] + B[129];

User-Defined Types in Memory Access

In addition to using intrinsic types, memory access can also be done with user-defined classes. This can simplify code development and provide a more structured and safe way of accessing the data structures.

To use user-defined types, some additional information must be provided. In the general case, this requires specializations of the structs goopax_struct_type and goopax_struct_changetype, as shown in the general case below.

For structs where all template parameters are types used in the data structure, the predefined macro GOOPAX_PREPARE_STRUCT can be used instead. It must be used at global scope, not within a namespace, and template arguments must be omitted.

GOOPAX_PREPARE_STRUCT

template <typename A, typename B> struct pair
{
  A first;
  B second;
};

// Easy case: Only types are used as template arguments. Can use the macro as follows:
GOOPAX_PREPARE_STRUCT(pair)

General case

template <typename T, size_t N>
struct array
{
    T data[N];
};

// Template arguments include non-types.
// Need to provide the following specializations by hand:
namespace goopax {
template<typename T, size_t N>
struct goopax_struct_type<array<T, N>>
{
    using type = T;
};

template<typename T, size_t N, typename X>
struct goopax_struct_changetype<array<T, N>, X>
{
    using type = array<typename goopax_struct_changetype<T, X>::type, N>;
};
}

Memory Access

Buffers and resources of the new type can be declared in the same way as buffers with intrinsic types, e.g.:

...
resource<complex<float>> my_global_resource;
local_mem<complex<float>> my_local_resource(local_size());
...

int main()
{
  ...
  buffer<complex<float>> my_buffer(default_device(), size);
}

The new type can be accessed in the usual C++ way. Some examples are shown here:

// assigning one item
my_global_resource[global_id()].real = 2.3;

// copying a complete element
my_local_resource[local_id()] = my_global_resource[12];

// modifying one item
my_global_resource[global_id()].imag += 25;

Images

Special data access is provided for images and image arrays. From the host code, images can be allocated with objects of type “image_buffer”. From the device code, images can be accessed by using objects of type “image_resource”. For more information, see the goopax reference.

GPU Kernels

Kernels are the functions that run on the video card and are typically used for the computationally demanding calculations.

Writing Kernel Classes

GPU kernels are specified as objects of type goopax::kernel<arg_t>, where arg_t is the function type. The device and the function containing the kernel code are passed to the constructor.

Calling a Kernel

The kernel is executed by calling the ’()’ operator and passing the required buffers as arguments for all the unspecified resources. For example, if the kernel declares two resources

kernel<void(buffer<float>& A, const buffer<int>& B)>
my_kernel(default_device(),
          [](resource<float>& A, const resource<int>& B)
{
    ...
});

or, with C++17 template type deduction guides,

kernel
my_kernel(default_device(),
          [](resource<float>& A, const resource<int>& B)
{
    ...
});

then the kernel must be called with two buffers as arguments of the same type and in the same order, i.e.

goopax_device device = default_device();
buffer<float> A(device, 100);
buffer<int> B(device, device.default_global_size());
my_kernel(A, B);

All kernel calls are asynchronous. The call will, in most cases, return immediately. Only when the results are accessed does the CPU wait for the kernel to finish.
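For example (a sketch, reusing the buffers from above and the pointer-based copy_to_host overload assumed in the section on data transfer):

my_kernel(A, B);                  // returns immediately; the kernel runs asynchronously
std::vector<float> result(100);
A.copy_to_host(result.data());    // accessing the results waits for the kernel to finish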

Valid Kernel Arguments

Valid kernel arguments are buffers, input values, gather values, and images. The CPU argument type from the kernel call must correspond to the matching GPU argument type in the kernel definition:

CPU argument type             Kernel argument type
T                             gpu_T
buffer<T>&                    resource<T>&
const buffer<T>&              const resource<T>&
image_buffer<DIM, T>&         image_resource<DIM, T>&
image_array_buffer<DIM, T>&   image_array_resource<DIM, T>&
goopax_future<T>              gather_result<T>&

Here, T can be any intrinsic type or goopax struct. gpu_T is the corresponding GPU type.

Gathering Return Values

Return values can be combined from all threads and reduced into single values. The corresponding kernel argument has the type gather<T, OP>, where T is the value type and OP the binary operation to be performed. For an example, see the gather.cpp example program.

goopax_future

Gather values are wrapped into objects of type “goopax_future”. Their behavior is similar to “std::future” of the standard C++ library. The actual values are returned by the “goopax_future::get()” function. If the return type of the kernel function is void, goopax_future<void> is returned when calling the kernel. Any goopax_future return object can be used to query information about the number of threads in the kernel, or the execution status.

Gathering values as references

When using references, multiple gather values can be used.
Example:

kernel testprog(device, [](gather<int, std::plus<>>& sumID, gather<int, goopax::op_max>& maxID)
{
    sumID = global_id();
    maxID = global_id();
});

goopax_future<int> sumID;
goopax_future<int> maxID;
testprog(sumID, maxID);
cout << "sum of all ids=" << sumID.get() << endl
     << "maximum id=" << maxID.get() << endl;

Gathering values as return values

Returning gathered values is done as shown in the following example:

kernel testprog(device, []() -> gather_result<float>
{
    return gather_add(global_id());
});

goopax_future<float> sum = testprog();
cout << "sum of thread IDs=" << sum.get() << endl;

Data Types

Basic GPU Data Types

Special data types are used to declare local variables that reside on the video card. The programming of GPU kernels relies on these data types. The following basic types are available:

GPU type               Corresponding CPU type
gpu_type<T>, gpu_T     T (T is an intrinsic type)
gpu_type<S*>           S* (S is an intrinsic type or a user-defined struct)
gpu_type<const S*>     const S* (S is an intrinsic type or a user-defined struct)

Valid intrinsic types are float, double, half, any integral type (signed or unsigned, 8, 16, 32, or 64 bits), or bool.

Variables of GPU type are used in the kernel function, or as local variables in other functions and classes that are called from the kernel function. They may only be used within the call stack of the kernel function.

CPU Types and GPU Types

In kernel development, we distinguish between variables of CPU type and of GPU type. This is not to be confused with the device type. Regardless of the device type, a CPU variable is an ordinary variable of the main program, whereas a GPU variable is a variable (typically a register) that resides on the device. Although kernel development relies on the use of GPU types, using CPU types within a kernel can sometimes simplify programming and improve performance, as they are treated as constant expressions from the perspective of the GPU kernel.

Both CPU and GPU types can be used together in kernel programs. Instructions that use CPU types are calculated during kernel compilation. They are the equivalent of what constant expressions are in ordinary programs, and they will not use any GPU resources.

Hence, variables can be sorted into four different types of life cycle.

Matching CPU and GPU Types

It is sometimes necessary to get the corresponding CPU type from a GPU type or vice versa, especially when the type in question is a template parameter. This can be done with the structs goopax::make_cpu, and goopax::make_gpu. For a given input type T,

typename make_cpu<T>::type

is the CPU type, and

typename make_gpu<T>::type

is the GPU type. The type T can be a fundamental type, or a goopax struct.
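For example (a compile-time sketch, assuming the usual correspondence between float/gpu_float and int/gpu_int):

#include <type_traits>

static_assert(std::is_same<make_gpu<float>::type, gpu_float>::value,
              "make_gpu maps a CPU type to its GPU counterpart");
static_assert(std::is_same<make_cpu<gpu_int>::type, int>::value,
              "make_cpu maps a GPU type back to its CPU counterpart");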

Changing GPU/CPU Types

When writing template functions that should be valid both for CPU types and for GPU types, it is sometimes necessary to specify a type without knowing whether it will be used by the CPU or by the GPU.

The “change_gpu_mode” struct generates the appropriate type for either CPU or GPU context. It takes the desired CPU type as the first template argument; the second template argument determines whether the result is a CPU or a GPU type.

  // T may be, e.g., float or gpu_float, or goopax::debug::types::Tfloat
  template <class T> struct foo
  {
    // If T is float, then D is double. If T is gpu_float, then D is gpu_double.
    using D = typename change_gpu_mode<double, T>::type;
  };

Type Conversion Rules

Implicit Type Conversion

To avoid performance pitfalls, implicit type conversion is stricter than the usual type conversion of C++. Conversion is only done automatically if the precision of the resulting type is at least as large as that of the source type. Type conversion is also done implicitly from integral types to floating point types, but not in the other direction, from floating point types to integral types. No implicit type conversion is done from signed integral types to unsigned integral types.
Some examples of implicit type conversion:

  gpu_float a = 2.5;       // ok, converting CPU double to GPU float.
  gpu_double b = -3.7;
  gpu_float c = b;         // Error: No implicit conversion from gpu_double to gpu_float.
  gpu_float c = static_cast<gpu_float>(b);  // ok, explicit type conversion.
  gpu_double d = a + b;    // ok, implicit conversion to type with higher precision.

Explicit Type Conversion

If the type conversion is not done implicitly, the type conversion can still be done explicitly in the usual C/C++ way, for example:

gpu_int64 a = 12345;
gpu_int b = (gpu_int)a;
gpu_int16 c = gpu_int16(b);
gpu_uint d = static_cast<gpu_uint>(b);

Type Conversion from CPU to GPU

Types are implicitly converted from CPU type to GPU type. The rules are slightly relaxed, as compared to GPU/GPU type conversion: CPU types can implicitly be converted to GPU types of lower precision. However, conversion from a CPU floating point type to a GPU integer type still requires explicit conversion.
Some examples:

int a = 5;
gpu_int b = a;
gpu_float c = 3.0;  // Conversion from double to gpu_float is ok.
gpu_int d = 3.0;    // Error: no implicit conversion from CPU floating point to GPU int.

Conversion in the other direction is not possible. A GPU type cannot be converted to a CPU type.

Reinterpret

Sometimes it is necessary to change the data type of a value without changing the binary content. The reinterpret function provides a general-purpose conversion mechanism. It is defined as:

template <class TO, class FROM>
  TO reinterpret(const FROM& from);

The source data is provided as a function parameter, and the destination type is provided as template argument. The “reinterpret” function can be used on various data types (GPU types, CPU types, user-defined GPU classes, pointers, buffers).
Some Examples:

gpu_int a = 5;

gpu_float b = reinterpret<gpu_float>(a);       // Reinterpreting gpu_int to gpu_float

gpu_double c = reinterpret<gpu_double>(a);     // Error: different size!

array<gpu_int, 2> v = {{2, 3}};
gpu_double d = reinterpret<gpu_double>(v);     // Array of 2 gpu_ints to gpu_double.

int i = 2;
float f = reinterpret<float>(i);               // bit-casting a CPU value.

buffer<float> bf(device, 10);
buffer<int> bi = reinterpret<buffer<int>>(bf); // changing buffer value type

gpu_type<int*> p;
gpu_type<float*> pf = reinterpret<float*>(p);  // changing pointer type

Thread Model

The major difference between CPUs and GPUs is the level of parallelism. While CPUs are designed for serial computation, GPUs are massively parallel, with thousands of threads working in parallel on a common goal.

Thread Numbers

Kernels are executed by all GPU threads in parallel. Each thread can query its ID by the functions local_id(), group_id(), and global_id() as described below. The threads are organized into several work groups, where each work group in turn consists of several local threads. Depending on the hardware, the threads in a work group may or may not work in unison, executing SIMD instructions that apply to all threads in that group at once. In any case, threads within a work group have better means of communicating with each other than threads that are in different work groups.

The number of local threads per group is given by local_size(). The number of groups is given by num_groups(). The total number of threads is therefore the product

global_size() = num_groups() * local_size().

Within a kernel, the local thread id can be queried by local_id(), the group id by group_id(), and the global id by global_id(). The global id is calculated as

global_id() == group_id() * local_size() + local_id().

For simple programs that can easily be parallelized, it may be sufficient to ignore the detailed thread model and to simply assume that the total number of threads is global_size(), and that each individual thread has the ID global_id(). For more complex programs, it may be beneficial to take the thread model into account and to separate the threads into work groups.
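As a sketch, a kernel that processes an N-element buffer without making any assumption about the actual thread numbers could look like this (names and sizes are illustrative; gpu_for is described in the section on loops):

goopax_device device = default_device();
const unsigned int N = 100000;
buffer<float> data(device, N);

kernel scale(device, [N, &data]()
{
    resource<float> R(data);
    // Each thread handles the elements global_id(), global_id()+global_size(), ...
    gpu_for(global_id(), N, global_size(), [&](gpu_uint i)
    {
        R[i] = R[i] * 2.0f;
    });
});
scale();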

The group size and the number of groups can vary from one video card to another. The programmer should normally not make assumptions about these values, but always query them from local_size() and num_groups(). Although it is possible to set these values manually with the member functions “force_local_size()”, “force_num_groups()”, etc. (see reference guide), one should normally not do this and let Goopax decide which thread sizes to use.

How Goopax chooses suitable thread numbers

If the number of threads is not specified by the user, Goopax uses the following heuristics:

If a kernel uses only a few registers, num_groups will be large. This allows the graphics card to better hide memory latencies. If a kernel uses many registers, num_groups will be smaller.

Threads of the same Work Group

The threads of a work group are assumed to work in sync. They cannot branch off into different if-clauses or loops. Whenever only some threads of a work group enter an if-clause, the other threads in the group must wait. The same is true for loops. To optimize performance, one should usually make sure that either all threads in a work group enter an if-clause or none, and that the number of loop iterations is similar or equal for all threads in the group, in order to avoid waiting times.

Threads of the same work group can communicate with each other via local memory or global memory.

Different Thread Groups

Different work groups can branch off into different if-clauses or for-loops. However, they cannot easily communicate with each other within one kernel execution. Atomic operations offer some means of communication between different work groups.

Another possibility for communication between threads in different work-groups is to wait until the kernel execution has finished. Between two kernel calls it is guaranteed that all memory resources are synchronized.

Flow Control

Conditional Function

Simple if-statements can be expressed by a conditional function (the equivalent of the C/C++ conditional operator “a ? b : c”):

  T cond(gpu_bool condition, T value_if_true, T value_if_false)
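For example, to compute an absolute value without branching:

gpu_float x = -1.5f;
gpu_float absval = cond(x >= 0.0f, x, -x);   // the GPU equivalent of (x >= 0 ? x : -x)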

If Clauses

For more complex statements that cannot be expressed as a conditional move, gpu_if can be used instead. Example:

gpu_if (local_id() == 0)
{
  a = 17;
}

Loops

gpu_for

Usage:

gpu_for<typename comp_t=std::less<>>(begin, end [,step], <lambda>);

Examples:

gpu_for(0, 10, [&](gpu_uint i)  // Counts from 0 to 9.
{
  ...
});

gpu_for<std::less_equal<>>(0, 10, 2, [&](gpu_uint i)  // Counts from 0 to 10 in steps of 2.
{
  ...
});

gpu_for_global

The gpu_for_global loop is parallelized over all threads. It is provided for convenience.

gpu_for_global(0, N, [&](gpu_uint i)
{
  ...
});

is equivalent to

gpu_for(global_id(), N, global_size(), [&](gpu_uint i)
{
  ...
});

gpu_for_local

The gpu_for_local loop is parallelized over all threads in a work-group.

gpu_for_local(0, N, [&](gpu_uint i)
{
  ...
});

is equivalent to

gpu_for(local_id(), N, local_size(), [&](gpu_uint i)
{
  ...
});

gpu_break

Breaks out of a loop. It is called as a function of the loop variable. Usage example:

gpu_for(0, 10, [&](gpu_int i)
{
  gpu_if(i==5)
  {
    i.gpu_break();
  }
});

It is also possible to break out of a two-dimensional loop:

gpu_for(0, 10, [&](gpu_uint a)
{
  gpu_for(0, 10, [&](gpu_uint b)
  {
    gpu_if(a + b == 15)
    {
      a.gpu_break();   // Break out of both loops.
    }
  });
});

C-Style Loops

Traditional for-loops can also be used. They require that all loop boundaries are CPU values and therefore known to Goopax. The advantage is that the loop variable is also a CPU value, so that it can be used to address elements in vectors or arrays. In the following example, the sum of all elements of a 3D vector is calculated. The loop will be explicitly unrolled and all the calculation is done in registers.

std::array<gpu_float, 3> x = {{1,2,3}};    // A 3D vector

gpu_float sum = 0;
for (int k=0; k<3; ++k)
{
  sum += x[k];
}

Or, equivalently,

gpu_float sum = std::accumulate(x.begin(), x.end(), gpu_float(0));

Warning: Using traditional C-style loops results in explicit loop unrolling. They should only be used in GPU kernels when the number of loop cycles is reasonably low!

Atomic Operations

Atomic memory operations are guaranteed to be indivisible and thread-safe. For example, if two threads atomically increase the value at the same memory location by 1, then the value will be increased by 2 in total, without the danger of running into a race condition.

Atomic memory operations are supported on both global and local memory, and, if supported by the graphics card, on system memory. The referenced memory objects must be 32-bit or 64-bit integers. Atomic operations on 64-bit integers are not supported on all devices.

Example:

void program(resource<Tuint>& a)
{
  // Each thread adds 5 to a[0].
  gpu_uint k = atomic_add(a[0], 5);
}

General Programming Guidelines

Programming with Goopax is generally safe. Most programming errors are detected at compile-time, some are detected at program run-time. Others can be detected by the extensive error checking mechanisms. However, there are a few things one should keep in mind.

Using existing non-goopax libraries

Many existing template libraries can be used with goopax by using GPU types as template arguments. This is generally safe: if the code compiles successfully, it will work. If a function cannot be used because it contains if-clauses or loops, the compiler will produce an error.

Asynchronous calls

Calling a GPU kernel is generally asynchronous. The CPU returns immediately, and the GPU starts its computations.

Hardware-accelerated operations

Goopax will try to use hardware-optimized operations wherever possible. Some operations are not natively supported on all devices (e.g., atomic operations on type float, or thread shuffle functions). They will still work, but they may be emulated. It is the responsibility of the programmer to choose these operations wisely in order to achieve good performance.

Compiler optimizations

Many optimizations are done by the goopax compiler, most notably common subexpression elimination and memory access optimization. Keeping this in mind can simplify programming. It is not necessary to introduce additional variables by hand; the compiler will do what it can to combine common subexpressions automatically. Multiple reads and writes to the same memory address will be combined wherever possible.

Move semantics

Goopax classes generally use C++ move semantics.

Indices

In functions that take indices (e.g., for buffer sizes, or copy operations), these indices are always in units of the type, never in bytes.

Zero-size buffers and operations

Zero sizes are always allowed. A buffer of size zero simply must not be accessed. A fill operation of zero range does nothing.

CPU loops and Recursive Calls

Using C/C++-style loops and recursive function calls can result in very fast kernel code. However, because the resulting code is completely unrolled, this can quickly use up the available registers. Only use CPU loops if the number of iterations is small.

Break, return, continue

Use caution when using these statements in kernels.

break, continue

Never use break or continue statements within a gpu_if clause or within GPU loops. Use gpu_break or gpu_continue instead.
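A sketch of gpu_continue, assuming it is invoked on the loop variable in the same way as gpu_break:

gpu_for(0, 10, [&](gpu_int i)
{
  gpu_if(i == 3)
  {
    i.gpu_continue();   // instead of 'continue'
  }
  ...
});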

return

Never use the return statement within a gpu_if clause or within GPU loops. Always return values at the end of the function.

CPU/GPU interoperability

It is often useful to define functions and classes in such a way that they can be used both from CPU code and from GPU code, depending on the type of the template parameters. This can result in more compact, more maintainable code. Simple functions may work without modification. If a function contains if-clauses or loops, additional modifications may have to be applied. To simplify programming, certain goopax statements can be used both in CPU code and in GPU code:

gpu_for, gpu_for_(global|local|group), for_each, for_each_(global|local|group)

If the loop variable is of CPU type, then these loops will behave as normal C++ loops and can be used in CPU code. The _global, _local, and _group variants are then treated like their plain gpu_for or for_each counterparts, respectively.
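For example, with a CPU loop variable the same gpu_for call behaves like an ordinary loop and may also appear in host code (a sketch):

gpu_for(0, 4, [&](unsigned int i)   // i is a CPU value here
{
    std::cout << i << std::endl;    // executed on the CPU, like a normal for loop
});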

gpu_if, gpu_else, gpu_elseif

If the conditional variable can be converted to bool, these statements will be treated as normal C/C++ if/else statements.

Number of Threads

The number of threads used in kernel execution is typically not specified by the programmer, but is chosen by goopax, depending on how many threads work best on the GPU. The number of threads may vary from kernel to kernel, depending on register use. Upper limits are provided by global_size(), local_size(), and num_groups().

Operators and Functions

All the usual C++ operators and math functions have been overloaded for the GPU types and may freely be used. Only the conditional operator (“a ? b : c”) must be replaced by the conditional function cond.

Operators

Operators for Floating Point numbers and Integers

Operators for Integers

Boolean Operators

Floating Point Functions

Mathematical functions on GPU types can be used in the same way as they are used on CPU types, for example:

  gpu_float x = 0.5;
  gpu_float s = exp(x);

Goopax will use the best implementation for the given video card, based on performance measurements. It will select between native functions, OpenCL implementations, and Goopax implementations.

Unary Functions

Integer Functions

T clz(T)

Returns the number of leading zeros.

gpu_uint popcount(T)

Counts the number of bits set to 1.

T rotl(T a, gpu_int bits)

Rotate left. bits can be any number, positive, negative, or zero.

T rotr(T a, gpu_int bits)

Rotate right. bits can be any number, positive, negative, or zero.

Functions for Integers and Floats

gpu_T min(gpu_T a, gpu_T b)

Returns the minimum value of a and b.

gpu_T max(gpu_T a, gpu_T b)

Returns the maximum value of a and b.

gpu_TU abs(gpu_T a)

Returns the absolute value of a. If a is a signed integer, the result is an unsigned integer.

Work-Group Functions

gpu_bool work_group_any(gpu_bool x)

Returns true if x is true for any thread in the work-group.

gpu_bool work_group_all(gpu_bool x)

Returns true if x is true for all threads in the work-group.

T work_group_reduce_add(T x)

Returns the sum of all values of x in the work-group.

T work_group_reduce_min(T x)

Returns the minimum value of x in the work-group.

T work_group_reduce_max(T x)

Returns the maximum value of x in the work-group.

T work_group_broadcast(T x, gpu_uint local_id)

Broadcasts the value x from the thread with the given local_id to all threads in the work-group.

T work_group_scan_inclusive_<op>(T x)

T work_group_scan_exclusive_<op>(T x)

Performs a prefix-scan (prefix-sum) operation over all threads in the work-group. The operation <op> may be add, min, or max.
Example:

local_id                        0  1  2  3  4  …
x                               1  0  2  5  1
work_group_scan_inclusive_add   1  1  3  8  9
work_group_scan_exclusive_add   0  1  1  3  8
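A minimal sketch using a work-group reduction to produce one partial sum per work group (kernel and buffer names are illustrative; ‘in’ is assumed to have global_size() elements and ‘out’ num_groups() elements):

kernel group_sums(device, [](resource<float>& in, resource<float>& out)
{
    gpu_float partial = work_group_reduce_add(in[global_id()]);
    gpu_if(local_id() == 0)
    {
        out[group_id()] = partial;   // one result per work group
    }
});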

Random Numbers

Goopax provides a WELL512 random number generator. It is very fast – all the calculation is done in registers – with the caveat that the random numbers should be consumed in blocks.

To use the random numbers, it is first necessary to create an object of type WELL512, which contains a buffer object to store the seed values. Then, in the GPU kernel program, an object of type WELL512lib should be created, which will provide the random numbers. The constructor of WELL512lib takes the WELL512 object as input parameter.

The function WELL512lib::gen_vec<type>() returns a vector of random numbers of the specified type (which can be of floating point or integral type). The size of this vector depends on its type. It is 8 for 64-bit values, 16 for 32-bit values, and larger for smaller integer types. The best performance is achieved if all values in the vector are used before a new vector is generated. Floating point random numbers are in the range 0..1, unsigned integers from 0 to the maximum value, integers from the largest negative value to the largest positive value.

See the following example:

struct random_example :
  kernel<random_example>
{
  WELL512 rnd;                  // Object for the seed values

  void program()
  {
    WELL512lib rnd(this->rnd);  // Instantiate the random number generator

    vector<gpu_float> rnd_values = rnd.gen_vec<float>(); // Generate random floats
    ...
  }
};

Error Checking Mechanisms

Overview

Goopax offers extensive support for automatic error checking. This includes, among other things, the detection of uninitialized variables and of race conditions in buffer accesses.

These error checking mechanisms can be enabled to look for bugs in the program. They are not intended for use in release mode, because they cannot be run on the video card, and they also reduce performance in CPU mode.

Enabling Error Detection Mechanisms

For error detection mechanisms, special debug data types are provided in the namespace goopax::debug::types. Debug types are prefixed by “T”, e.g., “Tfloat”, “Tuint64_t”, or “Tbool”. In contrast to the corresponding intrinsic types, debug types will detect and report the use of uninitialized variables. If used in buffers, race conditions will be detected.

Using the Debug Namespace

The recommended way is to import the namespace goopax::debug::types in debug mode and the namespace goopax::release::types in release mode.

Your program could start like this:

#include <goopax>

using namespace goopax;
#if USE_DEBUG_MODE
using namespace goopax::debug::types;
#else
using namespace goopax::release::types;
#endif

Here, the debug mode is enabled by the compiler switch “USE_DEBUG_MODE”.

Enabling Error Checking in Kernels

To enable error checks in the kernels, all buffer, resource, and local_mem types should use T-prefixed types, like this:

struct my_kernel :
  kernel<my_kernel>
{
  buffer<Tdouble> A;

  void program()
  {
    resource<Tdouble> A(this->A);
    local_mem<Tint> B(2*local_size());
    ...
  }
  ...
};
...

This will enable all error checks in the kernels, if the data types from the debug namespace are used.

Extending the Error Checks to the CPU Program

The error checking mechanisms can also be used in the CPU program by using the T-prefixed data types throughout the program instead of intrinsic C++ data types. This provides extensive checks for variable initialization.

This should work in most cases. However, be aware that this may cause compilation errors from time to time (for example, the main function requires plain C++ types, as do constant expressions). Such errors have to be resolved by explicit type conversions or by reverting to the C++ intrinsic types.

Running Programs in Debug Mode

The debug mode is only available on the CPU device. The CPU device can be selected with env_CPU when calling the functions default_device or devices. By default, the number of threads is equal to the number of available CPU cores, and the number of groups is 1. The number of threads can be changed by calling the force_local_size and force_num_groups member functions of the device. This may be helpful to mimic the behavior of the video card.
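A sketch (assuming that force_local_size and force_num_groups take the desired counts as arguments):

goopax_device device = default_device(env_CPU);   // select the CPU device
device.force_local_size(32);                      // mimic a typical GPU work-group size
device.force_num_groups(4);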

Debugging Errors

When an error is detected, an appropriate error message is generated and the program is terminated. A debugger can be used to pin down the point where the error occurs.

To do this, it is helpful to disable optimization and to enable debugging symbols by passing appropriate compiler options with the GOOPAX_CXXADD environment variable.
Example:

GOOPAX_CXXADD='-O0 -ggdb3' gdb ./my_program

OpenCL Interoperability

Goopax can be used in conjunction with existing OpenCL code. For this to work, the same OpenCL platform, context, and device must be used in Goopax and in your OpenCL code, and the same OpenCL queue should be shared between OpenCL and Goopax. Memory buffers can then be shared between Goopax and OpenCL.

To use OpenCL Interoperability, the header file <goopax_cl> must be included:

#include <goopax_cl>

To see how OpenCL interoperability is applied, also see the example programs “cl_interop_1” and “cl_interop_2”.

Accessing Goopax Resources from OpenCL

The following functions provide access to Goopax resources from OpenCL code.

Platform:

 

  cl_platform_id get_cl_platform()

Returns the OpenCL platform that is used by Goopax.

Context:

 

  cl_context get_cl_context()
  cl::Context get_cl_cxx_context()

Returns the OpenCL context that is used by Goopax.

Device:

 

  cl_device_id get_cl_device()

Returns the OpenCL device that is used by Goopax.

Buffers and Images:

 

  template <class BUF> inline cl_mem get_cl_mem(const BUF& buf)
  template <class BUF> inline cl::Buffer get_cl_cxx_buf(const BUF& buf)

Returns the OpenCL memory handle for the Goopax buffer or image.
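For example, a goopax buffer can be handed to existing OpenCL code like this (a sketch; my_cl_kernel and the argument index are illustrative):

buffer<float> buf(device, 256);
cl_mem mem = get_cl_mem(buf);    // OpenCL handle of the goopax buffer
clSetKernelArg(my_cl_kernel, 0, sizeof(cl_mem), &mem);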

OpenGL Interoperability

Goopax can share memory resources with OpenGL.

OpenGL Initialization

To enable OpenGL support, the goopax device must be retrieved from the get_device_from_gl function.

Sharing Buffers and Images with OpenGL

Goopax images and buffers can be created from existing OpenGL objects by using the static member functions “create_from_gl”.

  buffer::create_from_gl(goopax_device device, GLuint GLres, uint64_t cl_flags=CL_MEM_READ_WRITE)
  image_buffer::create_from_gl(GLuint GLres, uint64_t cl_flags=CL_MEM_READ_WRITE,
                               GLuint GLtarget=GL_TEXTURE_2D, GLint miplevel=0)
GLres:

The OpenGL object ID.

cl_flags:

The OpenCL access mode.

GLtarget:

The OpenGL object type.

miplevel:

The OpenGL mip level.

For example, if “gl_id” is the ID of a 2-dimensional OpenGL texture with 32 bit data size, then

  image_buffer<2, uint32_t> A = image_buffer<2, uint32_t>::create_from_gl(gl_id)

creates an image “A” that can be used with Goopax.

Environment Variables

The following environment variables will be used by the goopax library at runtime:

environment variable          meaning                                       default
GOOPAX_ENV                    backends to use                               0xffffffff (env_ALL)
GOOPAX_OPTIMIZE               enable kernel optimization                    1
GOOPAX_VERB                   verbosity level (0 or 1)                      0
GOOPAX_USE_BINARY             use precompiled binary kernels                1
GOOPAX_LDFLAGS                additional linker flags for CPU backend       “”
GOOPAX_CXXFLAGS               compiler flags for CPU backend                system dependent
GOOPAX_CXXADD                 additional compiler flags for CPU backend     “”
GOOPAX_CXX                    compiler for CPU backend                      “cl” on Windows, “c++” otherwise
GOOPAX_CLFLAGS                compiler flags for CL backend                 -cl-std=CL2.0 if supported
GOOPAX_CL_DEVICE_TYPE         device types to use by CL backend             CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_ACCELERATOR
GOOPAX_CUDAFLAGS              compiler flags for CUDA backend               “”
GOOPAX_CUDA_INCLUDE_PATH      path to CUDA include directory                “” (autodetect)
CUDA_PATH                     path to nvrtc*.dll on Windows                 “” (autodetect)
GOOPAX_OSTREAM_BUFFER_SIZE    gpu_ostream buffer size in bytes              4 MB on Apple systems, 16 MB otherwise
GOOPAX_COROUTINES             use coroutines if supported                   0
GOOPAX_THREAD_STACKSIZE       stack size for CPU threads                    0 (system default)

Example Programs

The source code of these programs is included in the goopax package.

pi

This program approximates the value of π in a very simple way: it uses a WELL512 random number generator to produce points in the 2-dimensional square 0 < x < 1, 0 < y < 1 and counts how many of those points lie within a circle of radius 1 around the origin, i.e. x² + y² < 1. This fraction, multiplied by 4, approximates the value of π. The program supports MPI and can be run with the mpirun command to combine the computing power of several hosts or video cards.

Mandelbrot

This program calculates a Mandelbrot image and uses OpenGL or Metal to draw it on the screen. It is an interactive program that can be controlled with the left mouse button and the forward and backward arrow keys.

Deep Zoom Mandelbrot

This is a somewhat more complex Mandelbrot program, combining arbitrary-precision arithmetic on the CPU with a special algorithm of our design. It allows zooming in to large magnification levels (factor 10^80 and more) while still displaying the result in real time.

fft

This program applies Fourier transforms on live camera images, and filters out high frequencies or low frequencies.

nbody

N-body simulation. Two colliding galaxies are simulated, each consisting of particles. Each particle interacts via gravity with all other particles. The particles are displayed via OpenGL or Metal.

matmul

The example program ’matmul’ performs matrix multiplications using the naïve multiplication algorithm. For production purposes, faster algorithms such as Strassen’s algorithm should also be considered. The major performance bottleneck is memory access; three different kernel variants are implemented that use different techniques to reduce access to global memory.