API Documentation

This file provides all the details you need about how to call Kernel Tuner's functions, including all the optional arguments.

kernel_tuner.tune_kernel(kernel_name, kernel_source, problem_size, arguments, tune_params, grid_div_x=None, grid_div_y=None, grid_div_z=None, restrictions=None, answer=None, atol=1e-06, verify=None, verbose=False, lang=None, device=0, platform=0, smem_args=None, cmem_args=None, texmem_args=None, compiler=None, compiler_options=None, defines=None, log=None, iterations=7, block_size_names=None, quiet=False, strategy=None, strategy_options=None, cache=None, metrics=None, simulation_mode=False, observers=None, objective=None, objective_higher_is_better=None)

Tune a CUDA, OpenCL, HIP, or C kernel given a set of tunable parameters

Parameters:
  • kernel_name (string) – The name of the kernel in the code.

  • kernel_source (string or list and/or callable) –

    The CUDA, OpenCL, HIP, or C kernel code. The code may be passed as a string, a filename, a function that returns a string of code, or a list when the code needs auxiliary files.

    To support combined host and device code tuning, a list of filenames can be passed. The first file in the list should be the file that contains the host code. The host code is assumed to include or read in any of the files in the list beyond the first. The tunable parameters can be used within all files.

    Another alternative is to pass a code-generating function that produces the kernel code based on the specific parameter values. This function should take one positional argument, which will be used to pass a dict containing the parameters. The function should return a string with the source code for the kernel.
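
    For example, a minimal sketch of such a generator (the kernel body and all names are illustrative, not part of the API):

        def generate_kernel(params):
            """Hypothetical generator: builds CUDA source from the parameter dict."""
            return f"""
            __global__ void vector_add(float *c, const float *a, const float *b, int n) {{
                int i = blockIdx.x * {params['block_size_x']} + threadIdx.x;
                if (i < n) {{
                    c[i] = a[i] + b[i];
                }}
            }}
            """

        # Passed in place of a string or filename:
        # kernel_tuner.tune_kernel("vector_add", generate_kernel, ...)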

  • lang (string) – Specifies the language used for GPU kernels. Kernel Tuner automatically detects the language, but if detection fails, you may specify it using this argument. Currently supported: “CUDA”, “Cupy”, “OpenCL”, “HIP”, or “C”.

  • problem_size (callable, string, int, or tuple(int or string, ..)) –

    The size of the domain from which the grid dimensions of the kernel are computed.

    This can be specified using an int, a string, a function, or a 1-, 2-, or 3-dimensional tuple.

    In general, do not divide the problem_size yourself by the thread block sizes. Kernel Tuner does this for you based on the tunable parameters named “block_size_x”, “block_size_y”, and “block_size_z”. If more or different parameters divide the grid dimensions, use the grid_div_x/y/z options to specify this.

    In most use-cases the problem_size is specified using a single integer or a tuple of integers, but Kernel Tuner supports more advanced use cases where the problem_size itself depends on the tunable parameters in some way.

    You are allowed to use a function or string to specify the problem_size. A function should accept a dictionary with the tunable parameters for this kernel configuration and directly return a tuple that specifies the problem size in all dimensions.

    When passing a string, you are allowed to write Python arithmetic and use the names of tunable parameters as variables in these expressions. Kernel Tuner will replace instances of the tunable parameters with their current value when computing the grid dimensions. This option exists for convenience, but do note that using a lambda function is probably safer. The string notation should only return the problem size for one dimension, but can be used inside a tuple, possibly in combination with integers or more strings in different dimensions.

    See the reduction CUDA example for an example use of this feature.
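
    As a smaller illustration, the two forms below are equivalent; tile_size_y is a hypothetical tunable parameter:

        # String form: one Python expression per dimension, mixed with an int;
        # tunable parameter names are substituted with their current values.
        problem_size = (4096, "2048/tile_size_y")

        # Function form: receives the dict of tunable parameters and returns
        # the problem size in all dimensions.
        problem_size = lambda p: (4096, 2048 / p["tile_size_y"])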

  • arguments (list) – A list of kernel arguments. Use numpy arrays for arrays and numpy.int32 or numpy.float32 for scalars.

  • grid_div_x (callable or list) –

    A list of names of the parameters whose values divide the grid dimensions in the x-direction. The product of all grid divisor expressions is computed before dividing the problem_size in that dimension. Also note that the division is treated as a float division and the resulting grid dimensions will be rounded up to the nearest integer.

    Arithmetic expressions can be used if necessary inside the string containing a parameter name. For example, in some cases you may want to divide the problem size in the x-dimension by the number of warps rather than the number of threads in a block; in such cases one could use ["block_size_x/32"]. Another option is to pass a function to grid_div_x that accepts a dictionary with the tunable parameters and returns the grid divisor in this dimension, for example: grid_div_x=lambda p: p["block_size_x"]/32.

    If not supplied, ["block_size_x"] will be used by default. If you do not want any grid x-dimension divisors, pass an empty list.
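
    The different forms, sketched:

        # Default behavior, equivalent to not passing grid_div_x:
        grid_div_x = ["block_size_x"]

        # Divide by the number of warps instead (arithmetic in the string):
        grid_div_x = ["block_size_x/32"]

        # The same, written as a function of the tunable parameters:
        grid_div_x = lambda p: p["block_size_x"] / 32

        # Do not divide the problem size in the x-dimension at all:
        grid_div_x = []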

  • grid_div_y (list) – A list of names of the parameters whose values divide the grid dimensions in the y-direction, [“block_size_y”] by default. If you do not want to divide the problem_size, you should pass an empty list. See grid_div_x for more details.

  • grid_div_z (list) – A list of names of the parameters whose values divide the grid dimensions in the z-direction, [“block_size_z”] by default. If you do not want to divide the problem_size, you should pass an empty list. See grid_div_x for more details.

  • smem_args (dict(string: numpy object)) – CUDA-specific feature for specifying shared memory options to the kernel. At the moment only ‘size’ is supported, but setting the shared memory configuration on Kepler GPUs, for example, could be added in the future. ‘size’ should denote the number of bytes to use when dynamically allocating shared memory.

  • cmem_args (dict(string: numpy object)) – CUDA-specific feature for specifying constant memory arguments to the kernel. In OpenCL these are handled as normal kernel arguments, but in CUDA you can copy to a symbol. You specify constant memory arguments by passing a dictionary that maps the constant memory symbol names to numpy objects, in the same way as normal kernel arguments.

  • texmem_args (dict(string: numpy object or dict)) – CUDA-specific feature for specifying texture memory arguments to the kernel. You specify texture memory arguments by passing a dictionary with strings containing the texture reference name together with the texture contents. These contents can be either simply a numpy object, or a dictionary containing the numpy object under the key ‘array’ plus the configuration options ‘filter_mode’ (‘point’ or ‘linear’), ‘address_mode’ (a list of ‘border’, ‘clamp’, ‘mirror’, ‘wrap’ per axis), and ‘normalized_coordinates’ (True/False).
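
    A sketch of all three options; the symbol and texture reference names are hypothetical and must match declarations in your kernel code:

        import numpy as np

        # Dynamically allocated shared memory, in bytes.
        smem_args = {"size": 4096}

        # Constant memory: maps a __constant__ symbol name to a numpy object.
        cmem_args = {"filter_weights": np.ones(9, dtype=np.float32)}

        # Texture memory: a bare numpy object, or a dict with extra options.
        texmem_args = {
            "tex_input": {
                "array": np.ones(1024, dtype=np.float32),
                "filter_mode": "linear",
                "address_mode": ["clamp"],
                "normalized_coordinates": False,
            }
        }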

  • block_size_names (list(string)) – A list of strings that replace the defaults for the names that denote the thread block dimensions. If not passed, the behavior defaults to ["block_size_x", "block_size_y", "block_size_z"].

  • defines (dict) – A dictionary containing the preprocessor definitions inserted into the source code. The keys should be the definition names and each value should be either a string or a function that returns a string. If an empty dictionary is passed, no definitions are inserted. If None is passed, each tunable parameter is inserted as a preprocessor definition.

  • tune_params (dict(string: [...])) –

    A dictionary containing the parameter names as keys, and lists of possible parameter settings as values. Kernel Tuner will try to compile and benchmark all possible combinations of all possible values for all tuning parameters. This typically results in a rather large search space of all possible kernel configurations.

    For each kernel configuration, each tuning parameter is replaced at compile-time with its current value. Currently, Kernel Tuner uses the convention that the following list of tuning parameters are used as thread block dimensions:

    • “block_size_x”: thread block (work group) x-dimension

    • “block_size_y”: thread block (work group) y-dimension

    • “block_size_z”: thread block (work group) z-dimension

    Options for changing these defaults may be added later. If you don’t want the thread block dimensions to be compiled in, you may use the built-in variables blockDim.x/y/z in CUDA or the built-in function get_local_size() in OpenCL instead.
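
    Putting the required arguments together, a minimal runnable sketch (the vector-add kernel and all values are illustrative):

        import numpy as np
        import kernel_tuner

        kernel_string = """
        __global__ void vector_add(float *c, const float *a, const float *b, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                c[i] = a[i] + b[i];
            }
        }
        """

        size = 10_000_000
        a = np.random.randn(size).astype(np.float32)
        b = np.random.randn(size).astype(np.float32)
        c = np.zeros_like(a)
        n = np.int32(size)

        tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

        # Compiles and benchmarks every configuration in the search space.
        results, env = kernel_tuner.tune_kernel(
            "vector_add", kernel_string, size, [c, a, b, n], tune_params)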

  • restrictions (callable or list(string)) – An option to limit the search space with restrictions. The restrictions can be specified using a function or a list of strings. The function should take one argument, namely a dictionary with the tunable parameters of the kernel configuration; if the function returns True the configuration is considered to be part of the search space, and False otherwise. The other way to specify restrictions is using a list of strings containing boolean expressions that must be satisfied by the kernel configuration. These expressions must all be true for the configuration to be part of the search space. For example: restrictions=["block_size_x==block_size_y*tile_size_y"] limits the search to configurations where block_size_x equals the product of block_size_y and tile_size_y. The default is None. Both forms are sketched below.
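
    Both forms of the example restriction above (block_size_y and tile_size_y are hypothetical parameters):

        # String form: every expression must hold for a configuration to be kept.
        restrictions = ["block_size_x==block_size_y*tile_size_y"]

        # Function form: return True to keep the configuration in the search space.
        restrictions = lambda p: p["block_size_x"] == p["block_size_y"] * p["tile_size_y"]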

  • answer (list) – A list of arguments, similar to what you pass to arguments, that contains the expected output of the kernel after execution, with None for each argument that is input-only. The expected output is used to verify the correctness of each kernel in the parameter space before it is benchmarked.
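
    For the vector-add sketch above, the reference output can be computed on the host; input-only arguments are marked with None:

        # Same order as `arguments`: only c (the first argument) is output.
        answer = [a + b, None, None, None]

        results, env = kernel_tuner.tune_kernel(
            "vector_add", kernel_string, size, [c, a, b, n], tune_params,
            answer=answer)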

  • atol (float) – The maximum allowed absolute difference between two elements in the output and the reference answer, as passed to numpy.allclose(). Ignored if you have not passed a reference answer. The default value is 1e-6.

  • verify (func(ref, ans, atol=None)) –

    Python function used for output verification. By default, numpy.allclose is used for output verification. If this does not suit your application, you can pass a different function here.

    The function is expected to have two positional arguments. The first is the reference result, the second is the output computed by the kernel being verified. The types of these arguments depend on the type of the output arguments you are verifying. The function may also have an optional argument named atol, to which the value will be passed that was specified using the atol option to tune_kernel. The function should return True when the output passes the test, and False when the output fails the test.
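
    A minimal sketch of a custom verification function; the partial comparison is purely illustrative:

        import numpy as np

        def verify_output(ref, ans, atol=None):
            """Hypothetical verifier: compare only the first half of the output."""
            if atol is None:
                atol = 1e-6
            n = len(ref) // 2
            return np.allclose(ref[:n], ans[:n], atol=atol)

        # Passed as: tune_kernel(..., answer=answer, verify=verify_output)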

  • strategy (string) –

    Specify the strategy to use for searching through the parameter space, choose from:

    • “basinhopping” Basin Hopping

    • “bayes_opt” Bayesian Optimization

    • “brute_force” (default) iterates through the entire search space

    • “minimize” uses a local minimization algorithm

    • “dual_annealing” dual annealing

    • “diff_evo” differential evolution

    • “firefly_algorithm” firefly algorithm strategy

    • “genetic_algorithm” a genetic algorithm optimization

    • “greedy_ils” greedy randomized iterative local search

    • “greedy_mls” greedy randomized multi-start local search

    • “mls” best-improvement multi-start local search

    • “ordered_greedy_mls” multi-start local search that uses a fixed order

    • “pso” particle swarm optimization

    • “random_sample” takes a random sample of the search space

    • “simulated_annealing” simulated annealing strategy

    Strategy-specific parameters and options are explained under strategy_options.

  • strategy_options (dict) –

    A dict with options specific to the selected tuning strategy.

    All strategies support the following two options:

    1. “max_fevals”: the maximum number of unique valid function evaluations (i.e., compiling and benchmarking a kernel configuration) the strategy is allowed to perform as part of the optimization. Note that some strategies implement a default max_fevals of 100.

    2. “time_limit”: the maximum amount of time in seconds the strategy is allowed to spend on trying to find the optimal kernel configuration. There is no default time limit.

    Strategy-specific options are explained in Optimization strategies.
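
    For example, a bounded random sample instead of the default brute-force search, reusing the vector-add setup above (a sketch):

        results, env = kernel_tuner.tune_kernel(
            "vector_add", kernel_string, size, [c, a, b, n], tune_params,
            strategy="random_sample",
            strategy_options={"max_fevals": 20, "time_limit": 60})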

  • iterations (int) – The number of times a kernel should be executed and its execution time measured when benchmarking a kernel, 7 by default.

  • objective (string) – Optimization objective to sort results on, consisting of a string that also occurs in results as a metric or observed quantity, default ‘time’. Please see Tuning Objectives.

  • objective_higher_is_better (bool) – A boolean that specifies whether the objective should be maximized (True) or minimized (False), default False.

  • verbose (bool) –

    Sets whether or not to report about configurations that were skipped during the search. This can happen for several reasons:

    • kernel configuration fails one or more restrictions

    • too many threads per thread block

    • too much shared memory used by the kernel

    • too many resources requested for launch

    verbose is False by default.

  • cache (string) – Filename for the cache to persistently store benchmarked configurations. Filename uses suffix “.json”, which is appended if missing. If the file exists, it is read and tuning continues from this file. Please see Cache files.

  • metrics (dict) – specifies user-defined metrics, please see Metrics.
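
    For instance, a throughput metric for the vector-add sketch above; this sketch assumes the benchmarked time is available in each configuration record under the key "time", in milliseconds:

        # Each metric maps a configuration record to a value; vector add
        # performs one floating-point operation per element.
        metrics = {"GFLOP/s": lambda p: (size / 1e9) / (p["time"] / 1e3)}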

  • simulation_mode (bool) – Simulate an auto-tuning search from an existing cache file.

  • observers (list) – A list of Observers to use during tuning, please see Observers.

  • device (int) – CUDA/OpenCL device to use, in case you have multiple CUDA-capable GPUs or OpenCL devices you may use this to select one, 0 by default. Ignored if you are tuning host code by passing lang=“C”.

  • platform (int) – OpenCL platform to use, in case you have multiple OpenCL platforms you may use this to select one, 0 by default. Ignored if not using OpenCL.

  • quiet (boolean) – Control whether or not to print to the console which device is being used, False by default.

  • compiler (string) – A string containing your preferred compiler, only effective with lang=“C”.

  • compiler_options (list(string)) – A list of strings that specify compiler options.

Returns:

A list of dictionaries of all executed kernel configurations and their execution times, and a dictionary with information about the environment in which the tuning took place, recording the device name, properties, version info, and so on.

Return type:

list(dict()), dict()

kernel_tuner.run_kernel(kernel_name, kernel_source, problem_size, arguments, params, grid_div_x=None, grid_div_y=None, grid_div_z=None, lang=None, device=0, platform=0, smem_args=None, cmem_args=None, texmem_args=None, compiler=None, compiler_options=None, defines=None, block_size_names=None, quiet=False, log=None)

Compile and run a single kernel

Compiles and runs a single kernel once, given a specific instance of the kernel’s tuning parameters. However, instead of measuring execution time, run_kernel returns the output of the kernel. The output is returned as a list of numpy arrays that contains the state of all the kernel arguments after execution on the GPU.

To summarize what this function will do for you in one call:
  • Compile the kernel according to the set of parameters passed

  • Allocate GPU memory to hold all kernel arguments

  • Move all data to the GPU

  • Execute the kernel on the GPU

  • Copy all data from the GPU back to the host and return it as a list of Numpy arrays

This function was added to Kernel Tuner mostly to allow easy testing of kernel correctness. The interface is intentionally very similar to tune_kernel().
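
A sketch of testing the vector-add kernel for correctness, reusing the names from the tune_kernel example above:

    import numpy as np
    import kernel_tuner

    params = {"block_size_x": 128}

    output = kernel_tuner.run_kernel(
        "vector_add", kernel_string, size, [c, a, b, n], params)

    # output[0] holds the state of c after the kernel has executed.
    assert np.allclose(output[0], a + b, atol=1e-6)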

Parameters:
  • kernel_name (string) – The name of the kernel in the code.

  • kernel_source (string or list and/or callable) –

    The CUDA, OpenCL, HIP, or C kernel code. The code may be passed as a string, a filename, a function that returns a string of code, or a list when the code needs auxiliary files.

    To support combined host and device code tuning, a list of filenames can be passed. The first file in the list should be the file that contains the host code. The host code is assumed to include or read in any of the files in the list beyond the first. The tunable parameters can be used within all files.

    Another alternative is to pass a code-generating function that produces the kernel code based on the specific parameter values. This function should take one positional argument, which will be used to pass a dict containing the parameters. The function should return a string with the source code for the kernel.

  • lang (string) – Specifies the language used for GPU kernels. Kernel Tuner automatically detects the language, but if detection fails, you may specify it using this argument. Currently supported: “CUDA”, “Cupy”, “OpenCL”, “HIP”, or “C”.

  • problem_size (callable, string, int, or tuple(int or string, ..)) –

    The size of the domain from which the grid dimensions of the kernel are computed.

    This can be specified using an int, a string, a function, or a 1-, 2-, or 3-dimensional tuple.

    In general, do not divide the problem_size yourself by the thread block sizes. Kernel Tuner does this for you based on the tunable parameters named “block_size_x”, “block_size_y”, and “block_size_z”. If more or different parameters divide the grid dimensions, use the grid_div_x/y/z options to specify this.

    In most use-cases the problem_size is specified using a single integer or a tuple of integers, but Kernel Tuner supports more advanced use cases where the problem_size itself depends on the tunable parameters in some way.

    You are allowed to use a function or string to specify the problem_size. A function should accept a dictionary with the tunable parameters for this kernel configuration and directly return a tuple that specifies the problem size in all dimensions.

    When passing a string, you are allowed to write Python arithmetic and use the names of tunable parameters as variables in these expressions. Kernel Tuner will replace instances of the tunable parameters with their current value when computing the grid dimensions. This option exists for convenience, but do note that using a lambda function is probably safer. The string notation should only return the problem size for one dimension, but can be used inside a tuple, possibly in combination with integers or more strings in different dimensions.

    See the reduction CUDA example for an example use of this feature.

  • arguments (list) – A list of kernel arguments. Use numpy arrays for arrays and numpy.int32 or numpy.float32 for scalars.

  • grid_div_x (callable or list) –

    A list of names of the parameters whose values divide the grid dimensions in the x-direction. The product of all grid divisor expressions is computed before dividing the problem_size in that dimension. Also note that the division is treated as a float division and the resulting grid dimensions will be rounded up to the nearest integer.

    Arithmetic expressions can be used if necessary inside the string containing a parameter name. For example, in some cases you may want to divide the problem size in the x-dimension by the number of warps rather than the number of threads in a block; in such cases one could use ["block_size_x/32"]. Another option is to pass a function to grid_div_x that accepts a dictionary with the tunable parameters and returns the grid divisor in this dimension, for example: grid_div_x=lambda p: p["block_size_x"]/32.

    If not supplied, ["block_size_x"] will be used by default. If you do not want any grid x-dimension divisors, pass an empty list.

  • grid_div_y (list) – A list of names of the parameters whose values divide the grid dimensions in the y-direction, [“block_size_y”] by default. If you do not want to divide the problem_size, you should pass an empty list. See grid_div_x for more details.

  • grid_div_z (list) – A list of names of the parameters whose values divide the grid dimensions in the z-direction, [“block_size_z”] by default. If you do not want to divide the problem_size, you should pass an empty list. See grid_div_x for more details.

  • smem_args (dict(string: numpy object)) – CUDA-specific feature for specifying shared memory options to the kernel. At the moment only ‘size’ is supported, but setting the shared memory configuration on Kepler GPUs, for example, could be added in the future. ‘size’ should denote the number of bytes to use when dynamically allocating shared memory.

  • cmem_args (dict(string: numpy object)) – CUDA-specific feature for specifying constant memory arguments to the kernel. In OpenCL these are handled as normal kernel arguments, but in CUDA you can copy to a symbol. You specify constant memory arguments by passing a dictionary that maps the constant memory symbol names to numpy objects, in the same way as normal kernel arguments.

  • texmem_args (dict(string: numpy object or dict)) – CUDA-specific feature for specifying texture memory arguments to the kernel. You specify texture memory arguments by passing a dictionary with strings containing the texture reference name together with the texture contents. These contents can be either simply a numpy object, or a dictionary containing the numpy object under the key ‘array’ plus the configuration options ‘filter_mode’ (‘point’ or ‘linear’), ‘address_mode’ (a list of ‘border’, ‘clamp’, ‘mirror’, ‘wrap’ per axis), and ‘normalized_coordinates’ (True/False).

  • block_size_names (list(string)) – A list of strings that replace the defaults for the names that denote the thread block dimensions. If not passed, the behavior defaults to ["block_size_x", "block_size_y", "block_size_z"].

  • defines (dict) – A dictionary containing the preprocessor definitions inserted into the source code. The keys should be the definition names and each value should be either a string or a function that returns a string. If an empty dictionary is passed, no definitions are inserted. If None is passed, each tunable parameter is inserted as a preprocessor definition.

  • params (dict( string: int )) – A dictionary containing the tuning parameter names as keys and a single value per tuning parameter as values.

  • device (int) – CUDA/OpenCL device to use, in case you have multiple CUDA-capable GPUs or OpenCL devices you may use this to select one, 0 by default. Ignored if you are tuning host code by passing lang=“C”.

  • platform (int) – OpenCL platform to use, in case you have multiple OpenCL platforms you may use this to select one, 0 by default. Ignored if not using OpenCL.

  • quiet (boolean) – Control whether or not to print to the console which device is being used, False by default.

  • compiler (string) – A string containing your preferred compiler, only effective with lang=“C”.

  • compiler_options (list(string)) – A list of strings that specify compiler options.

Returns:

A list of numpy arrays, similar to the arguments passed to this function, containing the output after kernel execution.

Return type:

list

kernel_tuner.store_results(results_filename, kernel_name, kernel_string, tune_params, problem_size, results, env, top=3, objective=None, objective_higher_is_better=None)

stores tuning results to a JSON file

Stores the best kernel configurations (the top 3% by default) in a JSON file. The results are stored for a specific device (retrieved using env[‘device_name’]) and for a specific problem_size. If the file already exists, new results for this device and problem_size will be appended. Any previous results already stored in the file for this specific device and problem_size will be overwritten.
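
A sketch, reusing the vector-add names from the tune_kernel example above:

    results, env = kernel_tuner.tune_kernel(
        "vector_add", kernel_string, size, [c, a, b, n], tune_params)

    # Store the top 3% of configurations for this device and problem size.
    kernel_tuner.store_results(
        "vector_add_results.json", "vector_add", kernel_string,
        tune_params, size, results, env, top=3)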

Parameters:
  • results_filename (string) – Filename of the JSON file in which the results will be stored. Results will be appended if the file already exists. Existing results within the file for the same device and problem_size will be overwritten.

  • tune_params (dict) – The tunable parameters of this kernel.

  • problem_size (tuple) – The problem_size this kernel was tuned for.

  • results (list(dict)) – A list of dictionaries of all executed kernel configurations and their execution times, and possibly other user-defined metrics, as returned by tune_kernel().

  • env (dict) – A dictionary with information about the environment in which the tuning took place. This records device name, properties, version info, and so on. Typically this dictionary is returned by tune_kernel().

  • top (float) – Denotes the top percentage of results to store in the results file.

  • objective (string) – Optimization objective to sort results on, consisting of a string that also occurs in results as a metric.

  • objective_higher_is_better (bool) – A boolean that specifies whether the objective should be maximized or minimized.

kernel_tuner.create_device_targets(header_filename, results_filename, objective=None, objective_higher_is_better=None)

create a header with device targets

This function generates a header file with device targets for compiling a kernel with different parameters on different devices. The tuning results are read from a JSON file created by store_results. An existing header_filename will be overwritten.

This function only creates device targets and does not create problem_size-specific targets. Instead, it searches for configurations that perform well for different problem sizes and selects a single configuration to use for the kernel.

The header file can be included in a kernel source file using: #include "header_filename.h"

The kernel can then be compiled for a specific device using: -DTARGET_GPU="name_of_gpu"

The header will also include a default value, which is chosen to perform well on different devices.
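
A sketch, using the results file from the store_results example above:

    import kernel_tuner

    # Writes (and overwrites) vector_add_params.h with one well-performing
    # configuration per device, plus a default.
    kernel_tuner.create_device_targets(
        "vector_add_params.h", "vector_add_results.json")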

Parameters:
  • header_filename (string) – Filename of the header file to be created.

  • results_filename (string) – Filename of the JSON file that stores the tuning results.

  • objective (string) – Optimization objective to sort results on, consisting of a string that also occurs in results as a metric.

  • objective_higher_is_better (bool) – A boolean that specifies whether the objective should be maximized or minimized.