# 5. Multidimensionality

So far, only 1-dimensional convolutions have been considered. Most of the insights regarding one-dimensional convolutions apply as-is to multidimensional convolutions (which are the common case in practice). Yet, some additional issues arise as the dimension increases.

They come in two flavours: technical and algorithmic. The technical issues this section addresses are image memory objects and shared-memory bank conflicts, and the algorithmic issues revolve around ways to compute multidimensional convolutions via 1-dimensional convolutions, and include separable filters, the row-column method, and the Helix transform. For readability, I shall focus on the 2-dimensional case.

## 5.1. Separability

As for algorithmic issues for multidimensional layers, let's start with
**separable filters**. Those are multidimensional convolutions that can be
performed as 1-dimensional convolutions done separately along each dimension.
In the 2-dimensional case, this simply means that the template-matrix is of
rank 1. Now, depending on the context in which the convolutional layer is used
in the neural network, it could make sense to learn such filters (for example,
it's possible to learn an edge-detector for input images that way).

This is a design issue, and it's implemented by maintaining 2 weight-vectors of length $(2M+1)$ instead of 1 weight-matrix with $(2M+1)^2$ entries. The outer-product of those vectors is used as the theoretical template. In practice, though, such layers can be implemented as 2 successive 1-dimensional layers: the first operates on the rows, and the second operates on the columns. Such a separation also makes backpropagation simpler.
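As a concrete sketch (using NumPy; the specific weight-vectors and image below are made up for illustration), here is a rank-1 template applied as two successive 1-dimensional passes, together with a direct 2-dimensional convolution used only as a reference:

```python
import numpy as np

def conv2d_same(image, template):
    """Direct 'same'-sized 2-D convolution, used only as a reference."""
    M = template.shape[0] // 2
    padded = np.pad(image, M)
    flipped = template[::-1, ::-1]  # convolution flips the template
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + 2 * M + 1, x:x + 2 * M + 1] * flipped)
    return out

# A rank-1 template: the outer product of two weight-vectors of length 2M+1.
rows = np.array([1.0, 2.0, 1.0])   # the pass over the columns
cols = np.array([1.0, 0.0, -1.0])  # the pass over the rows
template = np.outer(rows, cols)

image = np.random.default_rng(0).random((16, 16))

# Two successive 1-dimensional layers: first along the rows, then the columns.
pass1 = np.apply_along_axis(lambda r: np.convolve(r, cols, mode='same'), 1, image)
pass2 = np.apply_along_axis(lambda c: np.convolve(c, rows, mode='same'), 0, pass1)
```

The two 1-dimensional passes produce exactly the same result as convolving with the full rank-1 template.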

## 5.2. The Row-Column Method and The Helix Transform

Let $a\in \mathbb{C}^{N_1\times N_2}$, and denote by $\vec{a}_n$ the $n$-th row of $a$ (which is a vector of length $N_2$). Then by definition, its 2-dimensional discrete Fourier transform is $$\mathcal{F}[a]_{n_1,n_2}:=\sum_{k_1=0}^{N_1-1}\sum_{k_2=0}^{N_2-1}a_{k_1,k_2}\omega_{N_1}^{n_1k_1}\omega_{N_2}^{n_2k_2}=\sum_{k_1=0}^{N_1-1}\omega_{N_1}^{n_1k_1}\left(\sum_{k_2=0}^{N_2-1}a_{k_1,k_2}\omega_{N_2}^{n_2k_2}\right)=\sum_{k_1=0}^{N_1-1}\mathcal{F}[\vec{a}_{k_1}]_{n_2}\,\omega_{N_1}^{n_1k_1}$$

So instead of a 2-dimensional DFT, we can perform $N_1$ 1-dimensional DFTs over $a$'s rows, and then another $N_2$ 1-dimensional DFTs over the columns of the transformed matrix.

This is the "row-column method". Apart from the much-appreciated option to reuse efficient 1-dimensional code for multidimensional problems, this method is remarkably efficient asymptotically: by using FFTs we can perform a 1-dimensional DFT in $O(N\log{N})$ steps instead of $O(N^2)$ steps, so the row-column method allows us to use the same algorithm to perform a 2-dimensional DFT in $O(N^2\log{N})$ steps instead of $O(N^4)$ steps. It follows that this method makes 2-dimensional FFTs relatively more efficient than 1-dimensional FFTs: the speedup factor grows from $N/\log{N}$ to $N^2/\log{N}$.
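The method is easy to sanity-check with NumPy (the array shape below is arbitrary): applying 1-dimensional FFTs along the rows and then along the columns reproduces the direct 2-dimensional transform.

```python
import numpy as np

a = np.random.default_rng(0).random((8, 16))  # an N1 x N2 matrix

step1 = np.fft.fft(a, axis=1)           # N1 one-dimensional DFTs, one per row
row_column = np.fft.fft(step1, axis=0)  # N2 one-dimensional DFTs, one per column

# The row-column result agrees with the direct 2-dimensional transform.
assert np.allclose(row_column, np.fft.fft2(a))
```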

Not surprisingly, the complication in writing an efficient parallelization of the row-column method is, again, essentially due to the "transpose" operation between the rows-phase and the columns-phase.

A very similar method for translating a multidimensional convolution into a 1-dimensional convolution is the **Helix transform**. This is a fancy name for treating the matrices involved as vectors with column-major layout.
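A minimal NumPy sketch of the idea (the function name and the padding scheme are my own; the key point is the column-major `ravel`): pad both operands' columns to a common height, flatten them column-major, and a single 1-dimensional convolution yields the flattened 2-dimensional convolution.

```python
import numpy as np

def helix_convolve(a, h):
    """Full 2-D convolution computed as a single 1-D convolution."""
    (n1, n2), (k1, k2) = a.shape, h.shape
    p = n1 + k1 - 1  # the helix "pitch": the common padded column height
    pad = lambda m: np.pad(m, ((0, p - m.shape[0]), (0, 0)))
    # Flatten both operands in column-major layout and convolve once.
    v = np.convolve(pad(a).ravel(order='F'), pad(h).ravel(order='F'))
    # The leading p*(n2+k2-1) entries are the flattened 2-D convolution.
    return v[:p * (n2 + k2 - 1)].reshape((p, n2 + k2 - 1), order='F')

# Convolving two 2x2 all-ones matrices gives the familiar "pyramid".
print(helix_convolve(np.ones((2, 2)), np.ones((2, 2))))
# [[1. 2. 1.]
#  [2. 4. 2.]
#  [1. 2. 1.]]
```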

## 5.3. Image Memory Objects

Previously, the input for the kernel was placed inside a `__global` memory
buffer, and cached into the local memory by the `cache_tile` method. But it
could have been better to use **image memory objects** instead. Those are
opaque memory objects that can be accessed via coordinates, much like global
multidimensional arrays. They are restricted to 1, 2 or 3 dimensions. This is
hardly a problem for most current use-cases of convolutional networks, but it
is a restriction nonetheless.

The benefits they provide include optimized caching of multidimensional data and automatic handling of out-of-bounds reads. Their opaqueness allows the runtime to do some neat things, such as directly loading 2-dimensional arrays from the host into a Morton-order layout on the device, to improve data locality.
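For illustration, a Morton-order (Z-order) index simply interleaves the bits of the two coordinates, so that 2-dimensional neighbours tend to stay close in the 1-dimensional layout (this toy sketch assumes 16-bit coordinates):

```python
def morton(x, y):
    """Interleave the bits of x and y into a single Z-order index."""
    index = 0
    for bit in range(16):
        index |= ((x >> bit) & 1) << (2 * bit)      # x takes the even bits
        index |= ((y >> bit) & 1) << (2 * bit + 1)  # y takes the odd bits
    return index

# The first 2x2 block maps to the contiguous indices 0..3.
print([morton(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```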

A 2-dimensional image-based kernel, analogous to the 1-dimensional tiled convolution from before, replaces the explicit tile management with reads through a sampler.
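A sketch of such a kernel is below. This is an assumption-laden reconstruction, not the original code: the kernel and argument names (`convolve2d`, `template_`, `M`, `width`) are invented for illustration.

```c
// Sketch only: 2-D convolution reading the input through an image memory
// object. The sampler's addressing mode handles out-of-bounds coordinates.
__kernel void convolve2d(__read_only image2d_t input,
                         __global float* output,
                         __constant float* template_,
                         sampler_t sampler,
                         int M, int width)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    float acc = 0.0f;
    for (int i = -M; i <= M; ++i)
        for (int j = -M; j <= M; ++j) {
            // read_imagef always returns a float4, even for 1-channel images.
            float4 pixel = read_imagef(input, sampler, (int2)(x + j, y + i));
            acc += pixel.x * template_[(i + M) * (2 * M + 1) + (j + M)];
        }
    output[y * width + x] = acc;
}
```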

Note that this time no `__local` tiling is used, since image memory objects
reside in the `__global` memory. The caching and the opaque (hopefully
locality-friendly) layout are supposed to make up for it. And there is no
special handling of boundaries: the `sampler` takes care of that. The
host-side code creates the sampler with `clCreateSamplerWithProperties`.
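A minimal host-side sketch (context creation and error handling are omitted; the `context` variable is assumed to exist):

```c
// Sketch of host-side sampler creation (OpenCL >= 2.0).
cl_sampler_properties props[] = {
    CL_SAMPLER_NORMALIZED_COORDS, CL_FALSE,
    CL_SAMPLER_ADDRESSING_MODE,   CL_ADDRESS_CLAMP_TO_EDGE,
    CL_SAMPLER_FILTER_MODE,       CL_FILTER_NEAREST,
    0
};
cl_int err;
cl_sampler sampler = clCreateSamplerWithProperties(context, props, &err);
```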

The parameter `CL_ADDRESS_CLAMP_TO_EDGE` determines the boundary policy (when
using a version older than `OpenCL 2.0`, use the deprecated `clCreateSampler`
instead of `clCreateSamplerWithProperties`). Note the usage of `float4`, which
is there because the interface of image-memory objects always assumes 4
channels - even when just 1 is used, as in our case here.

## 5.4. Shared-Memory Bank Conflicts

If instead a `__local` memory buffer is used to cache tiles in a
multidimensional convolution, then another micro-optimization may become
relevant: **shared-memory bank conflicts**. On most architectures, sequential
words of the local memory are assigned cyclically to sequential memory banks.
This means that upon accessing the local memory, the different work-items
within a wavefront should address words that belong to different banks, or
otherwise incur latencies.

Many times bank conflicts are not a big deal and not worth dealing with, especially when there are many active work-groups, so some can work while others wait (this is called "latency hiding"). At other times, the problem can be prevented by a good choice of memory layout (e.g. Morton-order). Anyway, in the case of tiled multidimensional convolutions it is always very simple to solve, by just making sure that the tile width (which we do control) is not evenly divisible by the number of shared-memory banks.
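The effect of padding the tile width can be modeled with a few lines of Python (the 32-bank, 32-item figures below are typical but hardware-dependent assumptions): each work-item of a wavefront reads one element of a tile column, and we count how many of them land on the most popular bank.

```python
NUM_BANKS = 32  # assumption: words are assigned to banks cyclically, mod 32

def conflict_degree(tile_width, column=0, items=32):
    """Worst-case number of work-items hitting the same bank when a
    wavefront reads the elements (0, column) .. (items-1, column) of a
    row-major tile of the given width."""
    banks = [(i * tile_width + column) % NUM_BANKS for i in range(items)]
    return max(banks.count(b) for b in set(banks))

print(conflict_degree(32))  # width divisible by 32: all items hit one bank -> 32
print(conflict_degree(33))  # padded width: conflict-free -> 1
```

With a tile width of 32, every work-item in the column lands on the same bank (a 32-way conflict); padding the width to 33 spreads the accesses over all banks.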