277 lines
8 KiB
ReStructuredText
277 lines
8 KiB
ReStructuredText
|
|
Advanced Examples
|
|
==================
|
|
|
|
The power of Kompute comes in when the interface is used for complex computations. This section contains an outline of the advanced / end-to-end examples available.
|
|
|
|
Currently the advanced examples available include:
|
|
|
|
* `Machine Learning Logistic Regression Implementation <https://towardsdatascience.com/machine-learning-and-data-processing-in-the-gpu-with-vulkan-kompute-c9350e5e5d3a`_
|
|
* `Android NDK Mobile Kompute ML Application <https://towardsdatascience.com/gpu-accelerated-machine-learning-in-your-mobile-applications-using-the-android-ndk-vulkan-kompute-1e9da37b7617`_
|
|
* `Game Development Kompute ML in Godot Engine <https://towardsdatascience.com/supercharging-game-development-with-gpu-accelerated-ml-using-vulkan-kompute-the-godot-game-engine-4e75a84ea9f0`_
|
|
|
|
Below there is also a simple implementation of the logistic regression algorithm using Vulkan Kompute.
|
|
|
|
Logistic Regression Example
|
|
------------------
|
|
|
|
Logistic regression is oftens seen as the hello world in machine learning so we will be using it for our examples.
|
|
|
|
.. image:: ../images/logistic-regression.jpg
|
|
:width: 300px
|
|
|
|
In summary, we have:
|
|
|
|
* Vector `X` with input data (with a pair of inputs `Xi` and `Xj`)
|
|
* Output `Y` with expected predictions
|
|
|
|
With this we will:
|
|
|
|
* Optimize the function simplified as `Y = WX + b`
|
|
* We'll want our program to learn the parameters `W` and `b`
|
|
|
|
Converting to Kompute Terminology
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
We will have to convert this into Kompute terminology.
|
|
|
|
First specifically around the inputs, we will be using the following:
|
|
|
|
* Two vertors for the variable `X`, vector `Xi` and `Xj`
|
|
* One vector `Y` for the true predictions
|
|
* A vector `W` containing the two input weight values to use for inference
|
|
* A vector `B` containing a single input parameter for `b`
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
std::vector<float> wInVec = { 0.001, 0.001 };
|
|
std::vector<float> bInVec = { 0 };
|
|
|
|
std::shared_ptr<kp::Tensor> xI{ new kp::Tensor({ 0, 1, 1, 1, 1 })};
|
|
std::shared_ptr<kp::Tensor> xJ{ new kp::Tensor({ 0, 0, 0, 1, 1 })};
|
|
|
|
std::shared_ptr<kp::Tensor> y{ new kp::Tensor({ 0, 0, 0, 1, 1 })};
|
|
|
|
std::shared_ptr<kp::Tensor> wIn{
|
|
new kp::Tensor(wInVec, kp::Tensor::TensorTypes::eStaging)};
|
|
|
|
std::shared_ptr<kp::Tensor> bIn{
|
|
new kp::Tensor(bInVec, kp::Tensor::TensorTypes::eStaging)};
|
|
|
|
|
|
We will have the following output vectors:
|
|
|
|
* Two output vectors `Wi` and `Wj` to store all the deltas to perform gradient descent on W
|
|
* One output vector `Bout` to store all the deltas to perform gradient descent on B
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
std::shared_ptr<kp::Tensor> wOutI{ new kp::Tensor({ 0, 0, 0, 0, 0 })};
|
|
std::shared_ptr<kp::Tensor> wOutJ{ new kp::Tensor({ 0, 0, 0, 0, 0 })};
|
|
|
|
std::shared_ptr<kp::Tensor> bOut{ new kp::Tensor({ 0, 0, 0, 0, 0 })};
|
|
|
|
|
|
For simplicity we will store all the tensors inside a params variable:
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
std::vector<std::shared_ptr<kp::Tensor>> params =
|
|
{xI, xJ, y, wIn, wOutI, wOutJ, bIn, bOut};
|
|
|
|
|
|
Now that we have the inputs and outputs we will be able to use them in the processing. The workflow we will be using is the following:
|
|
|
|
1. Create a Sequence to record and submit GPU commands
|
|
2. Submit OpCreateTensor to create all the tensors
|
|
3. Record the OpAlgo with the Logistic Regresion shader
|
|
4. Loop across number of iterations:
|
|
4-a. Submit algo operation on LR shader
|
|
4-b. Re-calculate weights from loss
|
|
5. Print output weights and bias
|
|
|
|
1. Create a sequence to record and submit GPU commands
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
kp::Manager mgr;
|
|
|
|
if (std::shared_ptr<kp::Sequence> sq =
|
|
mgr.getOrCreateManagedSequence("createTensors").lock())
|
|
{
|
|
// ...
|
|
|
|
2. Submit OpCreateTensor to create all the tensors
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
{
|
|
// ... continuing from codeblock above
|
|
|
|
sq->begin();
|
|
|
|
sq->record<kp::OpCreateTensor>(params);
|
|
|
|
sq->end();
|
|
sq->eval();
|
|
|
|
|
|
3. Record the OpAlgo with the Logistic Regresion shader
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Once we re-record, all the instructions that were recorded previosuly are cleared.
|
|
|
|
Because of this we can record now the new commands which will consist of the following:
|
|
|
|
1. Copy the tensor data from local to device
|
|
2. Run the logistic regression shader
|
|
3. Copy the output data
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
{
|
|
// ... continuing from codeblock above
|
|
|
|
sq->begin();
|
|
|
|
sq->record<kp::OpTensorSyncDevice>({wIn, bIn});
|
|
|
|
sq->record<kp::OpAlgoBase<>>(
|
|
params,
|
|
false, // Whether to copy output from device
|
|
"test/shaders/glsl/test_logistic_regression.comp");
|
|
|
|
sq->record<kp::OpTensorSyncLocal>({wOutI, wOutJ, bOut});
|
|
|
|
sq->end();
|
|
|
|
4. Loop across number of iterations + 4-a. Submit algo operation on LR shader
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
{
|
|
// ... continuing from codeblock above
|
|
|
|
uint32_t ITERATIONS = 100;
|
|
|
|
for (size_t i = 0; i < ITERATIONS; i++)
|
|
{
|
|
// Run evaluation which passes data through shader once
|
|
sq->eval();
|
|
|
|
|
|
4-b. Re-calculate weights from loss
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Once the shader code is executed, we are able to use the outputs from the shader calculation.
|
|
|
|
In this case we want to basically add all the calculated weights and bias from the back-prop step.
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
{
|
|
// ...
|
|
for (size_t i = 0; i < ITERATIONS; i++)
|
|
{
|
|
// ... continuing from codeblock above
|
|
|
|
// Run evaluation which passes data through shader once
|
|
sq->eval();
|
|
|
|
// Substract the resulting weights and biases
|
|
for(size_t j = 0; j < bOut->size(); j++) {
|
|
wInVec[0] -= wOutI->data()[j];
|
|
wInVec[1] -= wOutJ->data()[j];
|
|
bInVec[0] -= bOut->data()[j];
|
|
}
|
|
// Set the data for the GPU to use in the next iteration
|
|
wIn->mapDataIntoHostMemory();
|
|
bIn->mapDataIntoHostMemory();
|
|
}
|
|
|
|
5. Print output weights and bias
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
std::cout << "Weight i: " << wIn->data()[0] << std::endl;
|
|
std::cout << "Weight j: " << wIn->data()[1] << std::endl;
|
|
std::cout << "Bias: " << bIn->data()[0] << std::endl;
|
|
|
|
|
|
Logistic Regression Compute Shader
|
|
------------------------
|
|
|
|
Finally you can see the shader used for the logistic regression usecase below:
|
|
|
|
.. code-block:: cpp
|
|
:linenos:
|
|
|
|
#version 450
|
|
|
|
layout (constant_id = 0) const uint M = 0;
|
|
|
|
layout (local_size_x = 1) in;
|
|
|
|
layout(set = 0, binding = 0) buffer bxi { float xi[]; };
|
|
layout(set = 0, binding = 1) buffer bxj { float xj[]; };
|
|
layout(set = 0, binding = 2) buffer by { float y[]; };
|
|
layout(set = 0, binding = 3) buffer bwin { float win[]; };
|
|
layout(set = 0, binding = 4) buffer bwouti { float wouti[]; };
|
|
layout(set = 0, binding = 5) buffer bwoutj { float woutj[]; };
|
|
layout(set = 0, binding = 6) buffer bbin { float bin[]; };
|
|
layout(set = 0, binding = 7) buffer bbout { float bout[]; };
|
|
|
|
float learningRate = 0.1;
|
|
float m = float(M);
|
|
|
|
float sigmoid(float z) {
|
|
return 1.0 / (1.0 + exp(-z));
|
|
}
|
|
|
|
float inference(vec2 x, vec2 w, float b) {
|
|
float z = dot(w, x) + b;
|
|
float yHat = sigmoid(z);
|
|
return yHat;
|
|
}
|
|
|
|
float calculateLoss(float yHat, float y) {
|
|
return -(y * log(yHat) + (1.0 - y) * log(1.0 - yHat));
|
|
}
|
|
|
|
void main() {
|
|
uint idx = gl_GlobalInvocationID.x;
|
|
|
|
vec2 wCurr = vec2(win[0], win[1]);
|
|
float bCurr = bin[0];
|
|
|
|
vec2 xCurr = vec2(xi[idx], xj[idx]);
|
|
float yCurr = y[idx];
|
|
|
|
float yHat = inference(xCurr, wCurr, bCurr);
|
|
float loss = calculateLoss(yHat, yCurr);
|
|
|
|
float dZ = yHat - yCurr;
|
|
vec2 dW = (1. / m) * xCurr * dZ;
|
|
float dB = (1. / m) * dZ;
|
|
wouti[idx] = learningRate * dW.x;
|
|
woutj[idx] = learningRate * dW.y;
|
|
bout[idx] = learningRate * dB;
|
|
}
|
|
|
|
|
|
|
|
|