llama-cpp-turboquant/docs/overview/advanced-examples.rst
2021-03-01 07:58:12 +00:00


.. role:: raw-html-m2r(raw)
   :format: html

C++ Examples
=================
Kompute shows its strengths when the interface is used for complex computations. This section outlines the advanced and end-to-end examples available.
Simple examples
^^^^^^^^^^^^^^^
* `Pass shader as raw string <#simple-shader-example>`_
* `Record batch commands with a Kompute Sequence <#record-batch-commands>`_
* `Run Asynchronous Operations <#asynchronous-operations>`_
* `Run Parallel Operations Across Multiple GPU Queues <#parallel-operations>`_
* `Create your custom Kompute Operations <#your-custom-kompute-operation>`_
* `Implementing logistic regression from scratch <#logistic-regression-example>`_
End-to-end examples
^^^^^^^^^^^^^^^^^^^
* `Machine Learning Logistic Regression Implementation <https://towardsdatascience.com/machine-learning-and-data-processing-in-the-gpu-with-vulkan-kompute-c9350e5e5d3a>`_
* `Parallelizing GPU-intensive Workloads via Multi-Queue Operations <https://towardsdatascience.com/parallelizing-heavy-gpu-workloads-via-multi-queue-operations-50a38b15a1dc>`_
* `Android NDK Mobile Kompute ML Application <https://towardsdatascience.com/gpu-accelerated-machine-learning-in-your-mobile-applications-using-the-android-ndk-vulkan-kompute-1e9da37b7617>`_
* `Game Development Kompute ML in Godot Engine <https://towardsdatascience.com/supercharging-game-development-with-gpu-accelerated-ml-using-vulkan-kompute-the-godot-game-engine-4e75a84ea9f0>`_
Asynchronous Operations
~~~~~~~~~~~~~~~~~~~~~~~
You can submit operations asynchronously with the async/await commands in ``kp::Manager`` and ``kp::Sequence``, which give fine-grained control over when to wait on the underlying ``vk::Fence``.

Back to `examples list <#simple-examples>`_.

.. code-block:: cpp
   :linenos:

   #include <iostream>
   #include <string>
   #include <vector>

   #include <fmt/ranges.h> // Needed by fmt::format to print a std::vector
   #include "kompute/Kompute.hpp"

   int main() {

       // You can allow Kompute to create the Vulkan components, or pass your existing ones
       kp::Manager mgr; // Selects device 0 unless explicitly requested

       // Creates a tensor of 10 zeros and initializes GPU memory
       auto tensor = mgr.tensor(std::vector<float>(10, 0.0));

       // Define your shader as a string (using string literals for simplicity)
       // (You can also pass the raw compiled bytes, or even a path to a file)
       std::string shader(R"(
           #version 450

           layout (local_size_x = 1) in;

           layout(set = 0, binding = 0) buffer b { float pb[]; };

           shared uint sharedTotal[1];

           void main() {
               uint index = gl_GlobalInvocationID.x;

               sharedTotal[0] = 0;

               // Iterating to simulate a longer process
               for (int i = 0; i < 100000000; i++)
               {
                   atomicAdd(sharedTotal[0], 1);
               }

               pb[index] = sharedTotal[0];
           }
       )");

       std::vector<uint32_t> spirv = kp::Shader::compile_source(shader);

       // mgr.sequence() returns a shared pointer, so members are accessed with ->
       auto sq = mgr.sequence();

       sq->eval<kp::OpTensorSyncDevice>({tensor});

       // Submit the dispatch without blocking the calling thread
       sq->evalAsync<kp::OpAlgoDispatch>(mgr.algorithm({tensor}, spirv));

       // When we're ready we can wait
       // The default wait time is UINT64_MAX
       sq->evalAwait(10000);

       // Sync the GPU memory back to the local tensor
       // We can still run synchronous jobs in our created sequence
       sq->eval<kp::OpTensorSyncLocal>({ tensor });

       // Prints the output: B: { 100000000, ... }
       std::cout << fmt::format("B: {}",
           tensor->data()) << std::endl;
   }

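The submit-then-wait pattern above can be sketched without a GPU. The snippet below is only a CPU-side analogy in standard C++, not part of the Kompute API: ``slowJob`` and ``waitWithTimeout`` are hypothetical names, and a ``std::future`` stands in for the fence that ``evalAwait`` waits on. It shows that a bounded wait can return before the work has finished, in which case the caller has to wait again before reading results.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a long-running GPU dispatch:
// sleeps for `duration`, then returns `result`.
inline int slowJob(std::chrono::milliseconds duration, int result) {
    std::this_thread::sleep_for(duration);
    return result;
}

// Analogy for a bounded fence wait: blocks for at most `timeout` and
// returns true only if the job has finished by then. The job keeps
// running if the timeout expires first.
inline bool waitWithTimeout(std::future<int>& job,
                            std::chrono::milliseconds timeout) {
    return job.wait_for(timeout) == std::future_status::ready;
}
```

A caller would start the job with ``std::async(std::launch::async, slowJob, ...)``, do other work, and then call ``waitWithTimeout`` until it returns ``true``. Note that Vulkan fence waits take their timeout in nanoseconds, whereas this sketch uses milliseconds for readability.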
Parallel Operations
~~~~~~~~~~~~~~~~~~~
Besides being able to submit asynchronous operations, you can also leverage the underlying GPU compute queues to process operations in parallel.
How much parallelism you get depends on your graphics card. For example, on NVIDIA graphics cards, operations submitted across queues within one family do not run in parallel, but operations submitted across different queue families can.
Below we show how you can parallelize operations on an `NVIDIA 1650 <http://vulkan.gpuinfo.org/displayreport.php?id=9700#queuefamilies>`_, which has a ``GRAPHICS+COMPUTE`` family at index ``0`` and a ``COMPUTE`` family at index ``2``.
Back to `examples list <#simple-examples>`_.

.. code-block:: cpp
   :linenos:

   #include <iostream>
   #include <memory>
   #include <string>
   #include <vector>

   #include <fmt/format.h>
   #include "kompute/Kompute.hpp"

   int main() {

       // In this case we select device 0, and for queues, one queue from familyIndex 0
       // and one queue from familyIndex 2
       uint32_t deviceIndex(0);
       std::vector<uint32_t> familyIndices = {0, 2};

       // We create a manager with device index, and queues by queue family index
       kp::Manager mgr(deviceIndex, familyIndices);

       // Creates two tensors of 10 zeros each and initializes GPU memory
       auto tensorA = mgr.tensor(std::vector<float>(10, 0.0));
       auto tensorB = mgr.tensor(std::vector<float>(10, 0.0));

       // Copies the data into GPU memory
       mgr.sequence()->eval<kp::OpTensorSyncDevice>({tensorA, tensorB});

       // Define your shader as a string (using string literals for simplicity)
       // (You can also pass the raw compiled bytes, or even a path to a file)
       std::string shader(R"(
           #version 450

           layout (local_size_x = 1) in;

           layout(set = 0, binding = 0) buffer b { float pb[]; };

           shared uint sharedTotal[1];

           void main() {
               uint index = gl_GlobalInvocationID.x;

               sharedTotal[0] = 0;

               // Iterating to simulate a longer process
               for (int i = 0; i < 100000000; i++)
               {
                   atomicAdd(sharedTotal[0], 1);
               }

               pb[index] = sharedTotal[0];
           }
       )");

       std::vector<uint32_t> spirv = kp::Shader::compile_source(shader);

       // The shader only writes to binding 0, so each tensor gets its own algorithm
       std::shared_ptr<kp::Algorithm> algoA = mgr.algorithm({tensorA}, spirv);
       std::shared_ptr<kp::Algorithm> algoB = mgr.algorithm({tensorB}, spirv);

       // We need to create explicit sequences with their respective queues.
       // The parameter is an index into the familyIndices vector that
       // the manager was created with, not the queue family index itself.
       auto sqOne = mgr.sequence(0);
       auto sqTwo = mgr.sequence(1);

       // Run the first parallel operation in the `sqOne` sequence
       sqOne->evalAsync<kp::OpAlgoDispatch>(algoA);

       // Run the second parallel operation in the `sqTwo` sequence
       sqTwo->evalAsync<kp::OpAlgoDispatch>(algoB);

       // Here we can do other work

       // We can now wait for the two parallel tasks to finish
       sqOne->evalAwait();
       sqTwo->evalAwait();

       // Sync the GPU memory back to the local tensor
       mgr.sequence()->eval<kp::OpTensorSyncLocal>({ tensorA, tensorB });

       // Prints the output: A: 100000000 B: 100000000
       std::cout << fmt::format("A: {}, B: {}",
           tensorA->data()[0], tensorB->data()[0]) << std::endl;
   }

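The family indices ``{0, 2}`` above were read off a device report by hand. As a rough sketch of how such indices could be derived from capability flags, the snippet below filters a list of queue families for compute support. ``QueueFamily`` and ``computeFamilyIndices`` are hypothetical names, and the struct is a simplified stand-in for ``VkQueueFamilyProperties``; only the flag bit values are taken from the Vulkan specification.

```cpp
#include <cstdint>
#include <vector>

// Bit values match VkQueueFlagBits in the Vulkan specification.
constexpr uint32_t kGraphicsBit = 0x1; // VK_QUEUE_GRAPHICS_BIT
constexpr uint32_t kComputeBit  = 0x2; // VK_QUEUE_COMPUTE_BIT
constexpr uint32_t kTransferBit = 0x4; // VK_QUEUE_TRANSFER_BIT

// Hypothetical, simplified stand-in for VkQueueFamilyProperties.
struct QueueFamily {
    uint32_t flags;      // Bitmask of capability bits
    uint32_t queueCount; // Number of queues the family exposes
};

// Returns the index of every family that supports compute and
// exposes at least one queue, in ascending order.
inline std::vector<uint32_t> computeFamilyIndices(
    const std::vector<QueueFamily>& families) {
    std::vector<uint32_t> indices;
    for (uint32_t i = 0; i < families.size(); ++i) {
        const bool hasCompute = (families[i].flags & kComputeBit) != 0;
        if (hasCompute && families[i].queueCount > 0) {
            indices.push_back(i);
        }
    }
    return indices;
}
```

For a device laid out as described above (family 0 with graphics and compute, family 2 with compute, and an assumed non-compute family in between), this returns ``{0, 2}``, which could then be passed to the ``kp::Manager`` constructor as ``familyIndices``.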
Your Custom Kompute Operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Build your own pre-compiled operations for domain-specific workflows. Back to `examples list <#simple-examples>`_.
We also provide tools that allow you to `convert shaders into C++ headers <https://github.com/EthicalML/vulkan-kompute/blob/master/scripts/convert_shaders.py#L40>`_.

.. code-block:: cpp
   :linenos:

   #include <iostream>
   #include <memory>
   #include <stdexcept>
   #include <string>
   #include <vector>

   #include <fmt/ranges.h> // Needed by fmt::format to print a std::vector
   #include "kompute/Kompute.hpp"

   class OpMyCustom : public kp::OpAlgoDispatch
   {
     public:
       OpMyCustom(std::vector<std::shared_ptr<kp::Tensor>> tensors,
                  std::shared_ptr<kp::Algorithm> algorithm)
         : kp::OpAlgoDispatch(algorithm)
       {
           if (tensors.size() != 3) {
               // std::to_string avoids pointer arithmetic on the string literal
               throw std::runtime_error(
                   "Kompute OpMyCustom expected 3 tensors but got " +
                   std::to_string(tensors.size()));
           }

           std::vector<uint32_t> spirv = kp::Shader::compile_source(R"(
               #version 450

               layout(set = 0, binding = 0) buffer tensorLhs {
                   float valuesLhs[ ];
               };

               layout(set = 0, binding = 1) buffer tensorRhs {
                   float valuesRhs[ ];
               };

               layout(set = 0, binding = 2) buffer tensorOutput {
                   float valuesOutput[ ];
               };

               layout (constant_id = 0) const uint LEN_LHS = 0;
               layout (constant_id = 1) const uint LEN_RHS = 0;
               layout (constant_id = 2) const uint LEN_OUT = 0;

               layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

               void main()
               {
                   uint index = gl_GlobalInvocationID.x;

                   valuesOutput[index] = valuesLhs[index] * valuesRhs[index];
               }
           )");

           // Rebuild the algorithm with the tensors and the compiled shader
           algorithm->rebuild(tensors, spirv);
       }
   };

   int main() {

       kp::Manager mgr; // Automatically selects Device 0

       // Create 3 tensors of default type float
       auto tensorLhs = mgr.tensor({ 0., 1., 2. });
       auto tensorRhs = mgr.tensor({ 2., 4., 6. });
       auto tensorOut = mgr.tensor({ 0., 0., 0. });

       // OpMyCustom is defined above, outside the kp namespace
       mgr.sequence()
           ->record<kp::OpTensorSyncDevice>({tensorLhs, tensorRhs, tensorOut})
           ->record<OpMyCustom>({tensorLhs, tensorRhs, tensorOut}, mgr.algorithm())
           ->record<kp::OpTensorSyncLocal>({tensorLhs, tensorRhs, tensorOut})
           ->eval();

       // Prints the output which is { 0, 4, 12 }
       std::cout << fmt::format("Output: {}", tensorOut->data()) << std::endl;
   }

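When writing a custom operation it can help to keep a CPU reference of what the shader is meant to compute, so the GPU result can be compared against it. The sketch below mirrors the element-wise multiply in the shader above using only the standard library. ``multiplyReference`` is a hypothetical helper name and is not part of Kompute.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// CPU reference for the shader's element-wise multiply:
// out[i] = lhs[i] * rhs[i] for every index i.
inline std::vector<float> multiplyReference(const std::vector<float>& lhs,
                                            const std::vector<float>& rhs) {
    if (lhs.size() != rhs.size()) {
        throw std::invalid_argument("multiplyReference: size mismatch");
    }
    std::vector<float> out(lhs.size());
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        out[i] = lhs[i] * rhs[i];
    }
    return out;
}
```

With the inputs from the example, ``multiplyReference({0, 1, 2}, {2, 4, 6})`` yields ``{0, 4, 12}``, matching the output the GPU version prints.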