In this project, we present an interactive flocking particle simulation in which boids assemble into point clouds generated by a text-to-3D model, rendered with GPU-accelerated ray marching and ray tracing. Our project was built from scratch on top of a barebones OpenCL sample.
There are two versions of the project with different focuses. One has the full 4096 boids with very basic ray tracing and is optimized for frame rate, while the other has only 512 boids and implements ray marching with smoothing as well as ray tracing with light blocking and soft shadows.
Our ray tracing and smoothed ray marching algorithms support large point clouds and interactive camera movement. Furthermore, we implemented a pipeline using pre-trained machine learning models to generate 3D point clouds from text. To accelerate our rendering, we utilized a wide array of optimization techniques, including GPU acceleration via the OpenCL framework, BVH trees, and Morton codes for Z-order space-filling curves. We also implemented a boid flocking algorithm with a force toward our pre-generated point clouds, which smoothly interpolates boid positions and colors. With 4096 boids, we achieve 60 FPS, compared to several seconds per frame without acceleration.
Our approach is shown in the figure above. To collect point clouds, we:
The last step above resulted in point clouds of highly variable quality, so we wrote a script to streamline the manual filtering of these outputs. Out of around 150 randomly-sampled outputs, 40 were selected for use in our end product, and 16 are available to view in the application.
Our flocking model incorporates the forces listed above. Each force has an associated radius: the force between two boids applies only when they are within that maximum $\ell_2$ distance. Each force also has a corresponding coefficient. These coefficients needed to be tuned quite carefully to achieve reasonable flocking behaviors: for instance, separation requires a specific value so that the boids do not suffer from mode collapse, and we introduced a weak “centering” force toward the origin to counteract the “containing” force's tendency to trap boids on the surface of a sphere. Our “arrival” force targets the $(x,y,z)$ points generated by our point cloud procedure; the assignment of boids to targets is randomized by default to enable more cohesive transitions.
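As a concrete illustration, here is a minimal sketch of how one radius-gated force (separation) might look inside the acceleration kernel. `SEP_RADIUS` and `SEP_COEFF` are hypothetical stand-ins for our tuned constants, not the actual values.

```c
// Minimal sketch of a radius-gated flocking force (separation) in OpenCL C.
// SEP_RADIUS and SEP_COEFF are hypothetical stand-ins for the tuned constants.
#define SEP_RADIUS 2.0f
#define SEP_COEFF  1.5f

__kernel void compute_separation(__global const float3 *pos,
                                 __global float3 *accel,
                                 const int n) {
    int i = get_global_id(0);
    float3 force = (float3)(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float3 diff = pos[i] - pos[j];
        float d = length(diff);             // l2 distance between boids
        if (d > 0.0f && d < SEP_RADIUS)     // force applies only within its radius
            force += diff / (d * d);        // repel, stronger when closer
    }
    accel[i] = SEP_COEFF * force;           // scale by the force's coefficient
}
```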
We originally implemented our flocking algorithm in Unity. When we ported the code to an OpenCL kernel, we found that the units were different, and every coefficient needed to be retuned to work well with the rest of the codebase. We also ran into trouble reading Point-E's .npy outputs in C++ due to gaps in the cnpy documentation, and ended up handling the parsing quirks manually after reading in the arrays.
Our particle simulation is implemented using forward Euler integration. We found that it was sufficiently numerically stable and had a small memory footprint.
Particle colors are updated by linear interpolation, with weights determined by the timestep. That is, $\mathbf c(t+1) =(1 - \Delta t) \mathbf c(t) + \Delta t \mathbf c_{\text{target}}(t)$, where $\mathbf c = (r, g, b)$ denotes a particle's color as a function of time and $\mathbf c_{\text{target}}$ denotes the target color given by Point-E.
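A sketch of what the corresponding per-boid update might look like in the integration kernel; buffer and parameter names are illustrative, not our exact interface.

```c
// Sketch of the per-boid forward Euler step plus the color interpolation above.
// Buffer and parameter names are illustrative, not our exact interface.
__kernel void integrate(__global float3 *pos, __global float3 *vel,
                        __global const float3 *accel,
                        __global float3 *col, __global const float3 *target_col,
                        const float dt) {
    int i = get_global_id(0);
    vel[i] += dt * accel[i];                             // v <- v + dt * a
    pos[i] += dt * vel[i];                               // x <- x + dt * v
    col[i] = (1.0f - dt) * col[i] + dt * target_col[i];  // lerp toward Point-E color
}
```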
Our ray marching algorithm was originally built on top of Homework 3 and tested in that codebase. We also added support for moving the camera and re-rendering continuously. However, the result was very laggy and rendered images slowly; simply streaming camera inputs is insufficient to render the whole scene smoothly in real time. Thus, we chose to accelerate rendering on the GPU using OpenCL, which we discuss in the following section.
Below is a video of our original approach.
We implemented smoothing with ray marching, such that two nearby spheres can merge and smooth out their collective surface. There are a wide variety of available smooth-min functions; we implemented the following approach. Let $d_1$ denote the distance from the current point along the ray to the first sphere and $d_2$ the distance to the second sphere, and let $b = 10^{-3}$ be a buffer constant. Then ray marching uses a distance defined by $$\min(d_1, d_2) - \frac b6 \left(\frac{\max(b - |d_1 - d_2|, 0)}{b}\right)^3$$
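In OpenCL C this smooth-min reduces to a few lines. A sketch; `smin` is our naming here, not necessarily the name used in our kernel.

```c
// Cubic polynomial smooth-min matching the formula above; b is the buffer
// constant (~1e-3). h goes from 1 when d1 == d2 down to 0 when |d1 - d2| >= b,
// so the blend only activates near the seam between two spheres.
float smin(float d1, float d2, float b) {
    float h = fmax(b - fabs(d1 - d2), 0.0f) / b;
    return fmin(d1, d2) - (b / 6.0f) * h * h * h;
}
```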
Computing normal vectors for smoothed surfaces relies on the signed distance field (SDF), defined by $f(\mathbf p) = \min_{\text{sphere}\ s = (\mathbf o, r)} \| \mathbf p - \mathbf o \| - r$ (that is, the distance to the nearest point on any sphere). We then estimate the gradient of $f$ at $\mathbf p$ by sampling along the 6 positive and negative axis directions, and use these finite differences as the components of the normal vector on the smoothed surface at $\mathbf p$.
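A sketch of this normal estimate via central differences, assuming a `scene_sdf` helper that returns the smooth-min distance from a point to all spheres:

```c
float scene_sdf(float3 p); // assumed: smooth-min distance from p to the scene

// Normal of the smoothed surface from central differences of the SDF,
// sampling the 6 positive and negative axis directions around p.
float3 sdf_normal(float3 p) {
    const float e = 1e-4f; // finite-difference step size
    float3 grad = (float3)(
        scene_sdf(p + (float3)(e, 0.0f, 0.0f)) - scene_sdf(p - (float3)(e, 0.0f, 0.0f)),
        scene_sdf(p + (float3)(0.0f, e, 0.0f)) - scene_sdf(p - (float3)(0.0f, e, 0.0f)),
        scene_sdf(p + (float3)(0.0f, 0.0f, e)) - scene_sdf(p - (float3)(0.0f, 0.0f, e)));
    return normalize(grad); // the SDF gradient points outward from the surface
}
```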
We also implemented traditional ray tracing using OpenCL as a faster rendering option. We based our implementation on Homework 3 and translated it into an OpenCL kernel with the specific task of rendering multiple colored, diffuse spheres lit by a single area light, with multiple sample rays and multiple direct lighting samples per pixel. Due to OpenCL limitations and a lack of time, we were only able to implement zero-bounce and direct lighting radiance. We were, however, able to implement light blocking, which yields some nice soft shadow effects. Unfortunately, in order to maintain a smooth frame rate, a large amount of noise is present in our final renders.
Due to the low frame rate of CPU ray tracing, we decided to accelerate our rendering using OpenCL. Since Apple dropped support for OpenGL, we were unable to use the built-in OpenGL compute shaders for our ray tracing kernel. After surveying options such as Vulkan, Metal, and OpenCL, we ultimately decided on OpenCL due to its platform portability and its mix of ease of use and low-level optimization capabilities. We used OpenCL in this project to parallelize the flocking algorithm and all three of our rendering implementations.
Our flocking algorithm was split into two kernels: one that calculates accelerations based on the positions of other boids, and one that integrates the accelerations to update positions and velocities. These kernels are executed once per particle on the GPU.
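On the host side, the per-frame dispatch might look roughly like the sketch below. `queue`, `accel_kernel`, `integrate_kernel`, and `NUM_BOIDS` are assumed to be set up during OpenCL initialization, and we assume a default in-order command queue.

```c
// Sketch of the per-frame dispatch for the two flocking kernels (host side).
// queue, accel_kernel, integrate_kernel, and NUM_BOIDS are assumed to be
// created elsewhere during OpenCL initialization.
size_t global_size = NUM_BOIDS; // one work item per boid
clEnqueueNDRangeKernel(queue, accel_kernel, 1, NULL,
                       &global_size, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, integrate_kernel, 1, NULL,
                       &global_size, NULL, 0, NULL, NULL);
// with an in-order queue, integration sees the accelerations
// written by the first kernel
```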
In our ray tracing and ray marching kernels, we randomly generate rays from each pixel of the camera and check for intersections with each sphere in the scene. If we detect a collision, we cast extra rays from that collision point, randomly sampling the area light to determine the radiance for that pixel. This runs in parallel on the GPU with one thread per pixel. For ray marching, we had to iterate our position based on the signed distance field to find intersections, which significantly increased rendering time but allowed us to implement smoothing without having to modify meshes.
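The iteration for ray marching is the standard sphere-tracing loop. A sketch, reusing the `scene_sdf` helper assumed above; `MAX_STEPS`, `MAX_DIST`, and `EPS` are assumed tuning constants.

```c
// Sketch of the sphere-tracing loop that finds the ray's intersection with the
// smoothed SDF surface. MAX_STEPS, MAX_DIST, and EPS are assumed constants.
float march(float3 origin, float3 dir) {
    float t = 0.0f;
    for (int i = 0; i < MAX_STEPS; ++i) {
        float d = scene_sdf(origin + t * dir); // distance to the nearest surface
        if (d < EPS) return t;                 // close enough: report a hit at t
        t += d;                                // safe step: cannot pass a surface
        if (t > MAX_DIST) break;               // ray has left the scene
    }
    return -1.0f;                              // miss
}
```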
We ran into a number of challenges when using OpenCL. None of us had GPU experience on this level or any experience with OpenCL, so we had to learn the OpenCL environment from scratch and navigate GPU-specific concepts such as work group thread synchronization through barrier calls, command queues, and GPU buffers.
Apple dropped support for OpenCL in favor of their proprietary Metal API, leading to some incredibly difficult-to-trace bugs. After 4 hours of debugging our first draft of the ray tracing compute shader, we realized that functions with pointer arguments had unpredictable behavior, and we were able to proceed with our implementation by limiting the use of helper functions.
Despite reducing the number of operations by an order of magnitude, we still needed a number of optimizations to raise the frame rate to an acceptable level for a stable simulation and a comfortable user experience. These optimizations include utilizing local memory and work groups on the GPU, choosing an optimal work group size, and using an acceleration structure. They are benchmarked in the tables in the Results section.
Our work group optimizations were generally based on finding a spatial sorting of the points, using a permutation index array to group together spatially local points and minimize the bounding box of each group. This approach is very similar to a standard BVH; however, we chose to uniformly fix the size of each leaf to match the work group size and only generate bounding boxes for the leaves, instead of traversing the BVH hierarchy from the root.
Then, for the intersection check in our ray tracing kernel, each thread first intersects its ray with the bounding box of the current BVH leaf. If all rays miss the bounding box, all threads in the work group skip to the next BVH leaf. In this way, we take advantage of the fact that rays within a work group have similar trajectories, and avoid expensive, unnecessary accesses to global memory.
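One way to express this cooperative skip under OpenCL 1.2 is a shared local flag with barriers. A sketch of the per-leaf loop inside the kernel; `ray_hits_aabb`, `leaf_lo`, and `leaf_hi` are assumed names.

```c
// Sketch of the cooperative leaf skip inside the ray tracing kernel.
// ray_hits_aabb, leaf_lo, and leaf_hi are assumed names.
__local int any_hit;                                // shared by the work group
for (int leaf = 0; leaf < num_leaves; ++leaf) {
    if (get_local_id(0) == 0 && get_local_id(1) == 0)
        any_hit = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (ray_hits_aabb(ray, leaf_lo[leaf], leaf_hi[leaf]))
        any_hit = 1;                                // benign race: any hit sets it
    barrier(CLK_LOCAL_MEM_FENCE);
    if (!any_hit)
        continue;                                   // every thread skips this leaf
    // ... intersect against the spheres in this leaf ...
}
```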
We tested two different acceleration structures to generate the BVH leaves used in our acceleration algorithm.
Below are demonstrations of our two approaches (this mode is available with a keyboard toggle), where each color represents a leaf node.
OpenCL C is based on a restricted subset of C99 and supports neither passing arguments by reference nor recursion, which significantly limited our implementation of ray tracing, an inherently recursive algorithm. Combined with the unreliability of passing pointers into functions and the low frame rate of our zero-bounce results, we were unable to implement bounces or any more advanced ray tracing techniques.
OpenCL also does not have a built-in random function, so we used a #define macro and some bit tricks found online to build a custom pseudorandom number generator that can produce multiple random numbers in a row within a thread, and that maintains randomness between kernel dispatches by saving the random seed to a buffer between frames.
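A sketch of this pattern, using the common 32-bit xorshift shifts as the "bit tricks" (the exact constants we used may differ). Seeds must be initialized to nonzero values, and each thread's state is persisted in a buffer across dispatches.

```c
// Sketch of a macro-based xorshift PRNG; the exact constants we used may
// differ. Each call advances the state and yields a new 32-bit value.
#define RAND_XORSHIFT(state) ( (state) ^= (state) << 13, \
                               (state) ^= (state) >> 17, \
                               (state) ^= (state) << 5,  \
                               (state) )

__kernel void example(__global uint *seeds) {
    int i = get_global_id(0);
    uint state = seeds[i];                           // restore last frame's state
    // low 24 bits mapped into [0, 1); multiple draws per thread work fine
    float r1 = (RAND_XORSHIFT(state) & 0xFFFFFF) / 16777216.0f;
    float r2 = (RAND_XORSHIFT(state) & 0xFFFFFF) / 16777216.0f;
    // ... use r1, r2 for ray / light sampling ...
    seeds[i] = state;                                // persist for the next dispatch
}
```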
We began with starter code that rendered fractals to a window and switched between fractals using keyboard interactions. Accordingly, we needed to build some functionality to adapt our project to this environment, including:
All of our experiments use a pixel resolution of 800 x 600. All FPS and timing data are averages over 500 frames.
Our experiments use 4096 boids, 1 ray per pixel, and 1 light sample per pixel.
We observe that simulation time is generally negligible compared to rendering time for higher quality renders, and that ray marching with smoothing is roughly 7 times slower than traditional ray tracing with light blocking.
| Renderer | FPS | Ray casting time (ms) | Simulation time (ms) |
| --- | --- | --- | --- |
| Ray tracing (light blocking) | 33.4 | 24.4 | 3.9 |
| Ray marching | 6.1 | 174.3 | 4.0 |
| Ray tracing (no light blocking) | 81.9 | 5.4 | 4.3 |
Our experiments use 4096 boids, 1 ray per pixel, and 1 light sample per pixel.
With a larger work group size, we can better exploit local memory speedups; however, the bounding boxes become larger, so there is a smaller chance to skip BVH leaves. After experimenting with work group size, we found that an 8x8 work group offered the best performance when combined with the BVH, giving us 64 leaves with 64 spheres each. Compared to the 16x16 maximum work group size offered by the Apple M1 GPU, this was a roughly 25% speedup.
| Work group size | FPS | Rendering time (ms) |
| --- | --- | --- |
| 16 x 16 | 37 | 20.2 |
| 8 x 8 | 44 | 15.6 |
Our experiments use ray tracing with light blocking, 4096 boids, 1 ray per pixel, 4 light samples per pixel, and a 16x16 work group size.
We implemented a basic BVH of depth 4 with binary splits along the longest axis of each bounding box. This gave us 16 leaves of 256 spheres each, matching the size of each work group. We also tried sorting the points along the Z-order curve, a space-filling curve, so that points within each contiguous chunk of 256 should be close to each other.
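The Z-order key is a Morton code obtained by interleaving the bits of the quantized coordinates. A sketch in OpenCL C style; quantizing to 10 bits per axis and pre-scaling positions into the unit cube are assumptions for illustration.

```c
// Sketch of the Morton-code computation behind the Z-order sort.
// expand_bits spreads the low 10 bits of x so they occupy every third bit.
uint expand_bits(uint x) {
    x &= 0x000003FFu;                  // keep 10 bits per axis (assumed)
    x = (x | (x << 16)) & 0xFF0000FFu;
    x = (x | (x <<  8)) & 0x0300F00Fu;
    x = (x | (x <<  4)) & 0x030C30C3u;
    x = (x | (x <<  2)) & 0x09249249u;
    return x;
}

// p is assumed pre-scaled into [0, 1]^3; yields a 30-bit interleaved code.
// Sorting boids by this key groups spatially nearby points into chunks.
uint morton3d(float3 p) {
    uint x = (uint)clamp(p.x * 1024.0f, 0.0f, 1023.0f);
    uint y = (uint)clamp(p.y * 1024.0f, 0.0f, 1023.0f);
    uint z = (uint)clamp(p.z * 1024.0f, 0.0f, 1023.0f);
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}
```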
While the Z-order sort has a considerably lower construction time, it is about 50% slower to render than the basic BVH, likely because discontinuities in our arbitrary chunking cause unnecessarily large bounding boxes.
This optimization gave us a 4x speedup with the BVH approach.
| Acceleration structure | FPS | Construction time (ms) | Rendering time (ms) | Simulation time (ms) |
| --- | --- | --- | --- | --- |
| None | 13.4 | 0 | 67.1 | 3.9 |
| BVH | 42.6 | 0.94 | 16.9 | 4.0 |
| Z-order sort | 33 | 0.35 | 24 | 4.0 |
Our experiments use ray tracing without light blocking, 4096 boids, 4 rays per pixel, and 4 light samples per pixel.
Memory from the host program on the CPU must be uploaded to the GPU's global memory, but each work group on the GPU has access to a faster local memory shared within the work group. We can take advantage of this by having each thread fetch data from global memory into local memory in parallel, and then running all required $O(n)$ operations on local memory instead. Between these fetches, we must also call barrier() to ensure thread synchronization within the work group.
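A sketch of this tiling pattern inside the intersection loop. `spheres`, `ray`, `closest`, and `intersect_sphere` are assumed names, `GROUP_SIZE` is assumed to equal the work group size, and the sphere count is assumed to be a multiple of it (true for 4096 boids).

```c
// Sketch of the cooperative tiling pattern: each thread copies one sphere from
// global to local memory, then the whole work group intersects the local tile.
#define GROUP_SIZE 64  // assumed: matches the work group size (e.g. 8x8)

__local float4 tile[GROUP_SIZE];                // xyz = center, w = radius
int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
for (int base = 0; base < num_spheres; base += GROUP_SIZE) {
    tile[lid] = spheres[base + lid];            // parallel fetch from global memory
    barrier(CLK_LOCAL_MEM_FENCE);               // wait until the tile is complete
    for (int j = 0; j < GROUP_SIZE; ++j)
        // assumed helper returning the updated nearest hit; value semantics
        // sidestep the pointer-argument bug described above
        closest = intersect_sphere(ray, tile[j], closest);
    barrier(CLK_LOCAL_MEM_FENCE);               // don't overwrite a tile in use
}
```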
This optimization gave us about a 33% speedup.
| Memory | FPS | Rendering time (ms) | Simulation time (ms) |
| --- | --- | --- | --- |
| Global memory | 34.3 | 22.2 | 4.0 |
| Local memory | 46 | 14.9 | 4.0 |
External resources were useful for the following: