ARTEMIS: Accelerated Ray Tracing and Enhanced Marching for Integrated flocking with point cloudS

Ralph Cao

3037721429

ralph_cow@berkeley.edu

Arjun Damerla

3036679871

arjundamerla@berkeley.edu

Preston Fu

3036364629

prestonfu@berkeley.edu

Julia Sun

3037171474

jsun28@berkeley.edu


CS 184, Computer Graphics and Imaging

Code | Slides | Showcase video

Abstract

In this project, we present an interactive particle simulation in which flocking particles assemble into point clouds generated by a text-to-3D model, rendered with GPU-accelerated ray marching and ray tracing. We built the project up from a barebones OpenCL sample.

There are two versions of the project with different focuses. One runs the full 4096 boids with basic ray tracing and is optimized for frame rate; the other runs 512 boids and implements ray marching with smoothing, as well as ray tracing with light blocking and soft shadows.

Our ray tracing and smoothed ray marching algorithms support large point clouds and interactive camera movement. Furthermore, we implemented a pipeline that uses pre-trained machine learning models to generate 3D point clouds from text. To accelerate rendering, we used a wide array of optimization techniques, including GPU acceleration through the OpenCL framework, bounding volume hierarchies (BVHs), and Morton codes for Z-order space-filling curves. We also implemented a boid flocking algorithm with a force toward our pre-generated point clouds, which smoothly interpolates boid positions and colors. With 4096 boids, we achieve 60 FPS, compared to several seconds per frame without acceleration.

Point Clouds


Our approach is shown in the figure above. To collect point clouds, we:

  1. Generate text prompts with an LLM. We ran two prompts through Claude 3 independently to generate lists of prompts: one for real-life objects and well-known characters for which we would be able to find images on Google, and one for easily recognizable characters and objects performing whimsical actions.
  2. For the first list, we scraped images off Google, partly manually and partly with the SerpApi. For the second list, we ran the prompts through DALL-E 3. In each case, we specified that the image should have a white background.
  3. Preprocess the images: remove backgrounds with a publicly available background remover, then resize and center the images.
  4. Feed the resulting images to Point-E, which generates a 3D point cloud. We experimented with all combinations of {preprocessed, not preprocessed} and {300M, 1B} model size, each yielding a size-1024 point cloud of $(x, y, z, r, g, b)$ tuples. We then ran a pre-trained upsampler to obtain a size-4096 point cloud of the same format. The model weights are publicly available, and we ran the scripts on one V100 using Google Compute Engine. The total inference runtime for 654 images was around 6 hours. (Our original hope was that a user could enter text and have a point cloud within seconds, but this inference procedure is the bottleneck in both time and cost.)
    In general, we find that skipping the preprocessing step consistently introduces noise or artifacts near the edges and corners of the generated point clouds. We do not see a strong difference in performance between the 300M- and 1B-parameter models for contiguous objects; the common failure modes of extraneous black pixels and geometrically flat outputs appear consistently in both. For scenes containing multiple disjoint objects, the 1B model generally performed better.

The last step above produced point clouds of highly variable quality, so we wrote a script to streamline manual filtering of the outputs. Out of around 150 randomly sampled outputs, 40 were selected for use in our end product, and 16 are available to view in the application.

Flocking


Our flocking model incorporates the forces listed above. Each force has an associated radius: the force between two boids applies only when they are within that maximum $\ell_2$ distance. Each force also has a corresponding coefficient, and these coefficients needed to be tuned quite carefully to achieve reasonable flocking behavior. For instance, separation must be strong enough that the boids do not collapse into a single cluster, and we introduced a weak “centering” force toward the origin to counteract the “containing” force’s tendency to trap boids on the surface of a sphere. Our “arrival” force uses the $(x, y, z)$ points generated in our point cloud procedure as targets; these target assignments are randomized by default to enable more cohesive transitions.
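As a concrete illustration, below is a minimal C++ sketch of the per-boid force accumulation. The coefficients, radii, and names are illustrative placeholders rather than our tuned values, and only the separation, arrival, and centering terms are shown.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  add(Vec3 a, Vec3 b)    { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3  sub(Vec3 a, Vec3 b)    { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  scale(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float len(Vec3 a)            { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Illustrative radii and coefficients; the real values were hand-tuned.
const float SEP_RADIUS = 0.5f, SEP_WEIGHT = 1.5f;
const float ARRIVE_WEIGHT = 0.8f, CENTER_WEIGHT = 0.05f;

// Acceleration on boid i from separation, arrival, and centering.
// Alignment and cohesion follow the same radius-gated pattern and are elided.
Vec3 computeAccel(const Vec3* pos, int n, int i, Vec3 target) {
    Vec3 accel = {0.0f, 0.0f, 0.0f};
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        Vec3 away = sub(pos[i], pos[j]);
        float d = len(away);
        if (d > 0.0f && d < SEP_RADIUS)  // each force applies only within its radius
            accel = add(accel, scale(away, SEP_WEIGHT / (d * d)));
    }
    accel = add(accel, scale(sub(target, pos[i]), ARRIVE_WEIGHT)); // arrival: pull toward a point-cloud target
    accel = add(accel, scale(pos[i], -CENTER_WEIGHT));             // weak centering toward the origin
    return accel;
}
```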

We originally implemented our flocking algorithm in Unity. When we ported the code to an OpenCL kernel, we found that the units were different, and every coefficient needed to be retuned to work well with the rest of the codebase. We also ran into trouble reading the .npy outputs from Point-E into C++ due to gaps in the cnpy documentation, and ended up handling these issues manually after reading in the arrays.

Our particle simulation is integrated with forward Euler, which we found to be sufficiently numerically stable while leaving a small memory footprint.

Particle colors are updated by linear interpolation, with weights determined by the timestep: $\mathbf c(t + \Delta t) = (1 - \Delta t)\, \mathbf c(t) + \Delta t\, \mathbf c_{\text{target}}(t)$, where $\mathbf c = (r, g, b)$ denotes a particle's color as a function of time and $\mathbf c_{\text{target}}$ denotes the target color given by Point-E.
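Putting the last two paragraphs together, the per-particle update is only a few lines. This sketch reuses Vec3 and the helpers from the flocking sketch above; the function name is ours.

```cpp
// One forward Euler step plus the color interpolation above.
void integrateParticle(Vec3& p, Vec3& v, Vec3 a, Vec3& c, Vec3 cTarget, float dt) {
    p = add(p, scale(v, dt));                          // p(t + dt) = p(t) + dt * v(t)
    v = add(v, scale(a, dt));                          // v(t + dt) = v(t) + dt * a(t)
    c = add(scale(c, 1.0f - dt), scale(cTarget, dt));  // c(t + dt) = (1 - dt) c(t) + dt c_target(t)
}
```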

Boid flocking with 4096 boids

Ray Marching

Our ray marching algorithm was originally built on top of Homework 3 and tested in that codebase, where we also enabled camera movement and continuous re-rendering. However, the result was very laggy and rendered images slowly; simply streaming camera inputs is insufficient to render the whole scene smoothly in real time. We therefore accelerated it on the GPU using OpenCL, as discussed in the following section.

Below is a video of our original approach.


We implemented smoothing with ray marching, such that two nearby spheres can merge and thus smooth out their collective surface. There is a wide variety of available smooth-min functions; we implemented the following approach. Let $d_1$ denote the distance from the current point along the ray to the first sphere and $d_2$ the distance to the second, and let $b = 10^{-3}$ be a buffer constant. Then ray marching uses a distance defined by $$\min(d_1, d_2) - \frac b6 \left(\frac{\max(b - |d_1 - d_2|, 0)}{b}\right)^3$$
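In code, this is a direct transcription (the function name is ours). Note that for $|d_1 - d_2| \ge b$ the correction term vanishes, so only surfaces within $b$ of each other blend.

```cpp
#include <algorithm>
#include <cmath>

// Cubic smooth min from the formula above, with buffer constant b.
float smoothMin(float d1, float d2, float b /* e.g. 1e-3f */) {
    float h = std::max(b - std::fabs(d1 - d2), 0.0f) / b;
    return std::min(d1, d2) - (b / 6.0f) * h * h * h;
}
```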

Computing normal vectors for smoothed surfaces relies on the signed distance field (SDF), defined by $f(\mathbf p) = \min_{\text{sphere}\ s = (\mathbf o, r)} \| \mathbf p - \mathbf o \| - r$ (that is, the distance to the nearest point on any sphere, with the smooth min above in place of the hard min when smoothing is enabled). We then estimate the gradient of $f$ at $\mathbf p$ by central differences, evaluating the SDF in the positive and negative directions along each of the three axes, and normalize these partial derivatives to obtain the normal vector on the smoothed surface at $\mathbf p$.
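A sketch of both pieces, reusing Vec3, sub, scale, len, and smoothMin from above; the step size EPS is an illustrative choice.

```cpp
struct Sphere { Vec3 center; float radius; };

// Smoothed scene SDF: fold every sphere's distance in with smoothMin.
float sceneSDF(Vec3 p, const Sphere* spheres, int n) {
    float d = 1e30f;
    for (int i = 0; i < n; ++i)
        d = smoothMin(d, len(sub(p, spheres[i].center)) - spheres[i].radius, 1e-3f);
    return d;
}

// Central-difference gradient of the SDF, normalized to a unit normal.
// The 1/(2 * EPS) factor cancels in the normalization and is omitted.
Vec3 estimateNormal(Vec3 p, const Sphere* s, int n) {
    const float EPS = 1e-4f;
    Vec3 g = {
        sceneSDF({p.x + EPS, p.y, p.z}, s, n) - sceneSDF({p.x - EPS, p.y, p.z}, s, n),
        sceneSDF({p.x, p.y + EPS, p.z}, s, n) - sceneSDF({p.x, p.y - EPS, p.z}, s, n),
        sceneSDF({p.x, p.y, p.z + EPS}, s, n) - sceneSDF({p.x, p.y, p.z - EPS}, s, n),
    };
    return scale(g, 1.0f / len(g));
}
```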

Blue Whale with smoothing, 512 boids
Boid flocking with smoothing, 1024 boids

Ray Tracing

We also implemented traditional ray tracing in OpenCL as a faster rendering option. We based our implementation on Homework 3, translating it into an OpenCL kernel specialized for rendering multiple colored, diffuse spheres under a single area light, with multiple camera rays and multiple direct-lighting samples per pixel. Due to OpenCL limitations and a lack of time, we only implemented zero-bounce and direct lighting radiance. We did implement light blocking, however, which produces some nice soft shadows. Unfortunately, to maintain a smooth frame rate, a large amount of noise remains in our final renders.
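For reference, here is a hedged C++ sketch of a single direct-lighting sample of the kind the kernel takes; samplePointOnLight, occluded, and the light constants are hypothetical stand-ins rather than our actual kernel interface, and Vec3 and its helpers are as in the earlier sketches.

```cpp
#include <algorithm>

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  mul(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }

Vec3  samplePointOnLight();                       // hypothetical: uniform point on the area light
bool  occluded(Vec3 from, Vec3 dir, float maxT);  // hypothetical: does any sphere block the shadow ray?
extern Vec3  lightRadiance, lightNormal;
extern float lightArea;

// One direct-lighting sample at a diffuse hit point. Uniform area sampling
// has pdf 1/area, so dividing by the pdf multiplies by lightArea; averaging
// several samples per pixel produces the soft shadows shown below.
Vec3 directLighting(Vec3 hit, Vec3 n, Vec3 albedo) {
    Vec3 toLight = sub(samplePointOnLight(), hit);
    float dist = len(toLight);
    Vec3 wi = scale(toLight, 1.0f / dist);
    if (occluded(hit, wi, dist)) return {0.0f, 0.0f, 0.0f};   // light blocking
    float cosSurf  = std::max(dot(n, wi), 0.0f);
    float cosLight = std::max(-dot(lightNormal, wi), 0.0f);
    float G = cosSurf * cosLight / (dist * dist);             // geometry term
    const float INV_PI = 0.31830988f;                         // Lambertian BRDF = albedo / pi
    return scale(mul(albedo, lightRadiance), INV_PI * G * lightArea);
}
```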

Basketball with light blocking

OpenCL

Due to the low frame rate of CPU ray tracing, we decided to accelerate our rendering using OpenCL. Since Apple has deprecated OpenGL on macOS, we could not use built-in OpenGL compute shaders for our ray tracing kernel. After surveying options such as Vulkan, Metal, and OpenCL, we ultimately decided on OpenCL due to its platform portability and its mix of ease of use and low-level optimization capability. We used OpenCL in this project to parallelize the flocking algorithm and all three of our rendering implementations.

Our flocking algorithm is split into two kernels: one calculates accelerations based on the positions of other boids, and one integrates the acceleration to update position and velocity. These kernels are executed on the GPU once per particle each frame.

In our ray tracing and ray marching kernels, we randomly generate rays from each pixel of the camera and check for intersections with each sphere in the scene. If we detect a hit, we cast extra rays from the hit point, randomly sampling the area light to determine the radiance for that pixel. This runs in parallel on the GPU with one thread per pixel. For ray marching, we instead step the position along the ray according to the signed distance field to find intersections, which significantly increases rendering time but allows smoothing without modifying any explicit geometry. The marching loop is sketched below.
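A minimal sphere-tracing loop, reusing add, scale, Sphere, and sceneSDF from above; the iteration cap and thresholds are illustrative.

```cpp
// Advance along the ray by the SDF value, which is the largest step
// guaranteed not to pass through any surface.
bool march(Vec3 origin, Vec3 dir, const Sphere* s, int n, Vec3* hit) {
    float t = 0.0f;
    for (int step = 0; step < 128 && t < 100.0f; ++step) {
        Vec3 p = add(origin, scale(dir, t));
        float d = sceneSDF(p, s, n);
        if (d < 1e-4f) { *hit = p; return true; }  // within threshold: surface hit
        t += d;                                    // safe step: nothing is closer than d
    }
    return false;                                  // ray escaped or ran out of steps
}
```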

We ran into a number of challenges with OpenCL. None of us had GPU experience at this level or any experience with OpenCL, so we had to learn the environment from scratch and navigate GPU-specific concepts such as work-group thread synchronization via barrier calls, command queues, and GPU buffers.


Deprecation

Apple dropped support for OpenCL in favor of their proprietary Metal API, leading to some incredibly difficult-to-trace bugs. After 4 hours of debugging our first draft of the ray tracing compute shader, we realized that functions with pointer arguments had unpredictable behavior, and we were able to proceed by limiting our use of helper functions.


Speed

Despite reducing the number of operations by an order of magnitude, we still needed several optimizations to raise the frame rate to a level acceptable for a stable simulation and a comfortable user experience: utilizing local memory and work groups on the GPU, choosing an optimal work group size, and using an acceleration structure. These optimizations are benchmarked in the tables in the Experiments section.

Our work group optimizations were generally based on finding a spatial sorting of the points, using a permutation index array that groups together nearby points and minimizes the bounding box of each group. This is very similar to a standard BVH; however, we fix each leaf's size to match the work group size and generate bounding boxes only for the leaves, rather than traversing the BVH hierarchy from the root.

Then, for the intersection check in our ray tracing kernel, each thread first intersects its ray with the bounding box of the current BVH leaf. If all rays miss the bounding box, all threads in the work group skip to the next leaf. This takes advantage of the fact that rays within a work group have similar trajectories, avoiding expensive, unnecessary accesses to global memory.
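The bounding-box check itself is a standard slab test, sketched here in C++ (Vec3 as above); in the kernel, the all-threads-missed vote is done through local memory and a barrier.

```cpp
#include <algorithm>

// invDir holds the componentwise reciprocal of the ray direction,
// precomputed once per ray.
bool hitAABB(Vec3 origin, Vec3 invDir, Vec3 boxMin, Vec3 boxMax) {
    float o[3]  = {origin.x, origin.y, origin.z};
    float id[3] = {invDir.x, invDir.y, invDir.z};
    float mn[3] = {boxMin.x, boxMin.y, boxMin.z};
    float mx[3] = {boxMax.x, boxMax.y, boxMax.z};
    float t0 = 0.0f, t1 = 1e30f;
    for (int a = 0; a < 3; ++a) {
        float tNear = (mn[a] - o[a]) * id[a];
        float tFar  = (mx[a] - o[a]) * id[a];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
        if (t0 > t1) return false;  // slab intervals stopped overlapping: miss
    }
    return true;
}
```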

We tested two different acceleration structures to generate the BVH leaves used in our acceleration algorithm.

  • We implemented a low-depth BVH with splits along the longest axis of the bounding box, motivated by wanting leaf nodes with uniform dimensions. Each split exactly partitions the points into two equally sized groups, so we get a perfectly balanced binary tree. Since we have 16 work groups of size 256 each, splitting four times divides the boids exactly into work groups.
  • We also implemented Morton codes for a Z-order space-filling curve, following the description here. For a point $(x, y, z)$, we compute the binary representations $x = 0.x_1 x_2 \dots$ (where $x_i \in \{0, 1\}$), and likewise for $y$ and $z$. Encoding the point as $0.x_1 y_1 z_1 x_2 y_2 z_2 \dots$ yields a spatially local hash: points with nearby encodings tend to be nearby in space (see the sketch after this list). We chose this parameterization because its bit-interleaved keys can be sorted with a parallel radix sort.
    A disadvantage of the Z-order curve is that there are occasionally large jumps between consecutive points, which results in discontinuous work groups and thus excessively large bounding boxes with moderate probability.
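A minimal C++ sketch of the encoding (10 bits per axis; the quantization resolution is our choice):

```cpp
#include <algorithm>
#include <cstdint>

// Spread the low 10 bits of v so consecutive bits end up 3 apart
// (the standard "Part1By2" bit trick for Morton codes).
static uint32_t part1By2(uint32_t v) {
    v &= 0x000003FFu;
    v = (v ^ (v << 16)) & 0xFF0000FFu;
    v = (v ^ (v <<  8)) & 0x0300F00Fu;
    v = (v ^ (v <<  4)) & 0x030C30C3u;
    v = (v ^ (v <<  2)) & 0x09249249u;
    return v;
}

// x, y, z normalized to [0, 1]; sorting boids by this 30-bit key orders
// them along the Z-order curve, and radix sort works directly on the bits.
uint32_t mortonCode(float x, float y, float z) {
    uint32_t xi = part1By2(std::min((uint32_t)(x * 1024.0f), 1023u));
    uint32_t yi = part1By2(std::min((uint32_t)(y * 1024.0f), 1023u));
    uint32_t zi = part1By2(std::min((uint32_t)(z * 1024.0f), 1023u));
    return (xi << 2) | (yi << 1) | zi;  // bit order: x1 y1 z1 x2 y2 z2 ...
}
```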

Below are demonstrations of our two approaches (this mode is available with a keyboard toggle), where each color represents a leaf node.

BVH Acceleration
Z-Order Sort Acceleration

Limitations

OpenCL C is based on a subset of C99 and supports neither pass-by-reference nor recursion, which significantly limited our ray tracing implementation, since ray tracing is naturally recursive. Combined with the unreliability of pointer arguments described above and the low frame rate of our zero-bounce results, this meant we were unable to implement bounces or any more advanced ray tracing techniques.

OpenCL also lacks a built-in random function, so we used a #define macro and some bit tricks found online to build a custom pseudorandom number generator. It can generate several random numbers in a row within a thread, and it maintains randomness across kernel dispatches by saving its seed to a buffer between frames.
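As an illustration of the kind of generator we mean, here is a common GPU-friendly pattern (a Wang hash to scramble the per-thread seed, then an xorshift step per draw), written as C++ functions rather than the macro form used in our kernel; our exact bit tricks may differ.

```cpp
#include <cstdint>

// Scramble a per-pixel/per-thread seed once per frame.
static uint32_t wangHash(uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed ^= seed >> 4;
    seed *= 0x27d4eb2du;
    seed ^= seed >> 15;
    return seed;
}

// xorshift32 step; calling repeatedly yields a sequence of draws, and the
// final state can be written back to a buffer to persist across dispatches.
static float randFloat(uint32_t* state) {
    *state ^= *state << 13;
    *state ^= *state >> 17;
    *state ^= *state << 5;
    return (*state & 0x00FFFFFFu) / 16777216.0f;  // uniform in [0, 1)
}
```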

Application Development

We began with starter code that rendered fractals to a window and switched between them using keyboard interactions. We therefore needed to build some functionality to adapt our project to this environment, including:

  • Camera movement and zoom via click, drag, and scroll.
  • Keyboard inputs to switch between different point clouds and pause the simulation.
  • Debug scripts to show each BVH leaf node's particles in a different color.
  • Integrated ray marching and ray tracing kernels, with a keybind to toggle between them.

Experiments

All of our experiments use a pixel resolution of 800 x 600. All FPS and timing data are averages over 500 frames.

Ray algorithm

Our experiments use 4096 boids, 1 ray per pixel, and 1 light sample per pixel.

We observe that simulation time is generally negligible compared to rendering time for higher quality renders, and that ray marching with smoothing is roughly 7 times slower than traditional ray tracing with light blocking.

Method                            FPS    Ray casting time (ms)   Simulation time (ms)
Ray tracing (light blocking)      33.4   24.4                    3.9
Ray marching                      6.1    174.3                   4.0
Ray tracing (no light blocking)   81.9   5.4                     4.3

Work group size

Our experiments use 4096 boids, 1 ray per pixel, and 1 light sample per pixel.

A larger work group size lets us better exploit local memory speedups; however, it also yields larger bounding boxes, reducing the chance of skipping a BVH leaf. After experimenting with work group size, we found that an 8x8 work group offered the best performance when combined with the BVH, giving us 64 leaves of 64 spheres each. Compared to the 16x16 maximum work group size offered by the Apple M1 GPU, this was roughly a 25% speedup.

Work group size   FPS   Rendering time (ms)
16 x 16           37    20.2
8 x 8             44    15.6

Acceleration algorithm

Our experiments use ray tracing with light blocking, 4096 boids, 1 ray per pixel, 4 light samples per pixel, and a 16x16 work group size.

We implemented a basic BVH of depth 4 with binary splits along the longest axis of the bounding box. This gives 16 leaves of size 256, matching the size of each work group. We also tried sorting the points along the Z-order curve, a space-filling curve, so that the points within each contiguous chunk of 256 should be close to each other.

While the Z-order sort has a considerably lower construction time, its rendering is about 50% slower than the basic BVH, likely because discontinuities in the curve cause our arbitrary chunking to produce unnecessarily large bounding boxes.

Overall, the BVH approach gave us a 4x rendering speedup over using no acceleration structure.

Acceleration structure   FPS    Construction time (ms)   Rendering time (ms)   Simulation time (ms)
None                     13.4   0                        67.1                  3.9
BVH                      42.6   0.94                     16.9                  4.0
Z-order sort             33     0.35                     24                    4.0

GPU Memory Optimization

Our experiments use ray tracing without light blocking, 4096 boids, 4 rays per pixel, and 4 light samples per pixel.

Memory from the host program on the CPU must be uploaded to the GPU's global memory, but each work group on the GPU also has access to a faster local memory shared within the work group. We take advantage of this by having each thread fetch data from global memory into local memory in parallel, then running all required $O(n)$ operations on local memory instead. Between these fetches, we must also call barrier() to ensure thread synchronization within the work group.
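A CPU-side sketch of this tiled access pattern, reusing Vec3 and Sphere from above. In the kernel, each of the GROUP_SIZE threads copies one sphere into __local memory and barrier(CLK_LOCAL_MEM_FENCE) separates the copies from the reads; here a sequential loop plays both roles, and intersectSphere is a hypothetical helper.

```cpp
#include <algorithm>

float intersectSphere(Vec3 o, Vec3 d, Sphere s);  // hypothetical: hit t, or 1e30f on miss

const int GROUP_SIZE = 64;  // e.g. an 8x8 work group

float nearestHit(Vec3 origin, Vec3 dir, const Sphere* spheres, int n) {
    Sphere tile[GROUP_SIZE];                    // stand-in for __local memory
    float best = 1e30f;
    for (int base = 0; base < n; base += GROUP_SIZE) {
        int count = std::min(GROUP_SIZE, n - base);
        for (int i = 0; i < count; ++i)         // cooperative copy: one element per thread
            tile[i] = spheres[base + i];
        for (int i = 0; i < count; ++i)         // every thread then reads fast local memory
            best = std::min(best, intersectSphere(origin, dir, tile[i]));
    }
    return best;
}
```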

This optimization gave us about a 33% speedup.

Memory          FPS    Rendering time (ms)   Simulation time (ms)
Global memory   34.3   22.2                  4.0
Local memory    46     14.9                  4.0

Gallery

Giant Basketball Dunking on Hoop of Pretzels
Digital Camera on a Tripod
Cupcake Wearing Boxing Gloves
Godzilla Battling Giant Soup Can
Guitar
Loch Ness Monster Knitting a Sweater
Octopus
Panda
Bouquet of Roses
Shrek
Snail
British Telephone Booth
Coke Bottle
Family of Deer
Mario
Snail
Snail BVH Visualization
Snail Z-Order Sort Visualization
Snail Raymarching with Smoothing
Guitar Raymarching with Smoothing
Bouquet of Roses with Light Blocking

Member Contributions

Ralph Cao

  • Unity demo for the flocking algorithm
  • Reimplemented real-time ray tracing in OpenCL from scratch based on Homework 3, along with rewriting the camera math and camera movement code
  • Implemented flocking algorithm in OpenCL using a compute shader for speedup
  • Tested and implemented different GPU optimizations listed above
  • Implemented BVH visualizations

Arjun Damerla

  • Implemented Ray Marching
  • Implemented different smoothing algorithms for ray marching, ultimately settling on one of them

Preston Fu

  • Point cloud pipeline (LLM prompting, 2D image generation, 3D point cloud generation, filtering, C++ numpy input)
  • Parts of the camera and flocking kernels
  • Timing utility functions for performance testing

Julia Sun

  • Assisted with first iteration of ray marching
  • Created project proposal and milestone report, and finalized the final project webpage