SmallBurger: Development Insights: GPUDrivenForwardPlus

I have always been deeply fascinated by Forward Plus lighting architectures. Before the official URP Forward Plus was released, I implemented a custom Forward Plus solution for URP built on a depth pre-pass, called ForwardPlusURP.

The core logic of that version followed the classic Tile-based Forward+ approach: it rendered a depth pre-pass to calculate the depth range for each tile and performed light culling to reduce the cost of light loops in the fragment shader.

The Challenges of Mobile Platforms

However, during practical testing and development, I realized that this path faced structural issues on mobile platforms:

Cost of Depth Pre-pass: On mobile GPUs (TBDR architecture), a depth pre-pass submits the scene geometry a second time. This adds a significant geometry rendering stage and often interferes with hardware-level optimizations such as cache efficiency and Early-Z behavior.
Dependency on Interlock Operations: The implementation relied heavily on Pixel Interlock (or similar atomic operations) to maintain light lists. On mobile GPUs—specifically Mali and PowerVR architectures—this often leads to pipeline serialization and unstable performance.

Due to these factors, I decided not to pursue that specific direction further.

A New Inspiration: Clustered Shading

When the official URP Forward Plus (based on CPU-side Job System light culling) was released, it provided a different perspective: a "stable, predictable, and depth-pre-pass-independent" light management model.

This sparked a new question: Is it possible to design a GPU-driven light culling architecture specifically for mobile—one that doesn't rely on depth pre-passes or heavy interlocks, while maintaining compatibility with the standard Forward workflow?

This goal led me to explore Clustered Forward Plus, resulting in the implementation of GPUDrivenForwardPlus.

Implementation Workflow

Frustum Voxelization: Subdivide the Camera's View Frustum into AABB Clusters.
GPU Light Culling: Use a Compute Shader to perform intersection tests between AABB Clusters and visible lights.
Light Indexing: Record the culling results, including light indices and counts, into a buffer.
Shading: Convert screen-space coordinates to the corresponding cluster index to retrieve light data for final shading.
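
The four steps above can be sketched on the CPU as follows. This is only an illustration, not the project's compute shader: the grid size (16 x 9 x 24), the near/far planes, the field of view, the exponential depth slicing, and the +Z view direction are all assumptions chosen for the example, and every function name is hypothetical. Only point lights (as bounding spheres) are handled.

```python
import math

# Hypothetical grid and camera parameters; the article does not state its values.
GRID_X, GRID_Y, GRID_Z = 16, 9, 24
NEAR, FAR = 0.1, 100.0
FOV_Y = math.radians(60.0)
ASPECT = 16.0 / 9.0
MAX_LIGHTS_PER_CLUSTER = 8  # per-cluster cap mentioned in the article


def slice_depth(k):
    """View-space depth of slice boundary k (exponential spacing, an assumption)."""
    return NEAR * (FAR / NEAR) ** (k / GRID_Z)


def cluster_aabb(cx, cy, cz):
    """Step 1: view-space AABB enclosing one frustum cell (camera looks down +Z)."""
    tan_y = math.tan(FOV_Y * 0.5)
    tan_x = tan_y * ASPECT
    # Cell edges in NDC, range [-1, 1].
    x0, x1 = 2.0 * cx / GRID_X - 1.0, 2.0 * (cx + 1) / GRID_X - 1.0
    y0, y1 = 2.0 * cy / GRID_Y - 1.0, 2.0 * (cy + 1) / GRID_Y - 1.0
    z0, z1 = slice_depth(cz), slice_depth(cz + 1)
    # The cell is a small frustum; its corners sit on the near and far slice planes.
    xs = [x * tan_x * z for x in (x0, x1) for z in (z0, z1)]
    ys = [y * tan_y * z for y in (y0, y1) for z in (z0, z1)]
    return (min(xs), min(ys), z0), (max(xs), max(ys), z1)


def sphere_intersects_aabb(center, radius, aabb_min, aabb_max):
    """Step 2: compare squared distance from sphere center to box with radius squared."""
    d2 = 0.0
    for c, lo, hi in zip(center, aabb_min, aabb_max):
        nearest = min(max(c, lo), hi)  # closest point on the box along this axis
        d2 += (c - nearest) ** 2
    return d2 <= radius * radius


def cull_lights(lights):
    """Steps 2-3: fill a per-cluster count array and a fixed-stride index array."""
    n = GRID_X * GRID_Y * GRID_Z
    counts = [0] * n
    indices = [0] * (n * MAX_LIGHTS_PER_CLUSTER)
    for cz in range(GRID_Z):
        for cy in range(GRID_Y):
            for cx in range(GRID_X):
                cid = cx + GRID_X * (cy + GRID_Y * cz)
                lo, hi = cluster_aabb(cx, cy, cz)
                for li, (pos, radius) in enumerate(lights):
                    if counts[cid] == MAX_LIGHTS_PER_CLUSTER:
                        break  # cluster is full; remaining lights are dropped
                    if sphere_intersects_aabb(pos, radius, lo, hi):
                        indices[cid * MAX_LIGHTS_PER_CLUSTER + counts[cid]] = li
                        counts[cid] += 1
    return counts, indices


def cluster_index(px, py, view_z, width, height):
    """Step 4: map a pixel and its view-space depth to a flat cluster index."""
    cx = min(int(px * GRID_X / width), GRID_X - 1)
    cy = min(int(py * GRID_Y / height), GRID_Y - 1)
    cz = int(math.log(view_z / NEAR) / math.log(FAR / NEAR) * GRID_Z)
    cz = min(max(cz, 0), GRID_Z - 1)
    return cx + GRID_X * (cy + GRID_Y * cz)
```

In this layout each cluster writes only to its own count and its own fixed-size slice of the index array. If the triple loop were mapped to one GPU thread per cluster, no thread would need to write into another cluster's slot, which is one way to avoid atomics or interlocks when building the lists. Whether GPUDrivenForwardPlus uses exactly this layout is not stated here.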

Key Features & Optimizations

Custom Light Culling: Since the default Visible Lights count in Unity’s SRP is limited to 32 (including the main light), I re-implemented the culling logic to support up to 64 lights per camera.
Performance Balancing: To ensure performance stability, each cluster currently supports a maximum of 8 additional lights.
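Mobile Friendly: By avoiding the Depth Pre-pass, this solution skips the extra geometry pass that is costly on mobile hardware. Because clusters partition the whole frustum in depth rather than relying on a depth buffer, Alpha-blended (Transparent) objects look up their lights the same way opaque objects do.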
Supported Light Types: Currently supports Point Lights and Spot Lights.
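
The two caps above (64 lights per camera, 8 additional lights per cluster) make a compact fixed-size record per cluster possible. The sketch below is one hypothetical encoding, not necessarily the one the project uses: it assumes one byte per light index (64 values need only 6 bits) and packs the 8 slots into two 32-bit words. The function names are made up for the example.

```python
MAX_VISIBLE_LIGHTS = 64      # per-camera cap from the article
MAX_LIGHTS_PER_CLUSTER = 8   # per-cluster cap from the article
BITS_PER_INDEX = 8           # assumption: one byte per index


def pack_cluster(light_indices):
    """Pack up to 8 light indices into two 32-bit words; extra lights are dropped."""
    kept = light_indices[:MAX_LIGHTS_PER_CLUSTER]
    words = [0, 0]
    for slot, idx in enumerate(kept):
        if not 0 <= idx < MAX_VISIBLE_LIGHTS:
            raise ValueError(f"light index {idx} outside 0..{MAX_VISIBLE_LIGHTS - 1}")
        word, lane = divmod(slot, 4)  # four one-byte lanes per 32-bit word
        words[word] |= idx << (lane * BITS_PER_INDEX)
    return len(kept), words[0], words[1]


def unpack_cluster(count, word0, word1):
    """Recover the light indices a fragment would loop over for its cluster."""
    words = (word0, word1)
    return [(words[s // 4] >> ((s % 4) * BITS_PER_INDEX)) & 0xFF for s in range(count)]
```

For example, `pack_cluster([3, 63, 0, 17, 42])` yields a count of 5 plus two words, and `unpack_cluster` on that result returns the same five indices. With this encoding each cluster costs a count plus 8 bytes of indices regardless of how many lights are in the scene, which keeps the buffer size predictable.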