Unlike traditional methods that rely on massive Transform Buffers or GameObject instances, this approach dynamically generates grass using ComputeShaders. For culling, I use a RenderTexture to store results (since SSBOs aren’t supported in many mobile VertexShaders), ensuring compatibility with most mobile devices.
This architecture significantly improves load times and performance, especially on mid- to low-end phones. Let’s break down the key design concepts step by step.
Why Not Use Traditional TransformBuffer-Based Grass?
Unity’s built-in grass and tree systems, or approaches using DrawMeshInstanced/DrawMeshInstancedIndirect, often require the CPU to prepare huge amounts of data before passing it to the GPU. When your scene contains hundreds of thousands or even millions of blades of grass, the memory and load cost of Transform Buffers becomes overwhelming.
My solution: Don’t prepare this data at all. Let the GPU handle procedural generation. This way, grass distribution, size, rotation, and wind can be generated dynamically—saving memory and increasing flexibility.
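To put rough numbers on that cost, the sketch below compares a per-instance transform buffer with the two-pixel culling record used later in this post. The 64-byte figure assumes one float4x4 matrix per instance, and both instance counts are hypothetical; none of these values were measured in this project. The 16-byte figure follows from two ARGBHalf pixels (4 channels of 2 bytes each).

```csharp
using System;

// Assumption: a traditional transform buffer stores one 4x4 float matrix per instance.
const int bytesPerMatrix = 16 * sizeof(float);   // 64 bytes
// Two ARGBHalf pixels per visible instance: 2 pixels * 4 channels * 2 bytes.
const int bytesPerFilterRecord = 2 * 4 * 2;      // 16 bytes

static double ToMiB(long bytes) => bytes / (1024.0 * 1024.0);

long totalInstances = 1_000_000;   // hypothetical scene-wide blade count
long visibleInstances = 100_000;   // hypothetical count that survives culling

long transformBufferBytes = totalInstances * bytesPerMatrix;
long filterResultBytes = visibleInstances * bytesPerFilterRecord;

Console.WriteLine($"Transform buffer: {ToMiB(transformBufferBytes):F1} MiB");
Console.WriteLine($"Filter result RT: {ToMiB(filterResultBytes):F1} MiB");
```

The transform buffer grows with every blade in the scene, while the culling result only needs room for the blades that can be visible at once.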
1. Creating and Binding CountBuffer and RenderTexture (No AppendBuffer)
First, create and bind the CountBuffer for instance counting in the ComputeShader. Next, set up a RenderTexture to store culling results, and bind it to both the ComputeShader and the material.
m_instanceCountBuffer = new ComputeBuffer(1, sizeof(uint),
    ComputeBufferType.Counter);
m_instanceCountBuffer.SetCounterValue(0);
m_targetProceduralInstanceFilterCS.SetBuffer(0, msr_instanceCountBufferID,
    m_instanceCountBuffer);
m_filterResultRT = new RenderTexture(m_filterResultSize.x,
    m_filterResultSize.y, 0, RenderTextureFormat.ARGBHalf);
m_filterResultRT.enableRandomWrite = true;
m_filterResultRT.filterMode = FilterMode.Point;
m_filterResultRT.Create();
// ...
m_targetProceduralInstanceFilterCS.SetTexture(0, msr_filterResultRTID,
    m_targetFilterResultRT);
m_targetGrassMaterial.SetTexture(msr_filterResultRTID,
    m_targetFilterResultRT);
2. Visible Cell Culling List
Use the VisableCellsCuller class to get the list of visible cells and pass it to the ComputeShader, which uses it to compute world positions for the grass. Since each cell contains a fixed number of grass instances, the visible cell count also determines how many instances need processing, and therefore how many thread groups to dispatch.
private void OnBeginCameraRendering(ScriptableRenderContext context,
    Camera camera)
{
    if (IsSkipCamera(camera))
        return;
    var cullingCamera = camera;
#if UNITY_EDITOR
    if (m_isOnlyViewMainCameraCulling)
        cullingCamera = Camera.main;
#endif //UNITY_EDITOR
    Bounds renderBounds;
    Vector2 cellSize;
    m_grassParamsSector.GetRenderBoundAndCellSize(out renderBounds,
        out cellSize);
    m_visableCellsCuller.ProcessCulling(cullingCamera, in cellSize,
        in m_proceduralGrassData.m_worldRect, in msr_worldMinMaxHeight);
    var visibleCellIndices = m_visableCellsCuller.GetVisibleCellIndices();
    var visibleCellIndexCount = visibleCellIndices.Count;
    if (visibleCellIndexCount <= 0)
        return;
    m_targetProceduralInstanceFilterCS.SetVector(msr_cameraPositionID,
        camera.transform.position);
    visibleCellIndices.CopyTo(m_visibleCellIndices);
    m_visibleCellIndexBuffer.SetData(m_visibleCellIndices, 0, 0,
        visibleCellIndexCount);
}
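Because every cell holds the same number of blades, the total work follows directly from the visible cell count. The sketch below shows that arithmetic. The names `instancesPerCell`, `GetProcessInstanceCount`, and `DecodeInstanceIndex` are hypothetical, and the index decoding is only an assumption about how a flat instance index could map back to a cell; the project's actual GetInstanceTransform is not shown in this post.

```csharp
using System;

// Total instances to process: every visible cell contributes the same fixed number of blades.
static int GetProcessInstanceCount(int visibleCellCount, int instancesPerCell)
    => visibleCellCount * instancesPerCell;

// Assumed decoding of a flat instance index into
// (slot in the visible-cell list, blade index inside that cell).
static (int CellSlot, int LocalIndex) DecodeInstanceIndex(int instanceIndex, int instancesPerCell)
    => (instanceIndex / instancesPerCell, instanceIndex % instancesPerCell);

int visibleCellCount = 300;      // hypothetical culling result
int instancesPerCell = 256;      // hypothetical fixed density
int total = GetProcessInstanceCount(visibleCellCount, instancesPerCell);
int groups = (total + 63) / 64;  // integer ceiling for 64 threads per group

var (slot, local) = DecodeInstanceIndex(1000, instancesPerCell);
Console.WriteLine($"{total} instances, {groups} thread groups; index 1000 -> cell slot {slot}, blade {local}");
```

With these numbers, 300 visible cells produce 76,800 instances and 1,200 thread groups, and instance 1000 falls in cell slot 3 as blade 232.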
3. DispatchComputeInBatches
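Wrap the ComputeShader dispatch logic in a DispatchComputeInBatches function to handle the limit of 65,535 thread groups per dimension in a single dispatch. If the required group count exceeds this limit, the work is split into multiple dispatches, each with its own starting offset.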
private static void DispatchComputeInBatches(ComputeShader targetComputeShader,
    int processInstanceCount, int kernel = 0, int threadGroupSize = 64,
    int maxDispatchCount = 65535)
{
    // Set the total number of instances for the shader to access
    // (for bounds checking)
    targetComputeShader.SetInt(msr_maxProcessCountID, processInstanceCount);
    // 'offset' tracks how many instances have already been processed
    int offset = 0;
    while (offset < processInstanceCount)
    {
        // Calculate how many instances remain to be processed in this batch
        int remainInstance = processInstanceCount - offset;
        // Dynamically calculate how many thread groups are needed for this batch
        // This is crucial: the last batch may not fill a whole thread group,
        // so we must recalculate based on the remaining instances.
        int groupCountThisBatch = Mathf.CeilToInt(remainInstance / (float)threadGroupSize);
        // Clamp the number of thread groups to the API limit (65535)
        int dispatchThisBatch = Mathf.Min(groupCountThisBatch, maxDispatchCount);
        // Set the offset for this batch so the shader knows the starting index
        targetComputeShader.SetInt(msr_offestCountID, offset);
        // Dispatch the current batch
        targetComputeShader.Dispatch(kernel, dispatchThisBatch, 1, 1);
        // Update offset to mark processed instances
        offset += dispatchThisBatch * threadGroupSize;
    }
}
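To see how the batching plays out, the sketch below reruns the same loop without Unity and records each batch instead of dispatching it. `PlanBatches` is a hypothetical helper written for this illustration. It uses an integer ceiling instead of Mathf.CeilToInt so that it runs as a plain .NET program.

```csharp
using System;
using System.Collections.Generic;

// Mirrors the loop in DispatchComputeInBatches, but records each batch instead of dispatching it.
static List<(int Offset, int GroupCount)> PlanBatches(
    int processInstanceCount, int threadGroupSize = 64, int maxDispatchCount = 65535)
{
    var batches = new List<(int Offset, int GroupCount)>();
    int offset = 0;
    while (offset < processInstanceCount)
    {
        int remaining = processInstanceCount - offset;
        // Integer ceiling: the last batch may end with a partially filled thread group.
        int groupCount = (remaining + threadGroupSize - 1) / threadGroupSize;
        int groupsThisBatch = Math.Min(groupCount, maxDispatchCount);
        batches.Add((offset, groupsThisBatch));
        offset += groupsThisBatch * threadGroupSize;
    }
    return batches;
}

foreach (var (start, groups) in PlanBatches(10_000_000))
    Console.WriteLine($"offset={start}, groups={groups}");
```

For 10,000,000 instances this yields three batches: two full ones of 65,535 groups (4,194,240 instances each) starting at offsets 0 and 4,194,240, and a final one of 25,180 groups starting at offset 8,388,480.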
4. ComputeShader Implementation
The core of the kernel is the procedural generation of transforms, using SimplexNoise for natural height and width variation. Culling runs in four steps: a world-bounds check, a maximum view distance, an optional WeightMap filter (grass only appears where its weight exceeds the minimum threshold), and a view-frustum test in clip space. Instances that survive are written to the RenderTexture.
/// <summary>
/// Author: SmallBurger Inc
/// Date: 2025/09/19
/// Desc:
/// </summary>
#pragma kernel CSMain
// ...
[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    uint instanceIndex = id.x + _OffsetCount;
    if (instanceIndex >= _MaxProcessCount)
        return;
    float2 position2D, randomOffest2D;
#if !defined(PROCESS_BILLBOARD)
    half yawRadians;
    GetInstanceTransform(instanceIndex, position2D, randomOffest2D,
        yawRadians);
#else
    GetInstanceTransform(instanceIndex, position2D, randomOffest2D);
#endif //!defined(PROCESS_BILLBOARD)
    // Reject instances outside the world rectangle.
    if ((position2D.x < _WorldMinMax.x) ||
        (position2D.y < _WorldMinMax.y) ||
        (position2D.x > _WorldMinMax.z) ||
        (position2D.y > _WorldMinMax.w))
        return;
    // Reject instances beyond the maximum view distance (compared in squared form).
    float3 delta = _CameraPosition - float3(position2D.x, 0.0f, position2D.y);
    float viewSquareDistance = dot(delta, delta);
    if (viewSquareDistance > _MaxViewSquareDistance)
        return;
    // Optional WeightMap filter.
    half grassWeight = 1.0;
#ifdef PROCESS_WEIGHT_MAP_FILTER
    grassWeight = _WeightMap.SampleLevel(sampler_WeightMap, GetWorldUV(position2D), 0).a;
    if (grassWeight <= sc_skipWeightRatio)
        return;
#endif
    float4 absPosCS = abs(mul(_ViewProjectionMatrix,
        float4(position2D.x, 0.0, position2D.y, 1.0)));
    //do culling test in clip space, result is the same as doing
    //test in NDC space.
    //prefer clip space here because doing culling test in clip space
    //is faster.
    if ((absPosCS.z > absPosCS.w) ||
        (absPosCS.y > absPosCS.w * _MaxInstanceSize) ||
        (absPosCS.x > absPosCS.w * _MaxInstanceSize))
        return;
    // Reserve a slot for this visible instance; each instance occupies two texels.
    uint index = _InstanceCountBuffer.IncrementCounter();
    uint offestIndex = index * 2;
    // Texel 0: world UV and size.
    half2 worldUV = GetWorldUV(position2D);
    _FilterResultRT[int2(offestIndex % _FilterResultRTSize.x, offestIndex / _FilterResultRTSize.x)] =
        float4(worldUV, GetDestSize(position2D, grassWeight));
    ++offestIndex;
    half wind = CalculateWind(position2D);
#ifdef PROCESS_BILLBOARD
    float hash = frac(sin(dot(position2D, float2(12.9898, 78.233))) * 43758.5453);
    float angle = hash * 6.2831853;
    float2 randDir = float2(cos(angle), sin(angle));
    half3 randomAddToN =
        (_RandomNormalWeight * randDir.x) * _CameraRightWS +
        (_RandomNormalWeight * randDir.y) * _CameraForwardWS;
    half3 windAddToN = wind * (-0.25 * _CameraRightWS);
    half3 normalWS = normalize(half3(0,1,0) + randomAddToN);
#endif
    // Texel 1: wind plus yaw sine/cosine, or the billboard normal.
    _FilterResultRT[int2(offestIndex % _FilterResultRTSize.x, offestIndex / _FilterResultRTSize.x)] =
        float4(
            wind,
#if !defined(PROCESS_BILLBOARD)
            sin(yawRadians),
            cos(yawRadians),
            0.0
#else
            normalWS
#endif //!defined(PROCESS_BILLBOARD)
        );
}
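The clip-space frustum test in the kernel above can be checked on the CPU with a few hand-picked points. The sketch below is a minimal C# restatement of that comparison using System.Numerics. `IsCulledInClipSpace` and the sample positions are hypothetical, and the positions are assumed to be already multiplied by the view-projection matrix.

```csharp
using System;
using System.Numerics;

// Same comparison as the shader: |x|, |y|, |z| of the clip-space position against |w|.
// maxInstanceSize scales the x/y limits; values above 1 leave a margin so blades
// whose root is slightly off screen are still kept.
static bool IsCulledInClipSpace(Vector4 positionCS, float maxInstanceSize)
{
    Vector4 a = Vector4.Abs(positionCS);
    return a.Z > a.W
        || a.Y > a.W * maxInstanceSize
        || a.X > a.W * maxInstanceSize;
}

// Hypothetical clip-space positions with w = 1 and a margin factor of 1.2.
Console.WriteLine(IsCulledInClipSpace(new Vector4(0.5f, 0.2f, 0.3f, 1.0f), 1.2f)); // inside the frustum
Console.WriteLine(IsCulledInClipSpace(new Vector4(1.1f, 0.0f, 0.3f, 1.0f), 1.2f)); // off screen, inside the margin
Console.WriteLine(IsCulledInClipSpace(new Vector4(1.5f, 0.0f, 0.3f, 1.0f), 1.2f)); // beyond the margin
```

Only the third point is culled. Comparing against w directly avoids the per-instance divide that an NDC-space test would need.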
5. WeightMapFilter: Artistic Control Over Grass Distribution
Procedural generation alone can make grass look “too regular.” I added a WeightMapFilter mechanism so artists can paint grass weights directly onto a texture. The ComputeShader uses this filter map during generation, allowing artists to control large-scale distribution while procedural generation handles fine details.
With FilterMap
Without FilterMap
6. Prefer ComputeShader Over VertexShader for Instance Logic
Many DrawMeshInstancedIndirect solutions compute normals, random size, rotation, wind, etc., in the VertexShader, which repeats the same work for every vertex of a blade. When a value depends only on the instance, compute it once in the ComputeShader and store the result.
half wind = CalculateWind(position2D);
#ifdef PROCESS_BILLBOARD
float hash = frac(sin(dot(position2D, float2(12.9898, 78.233))) * 43758.5453);
float angle = hash * 6.2831853;
float2 randDir = float2(cos(angle), sin(angle));
half3 randomAddToN =
    (_RandomNormalWeight * randDir.x) * _CameraRightWS +
    (_RandomNormalWeight * randDir.y) * _CameraForwardWS;
half3 windAddToN = wind * (-0.25 * _CameraRightWS);
half3 normalWS = normalize(half3(0,1,0) + randomAddToN);
#endif
_FilterResultRT[int2(offestIndex % _FilterResultRTSize.x, offestIndex / _FilterResultRTSize.x)] =
    float4(
        wind,
#if !defined(PROCESS_BILLBOARD)
        sin(yawRadians),
        cos(yawRadians),
        0.0
#else
        normalWS
#endif //!defined(PROCESS_BILLBOARD)
    );
7. GrassShader Implementation
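Parse the RenderTexture data: each instance uses two pixels. The first holds the world UV (from which the position is recovered) and the size; the second holds the wind plus either the normal (billboard mode) or the yaw sine and cosine (non-billboard mode).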
void GetInstanceTransform(in uint instanceID, in float4 filterResultRTTexelSize, in float4 positionOS,
    in half3 normalOS, in half2 windDirection, in float4 interactorCollisionSphere,
    in half interactorAffectWeight, in half windNormalWeight,
    out half2 worldUV,
    out float3 positionWS,
    out half affectWeight,
    out half3 normalWS,
    out float viewDistance)
{
    float2 position2D;
    half2 sizeFactor;
    half wind;
#ifdef PROCESS_BILLBOARD
    GetGrassInstanceData(instanceID, filterResultRTTexelSize, worldUV, position2D, sizeFactor, wind,
        normalWS);
#else
    half yawSin, yawCos;
    GetGrassInstanceData(instanceID, filterResultRTTexelSize, worldUV, position2D, sizeFactor, wind,
        yawSin,
        yawCos);
#endif //PROCESS_BILLBOARD
    float3 instancePosition = float3(position2D.x, 0.0, position2D.y);
    viewDistance = length(_WorldSpaceCameraPos - instancePosition);
    affectWeight = positionOS.y;
#if !defined(PROCESS_BILLBOARD)
    GetRotateWorldPosition(positionOS.xyz, instancePosition, sizeFactor, yawSin, yawCos, positionWS);
#else
    GetBillboardWorldPosition(positionOS.xyz, instancePosition, sizeFactor, viewDistance, positionWS);
#endif
    float2 windOffest = windDirection * wind * affectWeight;
    positionWS.xz += windOffest;
    ApplyInteractorOffest(instancePosition, interactorCollisionSphere,
        step(0.5, affectWeight) * interactorAffectWeight, positionWS);
#if !defined(PROCESS_BILLBOARD)
    CalculateNormal(normalOS, yawSin, yawCos, windNormalWeight * affectWeight, windOffest, normalWS);
#endif
}
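The two-pixel layout used in this section comes down to simple index arithmetic, which the sketch below reproduces in C#. `GetTexelCoord` and `GetCapacity` are hypothetical helpers, and the 512 x 512 size is an assumed value for m_filterResultSize. The project's own GetGrassInstanceData is not shown in this post and may differ in detail.

```csharp
using System;

// Each visible instance owns two consecutive texels:
// record 0 at index * 2, record 1 at index * 2 + 1, laid out row by row.
static (int X, int Y) GetTexelCoord(int visibleIndex, int record, int rtWidth)
{
    int flat = visibleIndex * 2 + record;
    return (flat % rtWidth, flat / rtWidth);
}

// Largest number of visible instances a width x height texture can hold.
static int GetCapacity(int rtWidth, int rtHeight) => (rtWidth * rtHeight) / 2;

int width = 512, height = 512;   // assumed m_filterResultSize
var first = GetTexelCoord(300, 0, width);
var second = GetTexelCoord(300, 1, width);
Console.WriteLine($"instance 300 -> {first} and {second}, capacity {GetCapacity(width, height)}");
```

Here instance 300 lands on texels (88, 1) and (89, 1), and the texture holds up to 131,072 visible instances. The kernel as shown in this post does not check the counter against that capacity, so the texture should be sized for the worst-case visible count.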
8. Collision Handling
As mentioned, try to move as much logic as possible to the ComputeShader, especially when handling multiple interactors—doing this in the VertexShader is not ideal for performance. In this project, since there is only one interactor and moving it to the ComputeShader would require extra pixels, I kept it in the VertexShader. But feel free to refactor if needed.
void ApplyInteractorOffest(in float3 instancePosition, in float4 collisionSphere,
    in half applyWeight, inout float3 positionWS)
{
    float3 delta = instancePosition - collisionSphere.xyz;
    float squareDistance = dot(delta, delta);
    float maxCheckRadius = _MaxInstanceSize + collisionSphere.w;
    float maxCheckSquareRadius = maxCheckRadius * maxCheckRadius;
    if (squareDistance > maxCheckSquareRadius)
        return;
    // Falloff on squared distance (1 at center, 0 at the check radius); can be changed to smoothstep/pow/exp
    float falloff = saturate(1.0 - (squareDistance / maxCheckSquareRadius));
    // Direction: from interactor to instance (avoid divide-by-zero)
    float3 dir = (squareDistance > 1e-5) ? normalize(delta) : float3(0.0, 0.0, 0.0);
    // Use collisionSphere.w (the sphere radius) as base strength (multiply global scale here if needed)
    float magnitude = falloff * collisionSphere.w;// *_InteractorAffectWeight;
    // Displacement pushing the blade away from the interactor; only xz is applied
    float3 displacement = dir * magnitude;
    positionWS.xz += applyWeight * displacement.xz;
}
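The falloff above is computed from squared distances, so it is not linear in distance: halfway to the check radius the push is still at 75% of its peak. The sketch below evaluates the same formula on the CPU. `GetPushMagnitude`, the interactor radius of 1.0, and the maximum instance size of 0.5 are hypothetical values chosen for illustration.

```csharp
using System;

// Push strength used by ApplyInteractorOffest: (1 - d^2 / R^2) scaled by the sphere radius,
// where R = maxInstanceSize + sphereRadius is the check radius.
static float GetPushMagnitude(float distance, float sphereRadius, float maxInstanceSize)
{
    float maxCheckRadius = maxInstanceSize + sphereRadius;
    float ratio = (distance * distance) / (maxCheckRadius * maxCheckRadius);
    if (ratio > 1.0f)
        return 0.0f;   // outside the check radius: the blade is left untouched
    return (1.0f - ratio) * sphereRadius;
}

// Hypothetical interactor: radius 1.0, blades up to 0.5 units in size (check radius 1.5).
foreach (float d in new[] { 0.0f, 0.75f, 1.5f, 2.0f })
    Console.WriteLine($"d={d}: push={GetPushMagnitude(d, 1.0f, 0.5f)}");
```

If a softer or sharper response is wanted, this is the place to swap in a smoothstep or power curve, as the shader comment suggests.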
9. Loading Time
With no precomputed Transforms or runtime initialization, loading is extremely fast—often under 0.1 seconds. (Games like Ghost of Tsushima likely use similar techniques.)
Demo Video
GitHub
UnityURP-Procedural-DrawMeshInstancedIndirect
References & Further Reading
If you want to support more vegetation types, advanced brush tools (ComputeShader-based brushes), optimized multi-interactor collision flows, or mesh-conforming features, check out:
GPUPlantPainter
For more complex procedural workflows (procedural random curve vegetation effects, similar to Ghost of Tsushima), check out:
GPUGrassBladePainter