Performance Tuning

Packing Size

The packing parameter trades proving time against proof size:

Smaller values: Faster proving, larger proofs
Larger values: Slower proving, smaller proofs

Recommendation: Experiment with different packing sizes (4096, 8192, 16384) to find the optimal trade-off for your application.

Example

# Test with different packing sizes
./webgpu_prover '{"program": "app.wasm", "packing": 4096, "args": []}'
./webgpu_prover '{"program": "app.wasm", "packing": 8192, "args": []}'
./webgpu_prover '{"program": "app.wasm", "packing": 16384, "args": []}'

Compare proof generation time and proof size for each configuration.

GPU Threads

The gpu-threads parameter controls parallelism:

Can exceed physical GPU core count
Higher values may improve performance on high-end GPUs
Start with the default (equal to the packing size) and adjust from there

Example

# Custom GPU thread count
./webgpu_prover '{
  "program": "app.wasm",
  "packing": 8192,
  "gpu-threads": 16384,
  "args": []
}'

Constraint Optimization

Use Checked Functions Sparingly

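Checked functions (e.g., bn254fr_addmod_checked) add a constraint on every call. Use them only where the result must be constrained immediately; otherwise compute with the unchecked variant and assert the constraint explicitly where it is needed: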

// Good: Manual constraint when needed
bn254fr_addmod(sum, a, b);
// ... more operations ...
bn254fr_assert_add(sum, a, b);  // Add constraint at the end

// Less optimal: Immediate constraint for every operation
bn254fr_addmod_checked(sum, a, b);  // Adds constraint immediately

Batch Operations

Process multiple values together when possible:

// Less optimal: Process one at a time
for (int i = 0; i < 1000; i++) {
    bn254fr_mulmod(results[i], inputs[i], constant);
}

// Better: Use vectorized operations
vbn254fr_t vec_inputs, vec_results;
vbn254fr_alloc(vec_inputs, 1000);
vbn254fr_alloc(vec_results, 1000);
// ... populate vec_inputs ...
vbn254fr_mulmod_constant(vec_results, vec_inputs, &constant);

Minimize Constraint Depth

Keep your constraint graph shallow:

// Deep: Many sequential dependencies
bn254fr_mulmod(t1, a, b);
bn254fr_mulmod(t2, t1, c);
bn254fr_mulmod(t3, t2, d);
bn254fr_mulmod(result, t3, e);

// Shallow: Parallel operations
bn254fr_mulmod(t1, a, b);
bn254fr_mulmod(t2, c, d);
bn254fr_mulmod(result, t1, t2);
bn254fr_mulmod(result, result, e);  // same unchecked call as the deep version, for a like-for-like comparison
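
To make the difference in depth concrete for longer products, the sketch below compares the two shapes on plain 64-bit integers. It is an illustration under stated assumptions only: toy_mulmod, chain_product, tree_product, and the modulus TOY_P are hypothetical stand-ins and are not part of the bn254fr API. The point is the depth counter, not the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in modulus (2^31 - 1), chosen so products fit in 64 bits.
   This is NOT the BN254 scalar field; it only illustrates the shape. */
#define TOY_P 2147483647ULL

/* Hypothetical stand-in for a field multiplication. */
uint64_t toy_mulmod(uint64_t a, uint64_t b) {
    return ((a % TOY_P) * (b % TOY_P)) % TOY_P;
}

/* Sequential chain: ((v0 * v1) * v2) * ...
   Every multiplication depends on the previous one, so depth is n - 1. */
uint64_t chain_product(const uint64_t *v, size_t n, unsigned *depth) {
    assert(n > 0);
    uint64_t acc = v[0] % TOY_P;
    *depth = 0;
    for (size_t i = 1; i < n; i++) {
        acc = toy_mulmod(acc, v[i]);
        (*depth)++;
    }
    return acc;
}

/* Balanced tree: multiply neighbours pairwise, level by level.
   Overwrites the scratch array; depth is ceil(log2(n)). */
uint64_t tree_product(uint64_t *scratch, size_t n, unsigned *depth) {
    assert(n > 0);
    *depth = 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++) {
            scratch[i] = toy_mulmod(scratch[2 * i], scratch[2 * i + 1]);
        }
        if (n % 2) {
            scratch[half] = scratch[n - 1];  /* odd element carries to the next level */
        }
        n = half + (n % 2);
        (*depth)++;
    }
    return scratch[0] % TOY_P;
}
```

For five inputs, as in the example above, chain_product reports depth 4 and tree_product reports depth 3; for eight inputs the depths are 7 and 3. Both orderings perform the same number of multiplications (n - 1); only the dependency structure changes.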

Memory Management

Free field elements as soon as possible:

// Good: Free when done
{
    bn254fr_t temp;
    bn254fr_alloc(temp);
    // ... use temp ...
    bn254fr_free(temp);
}  // temp was already released by bn254fr_free above

// Less optimal: Keep allocated longer than needed
bn254fr_t temp;
bn254fr_alloc(temp);
// ... use temp ...
// ... lots of other code ...
bn254fr_free(temp);  // Freed much later

Profiling

Monitor proof generation time and identify bottlenecks:

time ./webgpu_prover '{
  "program": "app.wasm",
  "args": []
}'

The prover will output performance metrics including:

Constraint generation time
FFT computation time
Total proof generation time
