Performance Tuning

Packing Size

The packing parameter trades proving time against proof size:

  • Smaller values: Faster proving, larger proofs
  • Larger values: Slower proving, smaller proofs

Recommendation: Experiment with different packing sizes (4096, 8192, 16384) to find the optimal trade-off for your application.

Example

# Test with different packing sizes
./webgpu_prover '{"program": "app.wasm", "packing": 4096, "args": []}'
./webgpu_prover '{"program": "app.wasm", "packing": 8192, "args": []}'
./webgpu_prover '{"program": "app.wasm", "packing": 16384, "args": []}'

Compare proof generation time and proof size for each configuration.
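
The comparison can be scripted. The following is a minimal sketch, assuming the prover binary lives at ./webgpu_prover (overridable through a hypothetical PROVER variable) and accepts the JSON argument shown above. The helper names make_config and sweep_packing are illustrative and not part of the prover. The script records wall-clock time only, because the location of the generated proof is not specified here; check proof size using whatever output your setup produces.

```shell
#!/bin/sh
# Sketch: sweep packing sizes and record wall-clock time per run.
# Assumption: the prover is at ./webgpu_prover unless PROVER is set.
PROVER="${PROVER:-./webgpu_prover}"

# Build the JSON config for a given program and packing size.
make_config() {
  printf '{"program": "%s", "packing": %d, "args": []}' "$1" "$2"
}

# Run the prover once per packing size; save its output to a log file
# and print the exit status and elapsed seconds for each run.
sweep_packing() {
  local program="$1"
  shift
  local packing start end status
  for packing in "$@"; do
    start=$(date +%s)
    if "$PROVER" "$(make_config "$program" "$packing")" > "prover_${packing}.log" 2>&1; then
      status=ok
    else
      status=failed
    fi
    end=$(date +%s)
    echo "packing=${packing} status=${status} elapsed=$((end - start))s"
  done
}

# Only run the sweep when the prover binary is actually present.
if [ -x "$PROVER" ]; then
  sweep_packing app.wasm 4096 8192 16384
fi
```

Each run's output is kept in prover_<packing>.log, so any metrics the prover prints can be compared afterwards alongside the elapsed times.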

GPU Threads

The gpu-threads parameter controls parallelism:

  • Can exceed physical GPU core count
  • Higher values may improve performance on high-end GPUs
  • Start with the default (equal to the packing size) and adjust from there

Example

# Custom GPU thread count
./webgpu_prover '{
"program": "app.wasm",
"packing": 8192,
"gpu-threads": 16384,
"args": []
}'
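
One way to follow the "start with the default and adjust" advice is to try thread counts that double from the packing size. The sketch below only prints the commands to run. The doubling schedule and the helper names gpu_thread_candidates and make_gpu_config are assumptions for illustration, not prover features.

```shell
#!/bin/sh
# Sketch: list candidate gpu-threads values, starting at the default
# (the packing size) and doubling a fixed number of times.
gpu_thread_candidates() {
  local n="$1"       # starting value, i.e. the packing size
  local steps="$2"   # how many candidates to emit
  local i=0
  while [ "$i" -lt "$steps" ]; do
    echo "$n"
    n=$((n * 2))
    i=$((i + 1))
  done
}

# Build the JSON config with an explicit gpu-threads value.
make_gpu_config() {
  printf '{"program": "%s", "packing": %d, "gpu-threads": %d, "args": []}' "$1" "$2" "$3"
}

# Dry run: print one prover command per candidate thread count.
for threads in $(gpu_thread_candidates 8192 3); do
  echo "./webgpu_prover '$(make_gpu_config app.wasm 8192 "$threads")'"
done
```

Run each printed command under the same conditions and keep the thread count with the shortest proving time. Whether values above the packing size help depends on the GPU.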

Constraint Optimization

Use Checked Functions Sparingly

Checked functions (e.g., bn254fr_addmod_checked) automatically add constraints. Use them only when necessary:

// Good: Manual constraint when needed
bn254fr_addmod(sum, a, b);
// ... more operations ...
bn254fr_assert_add(sum, a, b); // Add constraint at the end

// Less optimal: Immediate constraint for every operation
bn254fr_addmod_checked(sum, a, b); // Adds constraint immediately

Batch Operations

Process multiple values together when possible:

// Less optimal: Process one at a time
for (int i = 0; i < 1000; i++) {
bn254fr_mulmod(results[i], inputs[i], constant);
}

// Better: Use vectorized operations
vbn254fr_t vec_inputs, vec_results;
vbn254fr_alloc(vec_inputs, 1000);
vbn254fr_alloc(vec_results, 1000);
// ... populate vec_inputs ...
vbn254fr_mulmod_constant(vec_results, vec_inputs, &constant);

Minimize Constraint Depth

Keep your constraint graph shallow:

// Deep: Many sequential dependencies
bn254fr_mulmod(t1, a, b);
bn254fr_mulmod(t2, t1, c);
bn254fr_mulmod(t3, t2, d);
bn254fr_mulmod(result, t3, e);

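// Shallow: Independent operations can proceed in parallel
bn254fr_mulmod(t1, a, b);           // level 1
bn254fr_mulmod(t2, c, d);           // level 1, independent of t1
bn254fr_mulmod(result, t1, t2);     // level 2
bn254fr_mulmod(result, result, e);  // level 3 (the chain above needs 4)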

Memory Management

Free field elements as soon as possible:

// Good: Free when done
{
bn254fr_t temp;
bn254fr_alloc(temp);
// ... use temp ...
bn254fr_free(temp);
} // temp was already freed by bn254fr_free above; the block only limits its scope

// Less optimal: Keep allocated longer than needed
bn254fr_t temp;
bn254fr_alloc(temp);
// ... use temp ...
// ... lots of other code ...
bn254fr_free(temp); // Freed much later

Profiling

Monitor proof generation time and identify bottlenecks:

time ./webgpu_prover '{
"program": "app.wasm",
"args": []
}'

The prover outputs performance metrics, including:

  • Constraint generation time
  • FFT computation time
  • Total proof generation time