Useful Tips

Resource estimation

The BOPs generated by this framework can be used as a good estimator of on-chip resource consumption when the following conditions are met:

  • latency strategy is used.

  • reuse_factor is set to 1.

  • parallel_factor is set to match the number of times the convolution kernel is applied (i.e., everything is computed in parallel).
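The conditions above can be sketched as an hls4ml-style configuration dictionary. This is a minimal sketch: the key names follow hls4ml's config conventions, but the layer name `conv2d` and the kernel-application count are placeholders for your own model, so check them against your hls4ml version.

```python
def bop_friendly_config(n_kernel_applications: int) -> dict:
    """Build a config dict under which BOPs track resource usage.

    `n_kernel_applications` is how many times the convolution kernel
    is applied (output height x width for a Conv2D layer).
    """
    return {
        "Model": {
            "Strategy": "Latency",   # latency strategy
            "ReuseFactor": 1,        # fully parallel multipliers
        },
        # Per-layer override: run every kernel application in parallel.
        # "conv2d" is an example layer name from your own model.
        "LayerName": {
            "conv2d": {"ParallelizationFactor": n_kernel_applications},
        },
    }

config = bop_friendly_config(n_kernel_applications=64)
```

With all three conditions satisfied, the fully unrolled design makes each binary operation map to dedicated hardware, which is why BOPs become a meaningful resource proxy.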

If io_parallel is used, resource consumption can be estimated as a linear combination of LUTs and DSPs: \(\mathrm{LUTs}+55\cdot\mathrm{DSPs}\sim\mathrm{BOPs}\)

The factor of 55 in front of DSPs is rough, but the resulting order-of-magnitude estimate is still useful.
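The comparison against a synthesis report can be written out directly. The function names below are illustrative; only the \(\mathrm{LUTs}+55\cdot\mathrm{DSPs}\) weighting comes from the text.

```python
import math


def bops_resource_estimate(luts: int, dsps: int) -> int:
    """BOPs-equivalent cost of a synthesized design.

    Each DSP is weighted as roughly 55 LUTs; only the order of
    magnitude of the result is meaningful.
    """
    return luts + 55 * dsps


def same_order_of_magnitude(bops: float, luts: int, dsps: int) -> bool:
    """True if the framework's BOPs agrees with the synthesis report
    to within one order of magnitude."""
    return abs(math.log10(bops_resource_estimate(luts, dsps) / bops)) < 1.0
```

For example, a design reported at 1000 LUTs and 10 DSPs has a BOPs-equivalent cost of 1550, consistent with a predicted BOPs of 2000.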

If io_stream is used, you will also need to account for the resources used by FIFOs, which cannot be directly estimated from BOPs and depend on the specific implementation (e.g., ShiftRegister vs. BRAM).

Regarding #pragma HLS DATAFLOW in vivado/vitis

If you are using io_parallel, meet the above conditions, and have a convolution layer in your network, you may observe much larger resource consumption than expected, together with terrible latency. In this case, try changing the #pragma HLS DATAFLOW to #pragma HLS PIPELINE, or simply remove it, and re-synthesize the code.
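One way to apply this change without hand-editing is to patch the generated source in place. This is a sketch; the project path in the usage comment is an example and depends on your hls4ml output directory.

```python
import re


def dataflow_to_pipeline(src: str) -> str:
    """Replace '#pragma HLS DATAFLOW' with '#pragma HLS PIPELINE',
    tolerating extra whitespace and preserving indentation."""
    return re.sub(r"#pragma\s+HLS\s+DATAFLOW", "#pragma HLS PIPELINE", src)


# Example usage (adjust the path to your generated project):
# from pathlib import Path
# cpp = Path("my-hls-test/firmware/myproject.cpp")
# cpp.write_text(dataflow_to_pipeline(cpp.read_text()))
```

Re-run synthesis after the patch; the pragma is re-emitted if you regenerate the project, so the patch must be reapplied after each conversion.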

Regarding #pragma HLS INLINE RECURSIVE in vivado

If you are using io_parallel and the latency strategy with vivado_hls, you may try adding #pragma HLS INLINE RECURSIVE to your top function. This may reduce resource consumption for some networks, in many cases by \(\sim10\)%, while latency may or may not improve.
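Inserting the pragma can also be scripted. The sketch below assumes the top function's opening brace follows its signature, as in hls4ml-generated code; the function name is whatever your project uses.

```python
def add_inline_recursive(src: str, top_func: str) -> str:
    """Insert '#pragma HLS INLINE RECURSIVE' right after the opening
    brace of the named top function."""
    start = src.index(f"void {top_func}(")       # locate the signature
    brace = src.index("{", start)                # its opening brace
    pragma = "\n    #pragma HLS INLINE RECURSIVE"
    return src[: brace + 1] + pragma + src[brace + 1 :]
```

As with the DATAFLOW patch above, regenerating the project discards the edit, so apply it as a post-conversion step.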

When using intra-layer heterogeneous quantization

If using the latency strategy, it is recommended to use intra-layer heterogeneous weight quantization.

For intra-layer heterogeneous activation quantization, you may enable it if you are using io_parallel with the latency strategy. For some networks, however, this can lead to severe overfitting, and the resource reduction is not as significant as for its weight counterpart.

When using only inter-layer heterogeneous quantization

It is recommended to disable intra-layer heterogeneous weight quantization if and only if the model is planned to be deployed with the resource strategy in hls4ml. When intra-layer heterogeneous quantization is not enabled, this is equivalent to optimizing bitwidths with approximated gradients. The resulting resource consumption may be better or worse than the AutoQKeras counterpart.

When doing this, it is strongly recommended to use only L1 and/or L2 regularization on weights and activations (i.e., set beta=0), as the BOPs estimated at training time is neither accurate nor relevant in this setting.
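The intended loss structure can be sketched in plain Python. This is a hypothetical illustration of how the terms combine, not the framework's actual training loop; `l1`, `l2`, and `beta` mirror the regularization strengths discussed above.

```python
def l1_l2_penalty(weights, l1=1e-5, l2=1e-5):
    """Standard L1/L2 regularization term added to the task loss."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def total_loss(task_loss, weights, bops_estimate, beta=0.0):
    """Total training loss: task loss + L1/L2 penalty + beta * BOPs.

    With beta=0 the (inaccurate) training-time BOPs term is disabled,
    leaving only the L1/L2 regularization, as recommended above.
    """
    return task_loss + l1_l2_penalty(weights) + beta * bops_estimate
```

With `beta=0`, even a wildly wrong BOPs estimate has no effect on training, which is the point of the recommendation.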