TensorFlow Lite for Microcontrollers on Arm(R) Ethos(TM)-U55, U65, and U85 NPUs
Overview
This sample application demonstrates how to run an inference using the TensorFlow Lite for Microcontrollers (TFLM) framework and the Arm Ethos-U NPU.
The sample runs a model downloaded from the Arm model zoo and then optimized with the Vela compiler.
Vela takes a tflite file as input and produces another tflite file as output, where the operators supported by Arm Ethos-U NPU have been replaced by an Ethos-U custom operator. In an ideal case the complete network would be replaced by a single Ethos-U custom operator.
Generating Vela-compiled model
Follow the steps below to generate a Vela-compiled model and the test input/output data. This sample uses the keyword_spotting_cnn_small_int8 model:
Note
The default Vela-compiled model targets the Arm Ethos-U55 NPU with 128 MACs on the MPS3 target. Because a single model can add hundreds of KB, do not add models for other targets to the code base.
Downloading the files below from keyword_spotting_cnn_small_int8:
cnn_s_quantized.tflite
testing_input/input/0.npy
testing_output/identity/0.npy
Optimizing the model for Arm Ethos-U NPU using Vela
Assuming the target is an Arm Ethos-U55 NPU with 128 MACs:
$ vela cnn_s_quantized.tflite \
    --output-dir . \
    --accelerator-config ethos-u55-128 \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram
Removing unnecessary header
testing_input/input/0.npy and testing_output/identity/0.npy each have a 128-byte header, which must be removed for integration with this sample.
$ dd if=testing_input/input/0.npy of=testing_input/input/0_no-header.npy bs=1 skip=128
$ dd if=testing_output/identity/0.npy of=testing_output/identity/0_no-header.npy bs=1 skip=128
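The dd commands above hard-code a 128-byte header. As a hedged alternative, the sketch below reads the header length from the .npy preamble instead of assuming it. It relies on the NPY file layout (a 6-byte magic string, two version bytes, then a little-endian header length). The function names and the synthetic demo file are illustrative assumptions and are not part of this sample.

```python
import struct

NPY_MAGIC = b"\x93NUMPY"


def npy_payload_offset(data: bytes) -> int:
    """Return the byte offset at which the raw array data starts."""
    if data[:6] != NPY_MAGIC:
        raise ValueError("not a .npy file: bad magic string")
    major = data[6]
    if major == 1:
        # Version 1.0 stores the header length as a 2-byte little-endian value.
        (header_len,) = struct.unpack_from("<H", data, 8)
        return 10 + header_len
    # Versions 2.0 and 3.0 store it as a 4-byte little-endian value.
    (header_len,) = struct.unpack_from("<I", data, 8)
    return 12 + header_len


def strip_npy_header(data: bytes) -> bytes:
    """Return only the raw array bytes, dropping the .npy header."""
    return data[npy_payload_offset(data):]


def make_demo_npy(payload: bytes) -> bytes:
    """Build a synthetic version 1.0 .npy image (demo data only)."""
    header = b"{'descr': '|i1', 'fortran_order': False, 'shape': (%d,), }" % len(payload)
    # Pad so that the 10-byte preamble plus the header totals 128 bytes.
    header = header.ljust(117) + b"\n"
    return NPY_MAGIC + b"\x01\x00" + struct.pack("<H", len(header)) + header + payload


demo = make_demo_npy(bytes([1, 2, 3, 4]))
print(npy_payload_offset(demo))  # 128 for this synthetic file
print(strip_npy_header(demo) == bytes([1, 2, 3, 4]))  # True
```

To apply this to the downloaded files, one would read each file as bytes, call strip_npy_header(), and write the result to the corresponding 0_no-header.npy path. For the two files named above, the result should match the dd output, given the 128-byte header the text states.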
Converting to C array
$ xxd -c 16 -i cnn_s_quantized.tflite cnn_s_quantized.tflite.h
$ xxd -c 16 -i cnn_s_quantized_vela.tflite cnn_s_quantized_vela.tflite.h
$ xxd -c 16 -i testing_input/input/0_no-header.npy testing_input/input/0_no-header.npy.h
$ xxd -c 16 -i testing_output/identity/0_no-header.npy testing_output/identity/0_no-header.npy.h
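To make the shape of the xxd -i output concrete, the following minimal sketch emits a comparable C array from a byte string, 16 bytes per row. It is not a replacement for xxd. The function name and the symbol name demo_blob are hypothetical, and the symbol names this sample expects in its headers are not stated here.

```python
def to_c_array(data: bytes, symbol: str, per_line: int = 16) -> str:
    """Render bytes as a C unsigned char array plus a length variable."""
    rows = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        rows.append("  " + ", ".join(f"0x{byte:02x}" for byte in chunk))
    body = ",\n".join(rows)
    return (
        f"unsigned char {symbol}[] = {{\n{body}\n}};\n"
        f"unsigned int {symbol}_len = {len(data)};\n"
    )


# Demo input: 18 bytes, so the array spans two rows (16 + 2).
print(to_c_array(bytes(range(18)), "demo_blob"))
```

The sketch only mirrors the general layout (hex bytes in rows, followed by a length variable), which is enough to see why a model of a few hundred KB turns into a large header file.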
Synchronizing to this sample
Synchronize the files below to the keyword_spotting_cnn_small_int8 directory in this sample:
cnn_s_quantized_vela.tflite.h > model.h
testing_input/input/0_no-header.npy.h > input.h
testing_output/identity/0_no-header.npy.h > output.h
Note
To run the non-Vela-compiled model (CONFIG_TAINT_BLOBS_TFLM_ETHOSU=n), synchronize cnn_s_quantized.tflite.h instead.
Building and running
Add the tflite-micro module to your West manifest and pull it:
west config manifest.project-filter -- +tflite-micro
west update
This application can be built and run on any Arm Ethos-U NPU capable platform, such as Corstone(TM)-300 or Corstone-320.
Run target prerequisites
When using the CMake run target (-t run), set the FVP binary path for
your platform and, for Arm Ethos-U85 NPU, pass the NPU configuration to the CLI:
# Arm Ethos-U55/U65 NPU (Corstone-300 FVP)
export ARMFVP_BIN_PATH=/path/to/FVP_Corstone_SSE-300_<ver>/models/Linux64_GCC-<gcc>/
# Arm Ethos-U85 (Corstone-320 FVP)
export ARMFVP_BIN_PATH=/path/to/FVP_Corstone_SSE-320_<ver>/models/Linux64_GCC-<gcc>/
export ARMFVP_EXTRA_FLAGS='-C;mps4_board.subsystem.ethosu.num_macs=2048' # match chosen ETHOS_U85_* Kconfig
Build and run on the FVP
# From the root of the zephyr repository
west build -b mps3/corstone300/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SHARED_SRAM=y \
    -DCONFIG_ETHOS_U55_128=y
west build -t run
Vela memory modes and overlays
Vela supports several Arm Ethos NPU memory placement modes. This sample reproduces
those at build time using CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_*:
CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SHARED_SRAM
CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_DEDICATED_SRAM
CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SRAM_ONLY
Board overlays
Use overlays to place TensorFlow Lite buffers in matching memory regions:
boards/mps3_corstone300_fvp.sram_only.overlay
boards/mps4_corstone320_fvp.sram_only.overlay
boards/mps3_corstone300_fvp.dedicated_sram.overlay
boards/mps4_corstone320_fvp.dedicated_sram.overlay
Dedicated_Sram additionally requires an ETHOSU_FAST region declared in
the overlay.
Example builds
Shared_Sram (default):
# From the root of the zephyr repository
west build -b mps3/corstone300/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SHARED_SRAM=y \
    -DCONFIG_ETHOS_U55_128=y
west build -t run
Sram_Only (Arm Ethos-U65 NPU on Corstone-300):
# From the root of the zephyr repository
west build -b mps3/corstone300/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SRAM_ONLY=y \
    -DCONFIG_ETHOS_U65_256=y \
    -DDTC_OVERLAY_FILE=boards/mps3_corstone300_fvp.sram_only.overlay
west build -t run
Dedicated_Sram (Arm Ethos-U85 NPU on Corstone-320):
# From the root of the zephyr repository
west build -b mps4/corstone320/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_DEDICATED_SRAM=y \
    -DCONFIG_ETHOS_U85_2048=y \
    -DDTC_OVERLAY_FILE=boards/mps4_corstone320_fvp.dedicated_sram.overlay
west build -t run
Note
Dedicated_Sram is not supported on Arm Ethos-U55 NPU.
Ensure the model’s Vela mode matches the selected build mode.
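The example builds above share one pattern: a board, a memory-mode Kconfig option, an NPU configuration option, and optional overlays. The sketch below assembles that argument list and rejects the one combination the note rules out. The helper name west_build_args and its parameters are assumptions made for illustration. Only the option names come from the commands above.

```python
import shlex

SAMPLE_PATH = "samples/modules/tflite-micro/tflm_ethosu"


def west_build_args(board, mem_mode, npu_config, overlays=()):
    """Compose a west build argument list for this sample.

    mem_mode is the suffix of CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_*,
    npu_config is a Kconfig symbol without the CONFIG_ prefix.
    """
    if mem_mode == "DEDICATED_SRAM" and npu_config.startswith("ETHOS_U55"):
        # Stated in the note above: no Dedicated_Sram on Arm Ethos-U55 NPU.
        raise ValueError("Dedicated_Sram is not supported on Arm Ethos-U55 NPU")
    args = [
        "west", "build", "-b", board, SAMPLE_PATH, "--",
        f"-DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_{mem_mode}=y",
        f"-DCONFIG_{npu_config}=y",
    ]
    if overlays:
        # Multiple overlays are passed as one semicolon-separated value.
        args.append("-DDTC_OVERLAY_FILE=" + ";".join(overlays))
    return args


args = west_build_args(
    "mps4/corstone320/fvp",
    "DEDICATED_SRAM",
    "ETHOS_U85_2048",
    ["boards/mps4_corstone320_fvp.dedicated_sram.overlay"],
)
print(shlex.join(args))
```

The list form can be passed directly to subprocess.run() without shell quoting concerns. shlex.join() is used here only to print a copyable command line.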
Memory regions
The Arm Ethos-U NPU command streams address memory regions using region indices (0-3),
which correspond to REGIONCFG values in the NPU configuration.
Vela assigns these indices when compiling the model, and the HAL maps
each index to a physical AXI interface or MEM_ATTR entry depending on the NPU.
The mapping used in this sample is defined by the selected
CONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_* option. The table below shows the
region index assignments implemented by this sample’s CMake and Kconfig logic:
| Mode | weights | scratch | fast_scratch | cmd_stream |
|---|---|---|---|---|
| Shared_Sram | 3 | 0 | 0 | 3 |
| Dedicated_Sram | 3 | 3 | 0 | 3 |
| Sram_Only | 0 | 0 | 0 | 0 |
These index values are applied through zephyr_compile_definitions() as
NPU_REGIONCFG_n and NPU_QCONFIG macros during the build, ensuring the
HAL and driver use a consistent region configuration:
NPU_REGIONCFG_0: weights
NPU_REGIONCFG_1: scratch
NPU_REGIONCFG_2: fast scratch
NPU_QCONFIG: command stream region
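As a compact restatement of the table and macro list above, the sketch below looks up the region indices for a mode and pairs each with its macro name. It is an illustration only. The dictionary and function names are assumptions, and the real build may encode the macro values differently from the plain indices shown here.

```python
# Region indices per Vela memory mode, copied from the table above.
REGION_INDICES = {
    "Shared_Sram": {"weights": 3, "scratch": 0, "fast_scratch": 0, "cmd_stream": 3},
    "Dedicated_Sram": {"weights": 3, "scratch": 3, "fast_scratch": 0, "cmd_stream": 3},
    "Sram_Only": {"weights": 0, "scratch": 0, "fast_scratch": 0, "cmd_stream": 0},
}

# Macro that carries each role's region index, as listed above.
MACRO_FOR_ROLE = {
    "weights": "NPU_REGIONCFG_0",
    "scratch": "NPU_REGIONCFG_1",
    "fast_scratch": "NPU_REGIONCFG_2",
    "cmd_stream": "NPU_QCONFIG",
}


def region_definitions(mode: str) -> dict:
    """Return {macro name: region index} for one memory mode."""
    return {MACRO_FOR_ROLE[role]: index for role, index in REGION_INDICES[mode].items()}


for mode in REGION_INDICES:
    print(mode, region_definitions(mode))
```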
These indices map to hardware interfaces as follows:
Architectural mapping
Arm Ethos-U55/U65 NPU: Regions 0-1 route to AXI0 (on-chip SRAM), 2-3 to AXI1 (external memory). This mapping is fixed by the architecture.
Arm Ethos-U85 NPU: Regions 0-3 select programmable MEM_ATTR[0-3] entries. By default, the HAL configures MEM_ATTR[0-1] to map to SRAM and MEM_ATTR[2-3] to external DDR. These defaults can be overridden if the system uses a custom memory topology.
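The two mappings above can be written as small lookup functions. This is a sketch of the rules as described in this section, not HAL code, and the function names are hypothetical. For the Arm Ethos-U85 NPU it models only the stated defaults, which a system may override.

```python
def u55_u65_interface(region: int) -> str:
    """Arm Ethos-U55/U65: regions 0-1 use AXI0, regions 2-3 use AXI1 (fixed)."""
    if region not in (0, 1, 2, 3):
        raise ValueError(f"region index must be 0-3, got {region}")
    return "AXI0" if region < 2 else "AXI1"


# Default targets of MEM_ATTR[0-3] on Arm Ethos-U85, per the text above.
U85_DEFAULT_MEMORY = {0: "SRAM", 1: "SRAM", 2: "DDR", 3: "DDR"}


def u85_default_memory(region: int) -> str:
    """Arm Ethos-U85: region n selects MEM_ATTR[n]; return its default target."""
    if region not in U85_DEFAULT_MEMORY:
        raise ValueError(f"region index must be 0-3, got {region}")
    return U85_DEFAULT_MEMORY[region]


# Shared_Sram uses region 3 for weights and region 0 for scratch.
print(u55_u65_interface(3), u55_u65_interface(0))    # AXI1 AXI0
print(u85_default_memory(3), u85_default_memory(0))  # DDR SRAM
```

Read together with the region index table, this shows why Shared_Sram keeps weights in external memory and scratch in on-chip SRAM, while Sram_Only places everything behind region 0.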
When using Dedicated_Sram, ensure an ETHOSU_FAST node exists in the
devicetree overlay to provide a valid region for the NPU’s fast_scratch area.
Configuring PMU Events via CMake
If CONFIG_SAMPLE_TFLM_ETHOSU_PMU is enabled, the sample reports
per-inference PMU counters. You can override which events are counted by setting
ETHOSU_PMU_EVENT_0..3 (and 4-7 if supported).
# From the root of the zephyr repository
west build -b mps3/corstone300/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SHARED_SRAM=y \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_PMU=y \
    -DETHOSU_PMU_EVENT_0=ETHOSU_PMU_AXI0_RD_DATA_BEAT_RECEIVED \
    -DETHOSU_PMU_EVENT_1=ETHOSU_PMU_AXI1_RD_DATA_BEAT_RECEIVED
west build -t run
PMU report format
When enabled, the sample prints a short PMU report after each inference:
Ethos-U PMU report:
ethosu_pmu_cycle_cntr : 134869
ethosu_pmu_cntr0 : 133577
ethosu_pmu_cntr1 : 0
ethosu_pmu_cntr2 : 111744
ethosu_pmu_cntr3 : 0
# (if supported, counters 4..7 are also printed.)
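When collecting PMU numbers across several runs, it can help to turn the report into structured data. The following minimal sketch parses the report format shown above from captured console text. The function name is an assumption, and the parser relies only on the "name : value" layout of the sample output.

```python
import re

# Matches lines such as "ethosu_pmu_cntr0 : 133577".
PMU_LINE = re.compile(r"^\s*(ethosu_pmu_\w+)\s*:\s*(\d+)\s*$")


def parse_pmu_report(text: str) -> dict:
    """Return {counter name: value} for every PMU line found in text."""
    counters = {}
    for line in text.splitlines():
        match = PMU_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


SAMPLE_REPORT = """\
Ethos-U PMU report:
ethosu_pmu_cycle_cntr : 134869
ethosu_pmu_cntr0 : 133577
ethosu_pmu_cntr1 : 0
ethosu_pmu_cntr2 : 111744
ethosu_pmu_cntr3 : 0
"""

counters = parse_pmu_report(SAMPLE_REPORT)
print(counters["ethosu_pmu_cycle_cntr"])  # 134869
```

Lines that do not match the pattern, such as the report title or other log output, are skipped, so the function can be fed a full console capture.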
Timing Adapters (TA)
Timing Adapters (TA) allow simulation of memory bandwidth and latency limits on
supported platforms (FVP or FPGA). Apply TA overlays explicitly with
-DDTC_OVERLAY_FILE=.... TA overlays can be combined with any Vela memory
mode to explore timing effects.
For complete TA documentation, see the Evaluation Kit docs.
TA overlay with Sram_Only (Corstone-320)
# From the root of the zephyr repository
west build -b mps4/corstone320/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SRAM_ONLY=y \
    -DCONFIG_ETHOS_U85_2048=y \
    -DDTC_OVERLAY_FILE="boards/mps4_corstone320_fvp.ta.overlay;boards/mps4_corstone320_fvp.sram_only.overlay"
west build -t run
TA overlay only (Corstone-300)
# From the root of the zephyr repository
west build -b mps3/corstone300/fvp samples/modules/tflite-micro/tflm_ethosu -- \
    -DCONFIG_SAMPLE_TFLM_ETHOSU_MEM_MODE_SHARED_SRAM=y \
    -DCONFIG_ETHOS_U55_128=y \
    -DDTC_OVERLAY_FILE=boards/mps3_corstone300_fvp.ta.overlay
west build -t run
Trademarks
Arm, Ethos, and Corstone are registered trademarks or trademarks of Arm Limited (or its subsidiaries or affiliates) in the US and/or elsewhere.
TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.