Making Machine Learning Work on a Microcontroller (Without Blowing Up RAM)
Let’s be honest: the words “machine learning” and “microcontroller” used to belong in separate conversations. One was for cloud engineers and data scientists. The other? For people doing serious embedded work—timing critical I/O, keeping power under control, fitting code in 128KB flash.
But lately, those lines have started to blur. Not because the buzzwords got louder, but because the tools actually got better.
We’re talking about real-time ML inference on bare-metal hardware—running on a microcontroller, using a quantized model, with tight latency and power budgets. This is where tinyML lives. And it’s not a research project anymore. Engineers are already deploying these systems—especially for simple classification tasks like anomaly detection or wake-word recognition.
Here’s how it works, why it’s worth your time, and what it takes to do it right.
What does “ML at the edge” even mean?
Let’s skip the slides and get practical. When we say “ML at the edge,” we mean running a trained machine learning model—typically a small neural network—directly on your embedded system without needing cloud access.
No server. No round-trip latency. No reliance on someone else’s API staying online.
The model could be doing things like:
- Detecting machine vibration anomalies from accelerometer data
- Identifying a spoken word from audio
- Classifying gestures from IMU data
- Distinguishing between normal and suspicious movement
But here’s the constraint: we’re doing this on a Cortex-M4, not an NVIDIA Jetson. That means:
- <100 MHz clock speeds
- 64KB–512KB RAM (often less)
- No floating-point, or maybe a single-precision FPU if you’re lucky
So, running a typical TensorFlow model trained on your desktop is out of the question—unless you strip it down. That’s where quantization comes in.
Quantization: Why your model needs to go on a diet
Quantization is the process of reducing the precision of your model weights and activations. Most ML frameworks train using 32-bit floating-point numbers (FP32), but you can often convert the model to use 8-bit integers (INT8) or even 4-bit in some cases.
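To make the arithmetic concrete, here is a minimal sketch of the affine (scale and zero-point) scheme that INT8 converters such as TensorFlow Lite use, where real_value ~= scale * (q - zero_point). The scale, zero point, and input value below are purely illustrative.

```cpp
// Minimal sketch of affine INT8 quantization: real_value ~= scale * (q - zero_point).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Quantize a float to INT8 with a given scale and zero point, saturating to [-128, 127].
int8_t quantize(float x, float scale, int zero_point) {
    const int q = static_cast<int>(std::lround(x / scale)) + zero_point;
    return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// Recover the approximate real value from its quantized form.
float dequantize(int8_t q, float scale, int zero_point) {
    return scale * (static_cast<int>(q) - zero_point);
}

int main() {
    // Illustrative parameters: map roughly [-1.0, +1.0] onto [-128, 127].
    const float scale = 2.0f / 255.0f;
    const int zero_point = 0;

    const float x = 0.3731f;
    const int8_t q = quantize(x, scale, zero_point);
    std::printf("%.4f -> %d -> %.4f\n", x, q, dequantize(q, scale, zero_point));
    return 0;
}
```

The gap between the input and the round-tripped value is the quantization error, and it is exactly that error, accumulated layer by layer, that eats your accuracy.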
Why do it?
- Lower memory footprint (smaller model, fewer intermediate buffers)
- Faster inference on CPU-bound systems (integer math is much faster than float)
- Lower power consumption (shorter compute cycles, less memory access)
But here’s the catch: quantizing your model can destroy its accuracy if you’re not careful, especially with time-series sensor data, which tends to be noisy and harder to generalize from.
To make this work, you’ll need to use quantization-aware training (QAT) or post-training quantization with calibration. And you have to validate the model’s performance on the actual device—not just in Python.
Tools like TensorFlow Lite for Microcontrollers (TFLM), Edge Impulse, and CMSIS-NN help with this. TFLM provides a runtime that can run fully quantized models on resource-constrained hardware. CMSIS-NN provides optimized kernels for inference on Arm Cortex-M CPUs.
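To give a rough picture of what the deployment side looks like, here is a stripped-down TFLM sketch in C++. The model array name (g_model_data), the op list, and the arena size are placeholders for your own model, and exact headers and constructor signatures shift between TFLM releases, so treat this as a shape rather than copy-paste code.

```cpp
// Stripped-down TFLM setup and inference sketch. g_model_data, the op list and
// the arena size are placeholders; exact headers/constructors vary by release.
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];   // .tflite flatbuffer compiled in (e.g. via xxd)

constexpr int kTensorArenaSize = 10 * 1024;  // holds all intermediate activations
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];

static tflite::MicroInterpreter* interpreter = nullptr;

bool classifier_setup() {
  const tflite::Model* model = tflite::GetModel(g_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) return false;

  // Register only the ops this model actually uses to keep flash usage down.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  static tflite::MicroInterpreter static_interpreter(model, resolver, tensor_arena,
                                                     kTensorArenaSize);
  interpreter = &static_interpreter;

  // Carves every tensor out of the arena; fails if the arena is too small.
  return interpreter->AllocateTensors() == kTfLiteOk;
}

// Feed one INT8 feature vector through the model; return the best class index.
int classifier_run(const int8_t* features, int feature_count) {
  TfLiteTensor* input = interpreter->input(0);
  for (int i = 0; i < feature_count; ++i) {
    input->data.int8[i] = features[i];
  }

  if (interpreter->Invoke() != kTfLiteOk) return -1;

  TfLiteTensor* output = interpreter->output(0);
  const int num_classes = output->dims->data[output->dims->size - 1];
  int best = 0;
  for (int i = 1; i < num_classes; ++i) {
    if (output->data.int8[i] > output->data.int8[best]) best = i;
  }
  return best;
}
```

The tensor arena is the biggest single RAM decision here; size it by measuring on the target rather than guessing.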
Real-World Example: Wake-word detection on an MCU
One application that’s gone from “nice idea” to “real deployment” is on-device keyword spotting—think “Hey Siri” or “Okay Google,” but running locally on a $2 MCU.
Here’s a simplified overview of how engineers are pulling it off:
- Hardware: STM32L4 or equivalent, running at 80–120 MHz with ~128KB RAM
- Sensor: PDM or analog MEMS microphone with front-end filter
- Model: Tiny convolutional neural network (CNN) trained on Mel-frequency cepstral coefficients (MFCCs) of the audio
- Model size: ~20KB weights, ~10KB activation buffer
- Runtime: TensorFlow Lite for Microcontrollers
- Power: <1 mA active, much lower when duty-cycled
- Latency: <100 ms response time
The system runs continuously, sampling audio, extracting MFCCs, and passing them to the model. When the model detects the target word, it triggers a higher-level event handler (e.g., wake up the rest of the system).
No Wi-Fi, no cloud latency, and no risk of a server deprecating your API next year.
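Structurally, that loop is simple. In the sketch below, every function name (the audio driver, the MFCC front end, the classifier wrapper, the event handler) is a hypothetical placeholder, and the frame and window sizes are illustrative; the shape of the loop is the part that carries over.

```cpp
// Sketch of the keyword-spotting loop. read_audio_frame(), compute_mfcc(),
// classify() and on_wake_word() are placeholder names for your own code.
#include <cstdint>

constexpr int kFrameSamples = 480;   // 30 ms of audio at 16 kHz (illustrative)
constexpr int kNumMfcc      = 10;    // MFCC coefficients per frame (illustrative)
constexpr int kNumFrames    = 49;    // frames in the feature window (illustrative)

extern bool   read_audio_frame(int16_t* samples, int count);                 // hypothetical PDM/DMA driver
extern void   compute_mfcc(const int16_t* samples, int count, int8_t* out);  // hypothetical MFCC front end
extern int    classify(const int8_t* features, int count);                   // e.g. a wrapper around the TFLM call
extern void   on_wake_word();                                                // application event handler

static int8_t  features[kNumFrames * kNumMfcc];  // rolling MFCC window fed to the model
static int16_t frame[kFrameSamples];

void keyword_loop() {
  for (;;) {
    if (!read_audio_frame(frame, kFrameSamples)) {
      continue;                                   // no new frame yet; could sleep here
    }

    // Shift the feature window left by one frame and append the new MFCCs.
    for (int i = 0; i < (kNumFrames - 1) * kNumMfcc; ++i) {
      features[i] = features[i + kNumMfcc];
    }
    compute_mfcc(frame, kFrameSamples, &features[(kNumFrames - 1) * kNumMfcc]);

    // Run the quantized model over the current window.
    if (classify(features, kNumFrames * kNumMfcc) == 1) {  // class 1 = target word (assumption)
      on_wake_word();                                      // wake the rest of the system
    }
  }
}
```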
When tinyML is worth it—and when it’s not
Where it shines:
- Event detection with clear features: Repetitive motor noise, pressure spikes, simple spoken words
- Power-constrained systems: IoT nodes, wearables, remote sensors
- When privacy matters: Processing audio or sensor data locally avoids cloud transmission
Where it struggles:
- Multiclass classification with subtle differences: Think image classification with many similar classes—it’s tough without more compute
- Anything requiring dynamic input shapes or attention mechanisms: You’re not running GPT-4 on a Cortex-M0
- Applications where explainability is critical: It’s harder to debug neural nets than a finite state machine
This isn’t a replacement for traditional signal processing—it’s a complement. If you can solve your problem with a bandpass filter and an envelope detector, do that. But if you’re dealing with a messier signal or need to classify complex patterns, ML might give you better results—and now it fits in your design.
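For comparison, here is what that "bandpass filter and envelope detector" baseline can look like. It is a sketch with illustrative coefficients (the bandpass follows the standard RBJ audio EQ cookbook formulas), not a tuned detector.

```cpp
// A classic non-ML baseline: bandpass filter -> rectify -> envelope follower
// -> threshold. Coefficients follow the RBJ "audio EQ cookbook" bandpass
// formulas; sample rate, center frequency and thresholds are illustrative.
#include <cmath>

struct Biquad {
  float b0, b1, b2, a1, a2;  // normalized coefficients (a0 folded in)
  float z1, z2;              // transposed direct-form II state

  static Biquad bandpass(float sample_rate_hz, float center_hz, float q) {
    const float w0 = 2.0f * 3.14159265f * center_hz / sample_rate_hz;
    const float alpha = std::sin(w0) / (2.0f * q);
    const float a0 = 1.0f + alpha;
    return {alpha / a0, 0.0f, -alpha / a0,
            -2.0f * std::cos(w0) / a0, (1.0f - alpha) / a0, 0.0f, 0.0f};
  }

  float process(float x) {
    const float y = b0 * x + z1;
    z1 = b1 * x - a1 * y + z2;
    z2 = b2 * x - a2 * y;
    return y;
  }
};

struct EnvelopeDetector {
  Biquad band;
  float env = 0.0f;

  // Returns true while the band-limited signal envelope exceeds the threshold.
  bool process(float sample, float threshold) {
    const float rectified = std::fabs(band.process(sample));
    const float coeff = (rectified > env) ? 0.2f : 0.01f;  // fast attack, slow release
    env += coeff * (rectified - env);
    return env > threshold;
  }
};
```

Something like `EnvelopeDetector det{Biquad::bandpass(16000.0f, 1000.0f, 2.0f)};` fed one sample at a time is often all an event detector needs, and it is trivially explainable.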
What engineers are doing behind the scenes
If you’re serious about deploying real ML models on embedded hardware, here are a few things you’ll need to dig into:
- Training with representative sensor data: Don’t train on clean lab data and expect it to work in a factory or a forest.
- Memory layout optimization: Quantized inference needs carefully sized and aligned buffers to avoid unaligned-access penalties and stalls (see the sketch after this list).
- Interrupt-safe inference: If you’re sampling data with interrupts, make sure your inference loop isn’t blocking or race-prone.
- Power profiling: Measure how inference affects battery life, especially if you’re running it every 100 ms.
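To ground the memory-layout and interrupt-safety points, here is the sketch referenced above: a statically allocated, aligned tensor arena plus a single-producer/single-consumer ring buffer that hands samples from a sampling ISR to the inference loop. The driver and inference functions are placeholder names and the sizes are illustrative.

```cpp
// Sketch of the memory-layout and interrupt-safety points above. The driver
// (adc_read_sample) and model wrapper (run_inference) are placeholder names.
#include <atomic>
#include <cstdint>

// Tensor arena: statically allocated (no heap) and aligned so the INT8 kernels
// can use word-wide loads. Passed to the interpreter at setup (not shown).
constexpr int kArenaSize = 32 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

// Single-producer / single-consumer ring buffer: only the ISR advances `head`
// and only the main loop advances `tail`, so no locking is required.
// (Overflow handling is omitted for brevity.)
constexpr uint32_t kRingSize = 1024;            // power of two, sized for worst-case latency
static int16_t ring[kRingSize];
static std::atomic<uint32_t> head{0};
static std::atomic<uint32_t> tail{0};

extern int16_t adc_read_sample();                        // hypothetical sampling driver
extern void run_inference(const int16_t* window, int n); // hypothetical model wrapper

extern "C" void ADC_IRQHandler() {
  // Keep the ISR short: store the sample and return.
  const uint32_t h = head.load(std::memory_order_relaxed);
  ring[h % kRingSize] = adc_read_sample();
  head.store(h + 1, std::memory_order_release);
}

void inference_loop() {
  constexpr int kWindow = 256;                  // samples per inference (illustrative)
  static int16_t window[kWindow];

  for (;;) {
    const uint32_t available = head.load(std::memory_order_acquire) -
                               tail.load(std::memory_order_relaxed);
    if (available < kWindow) {
      continue;                                 // or drop into a low-power wait here
    }

    const uint32_t t = tail.load(std::memory_order_relaxed);
    for (int i = 0; i < kWindow; ++i) {
      window[i] = ring[(t + i) % kRingSize];    // copy out so the ISR can keep writing
    }
    tail.store(t + kWindow, std::memory_order_release);

    run_inference(window, kWindow);             // model runs outside interrupt context
  }
}
```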
Also, watch out for version mismatches—TensorFlow updates can break compatibility with TFLM. Keep your build toolchain locked down and tested.
Want to build your own? Check out: