Run these advanced AI models right now on your Rock 5 board with NPU acceleration!

Introduction:

I am thrilled to share a suite of powerful models I’ve optimized specifically for the RK3588 platform. These models leverage cutting-edge technologies to achieve state-of-the-art performance across multiple domains like natural language processing (NLP), computer vision, and speech synthesis, all while ensuring efficient operation on resource-constrained edge devices. Below is an overview of the models and their applications.

1. MiniCPM-V-2_6-rkllm

  • Purpose: A compact multi-modal language model optimized for visual and textual understanding.

  • Features: Supports single-image reasoning tasks, achieving GPT-4V-level performance with only 8 billion parameters. It excels at image analysis and understanding tasks, making it ideal for complex multi-modal applications like surveillance analytics or video-based Q&A.

  • Speed: ~6s per image. (All speeds listed are for a single NPU core; the RK3588 has 3 NPU cores, so you can run any combination of three of these models at the same time.)

  • Memory: ~9.7GB.

Link: https://huggingface.co/happyme531/MiniCPM-V-2_6-rkllm

2. Qwen2-Audio-rkllm

  • Purpose: Audio-based language model capable of tasks like speech recognition and audio-text comprehension.

  • Applications: Suitable for audio transcription, smart assistants, and voice-controlled devices.

  • Speed: ~16s per audio clip.

  • Memory: ~11.6GB.

Link: https://huggingface.co/happyme531/Qwen2-Audio-rkllm

3. wd-convnext-tagger-v3-RKNN2

  • Purpose: The wd14 tagger commonly used with Stable Diffusion, ported to RKNN. Generates tags for your image library for quick search or classification.

  • Applications: Useful in automated quality control, retail item tagging, and advanced image labeling systems.

  • Speed: ~0.3s per image.

  • Memory: ~0.45GB.

Link: https://huggingface.co/happyme531/wd-convnext-tagger-v3-RKNN2

4. Stable-Diffusion-1.5-LCM-ONNX-RKNN2

  • Purpose: AI-driven image generation optimized for the RK3588 platform.

  • Features: This version supports high-quality image synthesis while being resource-efficient, ideal for creative industries and gaming.

  • Speed: ~16s per 512x512 image.

  • Memory: ~5.6GB.

Link: https://huggingface.co/happyme531/Stable-Diffusion-1.5-LCM-ONNX-RKNN2

5. Segment-Anything-2.1-RKNN2

  • Purpose: Cutting-edge image segmentation model that can isolate and identify objects in images.

  • Applications: Perfect for medical imaging, autonomous driving, and AR applications, or just cut out anything in your image.

  • Speed: ~3s per image.

  • Memory: ~0.95GB.

Link: https://huggingface.co/happyme531/Segment-Anything-2.1-RKNN2

6. SenseVoiceSmall-RKNN2

  • Purpose: SenseVoice is an audio foundation model with audio understanding capabilities, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Acoustic Event Classification (AEC) or Acoustic Event Detection (AED).

  • Applications: SenseVoice-small supports multilingual speech recognition, emotion recognition, and event detection for Chinese, Cantonese, English, Japanese, and Korean, with extremely low inference latency.

  • Speed: Recognizes ~20 seconds of audio per second.

  • Memory: ~1.1GB.

Link: https://huggingface.co/happyme531/SenseVoiceSmall-RKNN2

7. Florence-2-base-ft-ONNX-RKNN2

  • Purpose: Multi-modal foundation model with exceptional performance in vision-language tasks.

  • Applications: Excels in generating captions, object descriptions, and other vision-language integrations for accessibility solutions.

  • Speed: ~4.5s per image.

  • Memory: ~2GB.

Link: https://huggingface.co/happyme531/Florence-2-base-ft-ONNX-RKNN2

8. Bert-VITS2-RKNN2

  • Purpose: An NLP-based text-to-speech model.

  • Features: Provides expressive and nuanced speech synthesis, suitable for audiobooks, chatbots, and virtual assistants.

  • Speed: Generates ~3 seconds of audio per second.

  • Memory: ~2.3GB.

Link: https://huggingface.co/happyme531/Bert-VITS2-RKNN2

Usage:

There are documents and example scripts available on each of the model pages to help you get started with these models. All models have been pre-converted to the RKNN format for easy deployment on the RK3588 platform. You can use these models for a wide range of applications, from smart devices to creative projects, with the assurance of high performance and efficiency.

Conclusion:

These models represent a leap forward in edge AI, optimized for the RK3588 to balance high performance and efficiency. Whether you’re developing next-gen smart devices or working on creative applications, this suite provides a robust foundation. I’d love to hear your thoughts or discuss collaboration opportunities in bringing these models to real-world applications!



Thanks from the community! In theory, would there be any issue attempting to use these on the RK3566 (Zero 3W/E)? There are some images that unlock the NPU.

Only the RK3588 and RK3576 are supported by rknn-llm.

The RK3566 won't ever be supported, as it only has a 1 TOPS NPU and doesn't have the power to run LLMs.