Run these advanced AI models right now on your Rock 5 board with NPU acceleration!

Introduction:

I am thrilled to share a suite of powerful models I’ve optimized specifically for the RK3588 platform. These models leverage cutting-edge technologies to achieve state-of-the-art performance across multiple domains like natural language processing (NLP), computer vision, and speech synthesis, all while ensuring efficient operation on resource-constrained edge devices. Below is an overview of the models and their applications.

1. MiniCPM-V-2_6-rkllm

  • Purpose: A compact multi-modal language model optimized for visual and textual understanding.

  • Features: Supports single-image reasoning tasks, achieving GPT-4V-level performance with only 8 billion parameters. It excels at image analysis and understanding tasks, making it ideal for complex multi-modal applications like surveillance analytics or video-based Q&A.

  • Speed: ~6s per image. (All speeds listed are for a single NPU core; the RK3588 has 3 NPU cores, so you can run any combination of three of these models at the same time.)

  • Memory: ~9.7GB.

Link: https://huggingface.co/happyme531/MiniCPM-V-2_6-rkllm

2. Qwen2-Audio-rkllm

  • Purpose: Audio-based language model capable of tasks like speech recognition and audio-text comprehension.

  • Applications: Suitable for audio transcription, smart assistants, and voice-controlled devices.

  • Speed: ~16s per audio clip.

  • Memory: ~11.6GB.

Link: https://huggingface.co/happyme531/Qwen2-Audio-rkllm

3. wd-convnext-tagger-v3-RKNN2

  • Purpose: The wd14 tagger commonly used with Stable Diffusion, ported to RKNN. Generates tags for your image library for quick search or classification.

  • Applications: Useful in automated quality control, retail item tagging, and advanced image labeling systems.

  • Speed: ~0.3s per image.

  • Memory: ~0.45GB.

Link: https://huggingface.co/happyme531/wd-convnext-tagger-v3-RKNN2

4. Stable-Diffusion-1.5-LCM-ONNX-RKNN2

  • Purpose: AI-driven image generation optimized for the RK3588 platform.

  • Features: This version supports high-quality image synthesis while being resource-efficient, ideal for creative industries and gaming.

  • Speed: ~16s per 512x512 image.

  • Memory: ~5.6GB.

Link: https://huggingface.co/happyme531/Stable-Diffusion-1.5-LCM-ONNX-RKNN2

5. Segment-Anything-2.1-RKNN2

  • Purpose: Cutting-edge image segmentation model that can isolate and identify objects in images.

  • Applications: Perfect for medical imaging, autonomous driving, and AR applications, or just cut out anything in your image.

  • Speed: ~3s per image.

  • Memory: ~0.95GB.

Link: https://huggingface.co/happyme531/Segment-Anything-2.1-RKNN2

6. SenseVoiceSmall-RKNN2

  • Purpose: SenseVoice is an audio foundation model with audio understanding capabilities, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Acoustic Event Classification (AEC) or Acoustic Event Detection (AED).

  • Applications: SenseVoice-small supports multilingual speech recognition, emotion recognition, and event detection for Chinese, Cantonese, English, Japanese, and Korean, with extremely low inference latency.

  • Speed: Recognizes ~20 seconds of audio per second.

  • Memory: ~1.1GB.

Link: https://huggingface.co/happyme531/SenseVoiceSmall-RKNN2

7. Florence-2-base-ft-ONNX-RKNN2

  • Purpose: Multi-modal foundation model with exceptional performance in vision-language tasks.

  • Applications: Excels in generating captions, object descriptions, and other vision-language integrations for accessibility solutions.

  • Speed: ~4.5s per image.

  • Memory: ~2GB.

Link: https://huggingface.co/happyme531/Florence-2-base-ft-ONNX-RKNN2

8. Bert-VITS2-RKNN2

  • Purpose: An NLP-based text-to-speech model.

  • Features: Provides expressive and nuanced speech synthesis, suitable for audiobooks, chatbots, and virtual assistants.

  • Speed: Generates ~3 seconds of audio per second.

  • Memory: ~2.3GB.

Link: https://huggingface.co/happyme531/Bert-VITS2-RKNN2

Usage:

There are documents and example scripts available on each of the model pages to help you get started with these models. All models have been pre-converted to the RKNN format for easy deployment on the RK3588 platform. You can use these models for a wide range of applications, from smart devices to creative projects, with the assurance of high performance and efficiency.

Conclusion:

These models represent a leap forward in edge AI, optimized for the RK3588 to balance high performance and efficiency. Whether you’re developing next-gen smart devices or working on creative applications, this suite provides a robust foundation. I’d love to hear your thoughts or discuss collaboration opportunities in bringing these models to real-world applications!



Thanks from the community! In theory, would there be any issue attempting to use these on the RK3566 (Zero 3W/E)? There are some images that unlock the NPU.

Only the RK3588 and RK3576 are supported by rknn-llm.

The RK3566 won't ever be supported, as it only has a 1 TOPS NPU and doesn't have the power to run LLMs.