
# Tool List

## Project Introduction

This is my final project for AI1101, a course on artificial intelligence interaction technology. It took about 20 hours to complete. The main features are as follows:

- **Multimodal Interaction**: Through a web interface built with Gradio, users can ask questions by voice, text, or image. MIRA replies in text and speech, answers questions about mechanical arm operation, and supports multi-turn conversations with context memory.
- **LLM Instruction Understanding**: The multimodal input is preprocessed, then the Qwen LLM (qwen-plus) interprets it and calls the corresponding tool class to query the time, weather, or location, operate the local SQLite database, operate the mechanical arm, or call the QwenVL model for image understanding.
- **Rich Peripheral Feedback**: In addition to text replies in the web interface, MIRA plays speech through the speaker and communicates with a micro:bit over a serial port to provide visual and sound effects.
- **Qwen Agent Framework**: Function Calling and MCP are implemented with the Qwen Agent framework, which avoids long, complex, and unpredictable prompt writing and makes it more likely that the correct tool is called.

![example1](example_show/Screenshot_2025-05-24_21-07-06.png)

## Usage Examples

The following images show MIRA's responses to voice and text input.
For a more detailed demonstration, see the demo video: example_show/演示视频.mp4

![example2](example_show/Screenshot_2025-05-24_21-05-59.png)

![example3](example_show/Screenshot_2025-05-24_21-06-16.png)

![example4](example_show/Screenshot_2025-05-24_21-06-46.png)

![example5](example_show/Screenshot_2025-05-24_21-05-23.png)

## Environment Preparation

Hardware Environment: Orange Pi AI Pro 20T (Ubuntu 22.04), sound card, speaker, microphone, camera, JAKA mechanical arm, micro:bit v2

Software Configuration: Create an environment with Miniconda and install the dependencies.

1. Clone the repository

```
git clone https://github.com/johnnyhank/MIRA-Multimodal-Intelligent-Robotic-Assistant.git
```

2. Install system dependencies

```
sudo apt update
sudo xargs apt install -y < system_packages.txt
```

3. Create a virtual environment

```
conda create -n mira python=3.10.9
```

4. Activate the environment

```
conda activate mira
```

5. Install dependent packages

```
pip install -r requirements.txt
```

**Note:** If you run into environment problems, refer to the two reference documents listed under **Other Directory Descriptions**.

## Usage

Note: before running any program that drives the mechanical arm, calibrate it by following the steps in the Module Introduction; otherwise the arm may not pick up objects correctly.

### Qwen Agent Integration Based on Function Calling

1. Use the built-in Gradio page (supports voice and image input)

```
cd 05-samrt-robot
python webui_qwen.py
```

2. Use the WebUI provided by Qwen Agent

```
cd 05-samrt-robot
python utils_qwen_agent.py
```

### LLM Agent Calling External Functions via Prompt-Generated JSON

```
cd 05-samrt-robot
python gradio_app.py
```

## Module Introduction

### 05-samrt-robot: Intelligent Mechanical Arm Assistant Based on a Multimodal LLM

* API_Key_utils.py: Stores all API keys.
* calib_bd1127.py: Calculates the affine transformation matrix.
* calib_cam_point.py: Gets the coordinates of the camera calibration points.
* calib_grip_test.py: Tests the mechanical arm's picking function.
* calib_robot_point.py: Gets the coordinates of the mechanical arm calibration points.
* gradio_app.py: Provides a WebUI for the LLM agent that calls external functions via prompt-generated JSON.
* start.py: Provides a command-line interface for the LLM agent that calls external functions via prompt-generated JSON.
* utils_agent.py: Defines the agent's scheduling logic, generates a JSON action list from the prompt, and provides example templates.
* utils_cam.py: Provides camera functions, such as opening the camera for real-time preview, taking pictures, and saving images.
* utils_llm.py: Provides large language model (LLM) functions, including time and weather queries and the interface to Amap.
* utils_micro_bit.py: Provides functions for communicating with micro:bit devices, including connecting, sending data, and disconnecting.
* utils_onnx_yolo.py: Implements object detection with an ONNX model, identifying objects in images and returning their center point coordinates.
* utils_qwen_agent.py: Implements an agent with the Qwen Agent framework and defines several custom tool classes for interacting with the mechanical arm, the micro:bit, and the vision model.
* utils_robot.py: Provides basic mechanical arm operations, such as initialization, moving to a specified position, and performing a greeting action.
* utils_spe_rec.py: Provides speech recognition, including recording audio and converting it to text with the Baidu speech recognition API.
* utils_tts.py: Provides text-to-speech, synthesizing speech with Edge TTS and playing it.
* utils_vlm.py: Provides visual language model (VLM) functions, using the Tongyi Qianwen VL model for image recognition.
* utils_vlm_move.py: Combines visual recognition with mechanical arm control to grab specific objects based on image recognition results.
* webui_qwen.py: Gradio-based web user interface that supports multimodal input (text, voice, image) and is integrated with Qwen Agent for a more intuitive interaction experience.

#### Calibration Program

Because the camera position may not be fixed, calibration is required (this repository uses nine-point calibration). The general process and the Python scripts involved are as follows:

1. Get the coordinates of the nine points in the camera coordinate system

```
python 05-samrt-robot/calib_cam_point.py
```

In the pop-up window, click the calibration points in order and record the coordinates of the nine points in the camera coordinate system.

2. Get the coordinates of the nine points in the mechanical arm coordinate system

```
python 05-samrt-robot/calib_robot_point.py
```

Press and hold FREE to move the mechanical arm to each of the nine points in order, press POINT at each one, and record the coordinates of the nine points in the mechanical arm coordinate system. (The nine points must be in the same order in the camera coordinate system and the mechanical arm coordinate system.)

3. Calculate the affine transformation matrix

```
python 05-samrt-robot/calib_bd1127.py
```

Record the printed affine transformation matrix and use it to replace relation_matrix in utils_vlm_move.py.

### Other Directory Descriptions

This project builds on https://gitee.com/myronx/OrangePi-SIC.
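
As a rough illustration of step 3 of the Calibration Program above, the sketch below fits a 2x3 affine matrix to matched camera and mechanical arm points by least squares. It is not the code in calib_bd1127.py: the function names, the 2x3 matrix layout, and the sample points are assumptions made for illustration. It uses only the standard library; in practice NumPy's `numpy.linalg.lstsq` or OpenCV's `cv2.estimateAffine2D` would be the more usual choice.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_3x3(m: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: check for collinear or repeated points")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(cam: Sequence[Point], arm: Sequence[Point]) -> List[List[float]]:
    """Return the least-squares 2x3 matrix M such that [x, y] ~= M @ [u, v, 1]."""
    if len(cam) != len(arm) or len(cam) < 3:
        raise ValueError("need at least 3 matched point pairs")
    rows = [(u, v, 1.0) for u, v in cam]
    # Normal equations (A^T A) p = A^T b, solved once per output axis.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    matrix = []
    for axis in range(2):
        atb = [sum(r[i] * p[axis] for r, p in zip(rows, arm)) for i in range(3)]
        matrix.append(solve_3x3(ata, atb))
    return matrix


def cam_to_arm(matrix: List[List[float]], point: Point) -> Point:
    """Map a camera pixel (u, v) to arm coordinates (x, y)."""
    u, v = point
    x, y = (row[0] * u + row[1] * v + row[2] for row in matrix)
    return (x, y)


if __name__ == "__main__":
    # Made-up 3x3 grid of pixel coordinates, listed in a fixed order.
    cam_pts = [(float(u), float(v)) for v in (100, 200, 300) for u in (100, 250, 400)]
    # Made-up arm coordinates in the SAME order (here generated from a known matrix).
    true_m = [[0.5, 0.0, -120.0], [0.0, -0.5, 380.0]]
    arm_pts = [cam_to_arm(true_m, p) for p in cam_pts]

    m = fit_affine(cam_pts, arm_pts)
    print("fitted matrix:", m)
    print("pixel (250, 200) maps to", cam_to_arm(m, (250.0, 200.0)))
```

Nine point pairs give 18 equations for 6 unknowns, so the fit averages out small clicking and positioning errors. The pairing is by list position, which is why the two point lists must be recorded in the same order.
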
The other modules are largely unmodified.

* 00-starter-pack: Configuration module; the Orange Pi automatically reports its IP address at startup
* 01-voice-interaction: Voice interaction module; speech recognition and speech synthesis
* 02-erniebot: Large language model module based on Ernie Bot
* 03-yolo-om-infer: Vision module; YOLO detection model
* 04-jaka-minicobo: Mechanical arm module; controls the JAKA mechanical arm
  * 04.get_tcppos.py can be used to get the coordinates of the object to be picked up

Orange Pi reference document: https://notes.sjtu.edu.cn/s/iL4X6eLvz

JAKA single-arm robot operation manual: https://notes.sjtu.edu.cn/s/qtKNPFdBT

## Acknowledgments and Reflections

I would like to thank the following people and teams for their support and help during the development of this project (in no particular order):

Thanks to AI1101 teachers Chu Pengzhi and Xie Mingyue for their careful guidance and valuable suggestions.

Thanks to the AI1101 course materials, https://gitee.com/myronx/OrangePi-SIC, and the tools and frameworks provided by the open-source community, which gave this project solid technical support.

Thanks to the platforms that provide the APIs used here:

* Baidu AIP (https://cloud.baidu.com/)
* Wenxin Yiyan (https://aistudio.baidu.com/overview)
* Tongyi Qianwen (https://help.aliyun.com/zh/model-studio/first-api-call-to-qwen)
* Seniverse (https://www.seniverse.com/)
* Amap (https://lbs.amap.com/dev/id/newuser)

I also took away several lessons from developing the project:

1. The design of multimodal interaction has to balance user experience against technical implementation.
2. Combining mechanical arm control with visual recognition demands high precision and requires coordinate system conversion, so the calibration process is one of the key steps.
3. The Qwen Agent framework makes all tools available in the same way. I initially called functions through prompts, as in https://gitee.com/myronx/OrangePi-SIC, but generating JSON through a prompt and parsing it into function calls is cumbersome and error-prone, so I rebuilt the project on the Qwen Agent framework.

In the future, I hope MIRA can keep improving and provide intelligent services in more scenarios!
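
To make the last reflection more concrete, here is a minimal sketch of the prompt-generated JSON approach: the model's reply is parsed as a JSON action list and each action is dispatched to a registered function. It is not the code in utils_agent.py and it does not use the Qwen Agent API. The `tool` decorator, the `{"name": ..., "args": ...}` schema, and the two example tools are all hypothetical, chosen only to show how many failure cases the parsing side has to handle by hand.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry mapping tool names to plain Python functions.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under a tool name (illustrative helper)."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return wrap


@tool("get_time")
def get_time() -> str:
    return "12:00"  # placeholder value for the sketch


@tool("move_arm")
def move_arm(x: float, y: float) -> str:
    return f"arm moved to ({x}, {y})"  # a real tool would drive the arm here


def run_actions(llm_reply: str) -> List[str]:
    """Parse a JSON action list from the model's reply and run each action."""
    try:
        actions = json.loads(llm_reply)
    except json.JSONDecodeError as exc:
        # The model added prose, used single quotes, truncated the reply, etc.
        return [f"error: reply is not valid JSON ({exc.msg})"]
    if not isinstance(actions, list):
        return ["error: expected a JSON list of actions"]

    results: List[str] = []
    for action in actions:
        if not isinstance(action, dict):
            results.append("error: action is not a JSON object")
            continue
        name = action.get("name")
        fn = TOOLS.get(name) if isinstance(name, str) else None
        if fn is None:
            results.append(f"error: unknown tool {name!r}")
            continue
        try:
            results.append(fn(**action.get("args", {})))
        except TypeError as exc:
            # Missing, extra, or non-mapping arguments.
            results.append(f"error: bad arguments for {name!r}: {exc}")
    return results


if __name__ == "__main__":
    print(run_actions('[{"name": "get_time"}, {"name": "move_arm", "args": {"x": 1, "y": 2}}]'))
    print(run_actions("Sure! Here is the plan: move the arm."))
    print(run_actions('[{"name": "move_arm", "args": {"x": 1}}]'))
```

Every branch that returns an error string corresponds to a way a free-form model reply can go wrong. A function-calling framework takes over much of this work by having each tool declare its name and parameters in a structured form.
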
