Deep Learning Part 5: Running Pre-trained Deep Neural Networks through Microsoft Cognitive Services APIs on Raspberry Pi 3 & Parrot Drones
by Anusua Trivedi, Microsoft Data Scientist
This blog series has been broken into several parts, in which I describe my experiences and go deep into the reasons behind my choices. In Part 1, I discussed the pros and cons of different symbolic frameworks, and my reasons for choosing Theano (with Lasagne) as my platform of choice. In Part 2, I described Deep Convolutional Neural Networks (DCNNs) and how transfer learning and fine-tuning improve the training process for domain-specific images. Part 3 of this blog series is based on my talk at PAPI 2016. In Part 4, I show the reusability of a trained DCNN model by combining it with a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). We apply the model to ACS fashion images and generate captions for these images.
In this part, we explore AI on the Internet of Things (IoT). In the video below, you can see an intelligent drone, built by two high school students, that can recognize objects and people in real time using the Microsoft Computer Vision APIs. This post details how we ran pre-trained Deep Neural Networks on a Raspberry Pi and Parrot drones to achieve this.
Motivation
Isha Chakraborty and Neelagreev Griddalur are rising juniors at Monta Vista High School and active members of their high school’s FRC Robotics team. They are smart tinkerers who are both really interested in the world of IoT (Internet of Things). They wanted to expand their knowledge of AI and create some AI for IoT devices. They read my blogs and approached me about doing a fun AI project using either a Raspberry Pi or drones. We ended up working on both! In this blog post, we describe our experiences working together on this AI project, applying image and object recognition on a Raspberry Pi and on drones. We began with the Raspberry Pi, and after succeeding with that we ventured into working with drones.
Introduction
Current research trends have demonstrated that Deep Convolutional Neural Networks (DCNNs) are very effective at automatically analyzing large collections of images and identifying features that can categorize images with minimum error. DCNNs are rarely trained from scratch, as it is relatively uncommon to have a domain-specific dataset of sufficient size. Since modern DCNNs take 2-3 weeks to train across GPUs, training is a costly and time-consuming process. We at Microsoft have identified this as a blocker for AI enthusiasts, so we have pretrained some common DCNNs and released them as Microsoft Cognitive Services APIs to help you get started easily.
Microsoft Cognitive Services API
Microsoft Cognitive Services (formerly "Project Oxford") is a set of APIs, SDKs and services available to developers to make their applications more intelligent, engaging and discoverable. Microsoft Cognitive Services expands on Microsoft’s evolving portfolio of machine learning APIs and enables developers to easily add intelligent features — such as emotion and video detection; facial, speech and vision recognition; and speech and language understanding — into their applications. For this project we used two APIs:
- Microsoft Computer Vision API: We used the “Describe Image” function of the API. This operation generates a description of an image in human-readable language with complete sentences. The description is based on a collection of content tags, which are also returned by the operation. More than one description can be generated for each image, and descriptions are ordered by their confidence score. All descriptions are in English. A minimal request sketch in Python follows this list.
- Microsoft Bing Speech API: Microsoft's Speech APIs can transcribe speech to text and generate speech from text. These APIs enable you to create powerful experiences that delight your users.
- Speech to Text APIs convert human speech to text that can be used as input or commands to control your application.
- Text to Speech APIs convert text to audio streams that can be played back to the user of your application.
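To make the Computer Vision call concrete, here is a minimal sketch of the Describe Image request in Python using the requests library. The endpoint region, API version, and key shown here are placeholders and assumptions; check your own subscription for the correct values.

```python
# A minimal sketch of calling the Computer Vision "Describe Image" operation
# with the requests library. The endpoint region and API version are
# assumptions -- check your own subscription for the correct base URL and key.
import requests

SUBSCRIPTION_KEY = "YOUR_COMPUTER_VISION_KEY"          # placeholder
DESCRIBE_URL = ("https://westus.api.cognitive.microsoft.com"
                "/vision/v1.0/describe?maxCandidates=1")

def describe_image(image_path):
    """Send a local image to the Describe Image operation and
    return the highest-confidence caption as a string."""
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/octet-stream",
    }
    with open(image_path, "rb") as f:
        response = requests.post(DESCRIBE_URL, headers=headers, data=f.read())
    response.raise_for_status()
    result = response.json()
    # Captions come back ordered by confidence score.
    captions = result["description"]["captions"]
    return captions[0]["text"] if captions else "No description found."

if __name__ == "__main__":
    print(describe_image("image.jpg"))
```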
For this project we used Text to Speech APIs, which use REST to convert structured text to an audio stream. The APIs provide fast text to speech conversion in various voices and languages.
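Below is a rough sketch of that REST flow in Python: exchange the subscription key for a short-lived access token, then post SSML to the synthesize endpoint and write the returned audio to a WAV file. The URLs, voice name, and output format string reflect the Bing Speech API as we used it and should be treated as assumptions to verify against the current documentation.

```python
# A minimal sketch of the Bing Speech text-to-speech REST flow: exchange the
# subscription key for a short-lived token, then post SSML to the synthesize
# endpoint and save the returned audio as a WAV file. The endpoint URLs,
# voice name, and output format are assumptions -- verify them against the
# current documentation.
import requests

SPEECH_KEY = "YOUR_BING_SPEECH_KEY"                     # placeholder
TOKEN_URL = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken"
SYNTH_URL = "https://speech.platform.bing.com/synthesize"

def text_to_speech(text, out_path="speech.wav"):
    """Convert a text string to a WAV file using the Bing Speech API."""
    # Step 1: get an access token (valid for roughly ten minutes).
    token = requests.post(
        TOKEN_URL, headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY}).text

    # Step 2: send SSML describing the text and voice to synthesize.
    ssml = ("<speak version='1.0' xml:lang='en-US'>"
            "<voice xml:lang='en-US' xml:gender='Female' "
            "name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>"
            + text + "</voice></speak>")
    headers = {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
        "User-Agent": "SmartCameraDemo",
    }
    audio = requests.post(SYNTH_URL, headers=headers, data=ssml.encode("utf-8"))
    audio.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(audio.content)
    return out_path

if __name__ == "__main__":
    text_to_speech("I see a person holding a camera.")
```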
Raspberry Pi Project Description
For our AI in IoT project, we wanted to start with the Raspberry Pi. After reading different interesting articles, we came across Charles Channon’s article on hackster.io. It is a blog post with great detail on building your own Smart Camera using a Raspberry Pi 3. He also covered using Microsoft’s Computer Vision API and using an LCD plate to display text. He even attached sample code in his GitHub repository, which helped us get started very easily. We extended this project by using the Microsoft Bing Speech API to convert the text to speech.
Our modified code can be found in this GitHub repository.
We followed most of his instructions, with the exception of adding a few things to really personalize the project; the end-to-end flow (sketched in code after this list) was:
- Take a picture with the Raspberry Pi.
- Send that picture to the Microsoft Computer Vision API.
- Get the generated text description of the image from that API.
- Display the text description on the Raspberry Pi LCD screen.
- Convert the text description to speech.
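The sketch below shows roughly how these steps fit together on the Pi. It assumes the picamera and Adafruit_CharLCD Python libraries, aplay for audio playback, and the describe_image() and text_to_speech() helpers sketched earlier in this post; the file path and the 16x2 LCD size are also assumptions.

```python
# A sketch of the Pi-side glue for the steps above. It assumes the picamera
# and Adafruit_CharLCD libraries are installed and that the describe_image()
# and text_to_speech() helpers from the earlier sketches are available.
import os
import time
from picamera import PiCamera
import Adafruit_CharLCD as LCD

IMAGE_PATH = "/home/pi/image.jpg"

def capture_photo(path=IMAGE_PATH):
    """Take a still photo with the Raspberry Pi camera module."""
    camera = PiCamera()
    try:
        time.sleep(2)            # give the sensor a moment to adjust
        camera.capture(path)
    finally:
        camera.close()
    return path

def show_on_lcd(text):
    """Show the first 32 characters of the description on a 16x2 LCD plate."""
    lcd = LCD.Adafruit_CharLCDPlate()
    lcd.clear()
    lcd.message(text[:16] + "\n" + text[16:32])

if __name__ == "__main__":
    photo = capture_photo()
    description = describe_image(photo)        # from the earlier sketch
    show_on_lcd(description)
    wav = text_to_speech(description)          # from the earlier sketch
    os.system("aplay " + wav)                  # play through the speaker
```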
Also, we ended up buying most of the materials from websites other than the ones Charles listed, due to availability and to stay on a reasonable budget! Most of the materials were quite inexpensive, so we ordered the Raspberry Pi 3, the Raspberry Pi Camera Module, the Adafruit LCD display, and an SD card with NOOBS pre-installed, all from Amazon, and we were able to put them to use within a few days. Some extra things that we had lying around and also used for the project include a Bluetooth speaker, a monitor, an HDMI cable, a soldering iron, a wireless keyboard, and a wireless mouse.
Experiment
Our first step was to set up the Raspberry Pi and enable functionality such as the i2c bus and the camera once we had plugged in the camera module, mouse, keyboard, and LCD display. Enabling the i2c bus and the camera can be done from the control panel on the Pi's screen, and you can check that i2c is enabled by entering lsmod | grep i2c into the Linux terminal and looking for “i2c 6780” at the bottom of the generated text.
We also soldered the LCD display so that we could show the output from the Raspberry Pi on its screen. Soldering the LCD display can be quite tricky, so we suggest keeping a multitude of resistors at hand if, like us, you aren’t very experienced with soldering. We were able to improvise, swap in a very similar resistor for the one we had damaged, and still retain the same functionality.
After that, we implemented Charles’ code from the GitHub repository with some necessary modifications to some of the scripts, such as entering sudo crontab -e and then adding @reboot sleep 30 ; sudo python /home/pi/ComputerVision.py & after all of the commented lines. This makes the app execute automatically on startup. We also recommend updating the image path in the code for the picture taken by the Pi, for more usability.
We also wrote code using the Microsoft Bing Speech API to add our own spin on the project: we could then show the description of the object on the LCD display and have it spoken from the speaker or device. Incorporating the text-to-speech component involved a lot of trial and error, because we tried many different ways of seamlessly integrating both APIs, which is why we ended up using the Bing Speech API in the end. The Bing Speech API didn’t have much documentation for Python specifically, so that was a challenge, but there are lots of open-source projects you can use as references to help solve that problem. This is one project that we found helpful.
Using the Microsoft Computer Vision and Bing Speech APIs was an easy choice for us, mainly because of the extensive documentation already available and their many use cases. A demo of the Pi Smart Camera was presented at the Global IoT Conference (video here).
Drone Project Description
In brief, we created a drone that can recognize objects in real time using the Microsoft Computer Vision APIs, which opens up a world of possibilities for the user. Drones, in and of themselves, have a variety of applications: surveillance, photography, quality control, and much more. Combining drones with machine learning opens up a space with the potential to solve issues that plague the planet daily, such as monitoring the status of power lines (which contribute to 40% of electrocution deaths per year) and providing cost-effective surveillance for many large corporations while protecting civilians.
The main inspirations for this project were the major advancements in applications of machine learning, the Microsoft Cognitive Services APIs, and Lukas Biewald’s article in O’Reilly.
The Experiment
By using a Parrot AR.Drone 2.0 Power Edition quadricopter, we were able to use a multitude of ar-drone libraries. This allowed us to use and modify many pre-written functions to better suit our needs, such as integrating takeoff and other basic drone flight sequences into a web interface.
After setting up the drone, we used Telnet to access the drone shell through the open-source ardrone-wpa2 project on GitHub at https://github.com/daraosn/ardrone-wpa2, which essentially allows the drone to be configured to join a local Wi-Fi network with proper internet access.
To maximize the user experience, we created an interactive JavaScript web interface for the drone using Node.js and the Express.js framework, and set up functions that make AJAX requests when buttons are pressed. Doing so allowed us to perform drone functions such as taking off and landing right from the web interface. Using the interface gave us the ability to control the drone from the computer while monitoring its battery level and position in the air. If we had flown the drone from the stock mobile application instead, we would not have been able to process the data from the drone in real time or use cloud-based APIs. Our next challenge was to stream video from the drone, which we did by sending a continuous stream of PNGs to the web interface using the AR Drone library. To maximize the accuracy of the Computer Vision API, we slowed the frame rate of the PNGs to 10 frames per second, which condensed the footage while keeping the overall view intact and simplifying the work for the API.
To make this into a “smart” drone, we integrated the Microsoft Computer Vision API into the web interface so that the drone could identify objects in real time. For this, we wrote a Python script that calls the Computer Vision API: it uploads the image taken by the drone to the service, identifies the object, and converts the resulting text to speech for an auditory response using the Bing Speech API.
For another iteration of the project, we used OpenCV in a Python script that essentially extracts the facial landmarks of each trained image and compares them to the people the drone sees in real life. The images were uploaded in the same way as for the Computer Vision API.
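For readers who want to try something similar, here is a simplified sketch of one way to train and match faces with OpenCV, using a Haar cascade for detection and the LBPH recognizer from the opencv-contrib package. Treat the directory layout, file-naming scheme, and face size as illustrative assumptions rather than a copy of our script.

```python
# A simplified sketch of a face-matching pipeline with OpenCV: detect faces
# with a Haar cascade, train the LBPH recognizer from opencv-contrib on a
# folder of labeled images, then predict the identity of a new face.
# Directory layout and file naming are illustrative assumptions.
import os
import cv2
import numpy as np

CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face(image_path):
    """Return the first detected face as a grayscale crop, or None."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(gray[y:y + h, x:x + w], (200, 200))

def train_recognizer(training_dir):
    """Train LBPH on images named <person>_<n>.jpg in training_dir."""
    names = {i: name for i, name in enumerate(
        sorted({f.split("_")[0] for f in os.listdir(training_dir)}))}
    labels_by_name = {name: label for label, name in names.items()}
    faces, labels = [], []
    for f in os.listdir(training_dir):
        face = extract_face(os.path.join(training_dir, f))
        if face is not None:
            faces.append(face)
            labels.append(labels_by_name[f.split("_")[0]])
    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(faces, np.array(labels))
    return recognizer, names

if __name__ == "__main__":
    recognizer, names = train_recognizer("training_faces")
    face = extract_face("droneimage.png")
    if face is not None:
        label, distance = recognizer.predict(face)
        print("Best match:", names[label], "(distance %.1f)" % distance)
```

LBPH keeps training light enough to run on modest hardware, which is one reason it is a common starting point before moving to a cloud service such as the Microsoft Face API.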
To finalize this project, we wrote shell scripts to automate these processes and increase efficiency.
Challenges
I think one of the most challenging aspects of this project was definitely working with OpenCV and learning about its many pre-written functions. With tons of help from StackOverflow and open-source projects, we were able to hack our way through the project! That said, OpenCV wasn’t extremely reliable, and the documentation was slightly confusing. I would definitely recommend looking into the Microsoft Face API, as we have heard promising things about it; we only tried OpenCV for the sake of trying something new! Incorporating the Face API will definitely be our next step in solidifying this project.
Code WalkThrough
To access the code in this project and to follow along, refer to https://github.com/antriv/Object_Recognition_Drone.
Once the repository has been cloned, see the README.md file, which includes a detailed description of the project and lists some of the necessary dependencies, such as Node.js, the node-ar libraries, and ffmpeg. All of the necessary node modules are included in this repository, but we recommend installing Node.js regardless.
The following is a detailed description of each core file in this project:
ComputerVision.py: In this file, a local file called “droneimage.png” is read and passed through the Computer Vision API. The API returns a result in JSON, and using a JSON parser we extract the text portion of the result. The text is saved into a variable and then passed to the Microsoft Bing Text-to-Speech API. That API is called and the result is stored in a .WAV file, which the shell script plays back.
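As a rough illustration, that flow boils down to something like the following, reusing the describe_image() and text_to_speech() helpers sketched in the Raspberry Pi section; the file names follow the description above, and the actual script in the repository may differ.

```python
# Condensed sketch of the ComputerVision.py flow described above. It assumes
# the describe_image() and text_to_speech() helpers from the earlier sketches
# are defined in the same file or imported.
if __name__ == "__main__":
    caption = describe_image("droneimage.png")       # Computer Vision call; JSON parsed inside
    print(caption)                                    # visible when run from the shell script
    text_to_speech(caption, out_path="speech.wav")    # WAV played back by the shell script
```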
Server.js: This is the main file for working with the drone from your computer. Running “node server.js” invokes the file using the Node.js libraries and launches an Express.js web server. This file contains a few functions that listen for AJAX requests. Once a function is invoked, it passes the request to the Nodecopter libraries (an open-source project) and communicates with the drone.
Start.sh: This file provides a seamless way to use the ardrone-wpa2 scripts and connect the drone to a local network of your choice. This is the first step in starting up the drone: after the drone is powered on, connect to its access point and run this script. It will reconfigure the drone to join the desired network. Currently only 2.4 GHz networks are supported. When you specify an IP, be sure to choose one that is available, and remember to reference that IP in the server.js file.
Object_detect.sh: This file should be run while the Node.js server is running. The script streamlines the process of retrieving an image from the drone and saving it as droneimage.jpg, then processes that image with ComputerVision2.py and plays the audio file generated by the text-to-speech code.
Conclusion
From our overall experience, we learned a lot about the design and development process that reading a tutorial or taking a course cannot teach you. The Raspberry Pi 3 was really easy to use and learn, and it gave us the creative license to modify and personalize the project; it is a great introductory single-board computer to become familiar with, and it will not limit you in most projects imaginable. The feeling when the object is detected with strong accuracy and the device speaks it back to you is priceless, and it can only be experienced if you complete the project yourself. It was a great learning experience, and an exciting introduction to IoT.
In terms of the drone, it was definitely a much more challenging project to take on, and it led us to learn a lot of new machine learning concepts to improve our project. We think drones are something that will become a part of daily life, and we are grateful for the opportunity to innovate with them!
IoT and Artificial Intelligence are fields that are ever-growing, and it is our job to learn as much as we can to create products that can better humanity; these projects were the perfect entrance into the field.
Contact: Please feel free to email me if you have questions.
Anusua Trivedi - antriv@microsoft.com