If you're a fan of the HBO show Silicon Valley, you probably remember the episode where Jian Yang creates an application to identify food using a smartphone camera.
Surprisingly, the app in that scene isn't just a pre-recorded special effect: the producers actually developed a smartphone application using TensorFlow (and you can even download the app for your phone to play with). It was an impressive feat, especially given the relative infancy of deep learning tools back in 2016. Things have advanced since then, though, and I wanted to see how easy it would be to build an equivalent to the Not Hotdog application using Microsoft's Custom Vision API. I don't know React, so instead of a phone application I created an R function to classify an image on the Web. As it turns out, the process was fairly simple and I needed only a small set of training images (fewer than 200) to build a pretty good classifier.
In the paragraphs below, I'll walk you through the process if you want to try it out yourself. It doesn't have to be hotdogs, either: it should be easy to adapt the script to detect other kinds of images, and even make multiple classifications.
To run the R script, in addition to R (any recent version should work, but I tested it with R 3.4.1) you'll also need an Azure subscription. If you don't have one already, you can get a free Azure account (or Azure for Students, if you're a student without a credit card). The Custom Vision API is part of the "always free" services, so you can run this script without any charges or depletion of your free credits. You'll also need to generate keys for the Custom Vision API, and I describe in README.md how to generate the keys and save them in a keys.txt file.
You'll also need a set of images to train your classifier. In our case, that means some images of hotdogs. It's also useful to provide a "negative set" of images that are not what you intend to detect, but might be mistaken for it. (In this case, I used images of tacos and hamburgers.) You can source the images however you like, but since the API allows you to provide URLs of images on the Web, that's what I focused on finding. One easy way to find URLs of images is to use ImageNet Explorer and search for one of the provided tags (here called "synsets").
This was an easy way to generate hundreds of URLs of pre-classified images. The only problem is that some of the URLs no longer work, so I used an R script to help me filter out the broken URLs and sample from the remainder. (I also rejected, by visual inspection, a few images that were obviously not representative of the intended class.) I saved the result into files of hotdog URLs and similar food URLs, so you can skip this step if you like.
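If you'd rather build your own URL lists, here is a minimal sketch of how the filter-and-sample step might look in base R. It is not the script from the repo: the names `url_is_alive` and `filter_and_sample` are made up for illustration, and "alive" here only means that R could open a connection to the URL, which is an assumption about what counts as a working image link.

```r
# Hypothetical helper: TRUE if R can open a connection to the URL.
# Opening the connection fails with an error for dead hosts or HTTP errors.
url_is_alive <- function(u, timeout = 5) {
  old <- options(timeout = timeout)      # limit how long we wait per URL
  on.exit(options(old), add = TRUE)      # restore the previous setting
  tryCatch({
    con <- suppressWarnings(url(u, open = "rb"))
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical helper: keep the reachable URLs, then sample up to n of them.
# `is_alive` can be swapped for any function returning TRUE or FALSE.
filter_and_sample <- function(urls, n, is_alive = url_is_alive, seed = NULL) {
  alive <- urls[vapply(urls, is_alive, logical(1), USE.NAMES = FALSE)]
  if (!is.null(seed)) set.seed(seed)     # optional, for a repeatable sample
  alive[sample.int(length(alive), min(n, length(alive)))]
}
```

Calling `filter_and_sample(candidate_urls, n = 100)` on a vector of candidate URLs would then return at most 100 reachable ones. Checking each URL over the network is slow, so expect this to take a while on a list of several hundred.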
The next step was to write an R function to identify an image as a hot dog (or not) given the URL of an image on the web. This requires a bit of setup using the Custom Vision API first, namely:
- Define a "project" and a "domain" for classification. In addition to classifying general images, you can use pre-trained neural networks to detect things like landmarks or shopping items. I used the "Food" domain, which can better identify hotdogs and other foods.
- Define "tags" to classify the training images. I could have used just a "hotdog" tag, but I also created a "nothotdog" tag for the tacos and hamburgers, which improved the performance of the "hotdog" class detection.
- Pass in the URLs of the training images for each category. (The only trick here is that the API accepts a maximum of 64 URLs at a time, so an R function loops through the list if there are more than that.)
- Train the project using the provided images, and retrieve the ID of the training session ("iteration") for use in the prediction step.
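The 64-URL limit in the upload step is easy to handle with a little batching. The sketch below shows one way to do it in base R; `batch_urls` and `upload_in_batches` are illustrative names, and `upload_batch` is a hypothetical stand-in for whatever function actually sends one batch of URLs to the API.

```r
# Split a vector of URLs into consecutive batches of at most `batch_size`.
batch_urls <- function(urls, batch_size = 64) {
  if (length(urls) == 0) return(list())
  unname(split(urls, ceiling(seq_along(urls) / batch_size)))
}

# Apply `upload_batch` (a hypothetical function that submits one batch
# of URLs) to each batch in turn, collecting the results in a list.
upload_in_batches <- function(urls, upload_batch, batch_size = 64) {
  lapply(batch_urls(urls, batch_size), upload_batch)
}

# Example with made-up URLs: 150 images become batches of 64, 64 and 22.
urls <- sprintf("http://example.com/img%03d.jpg", 1:150)
lengths(batch_urls(urls))
```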
One of the nice things about the Custom Vision API is that as you're stepping through the R code, you can check your progress at customvision.ai, a web-based interface to the service. You can check the recall and precision of the trained model on the training data, and test out predictions on new images. (You can even tag those test images and incorporate them into new training data.)
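As a reminder of what those two numbers mean: precision is the share of images predicted as a tag that really belong to it, and recall is the share of images belonging to a tag that were predicted as such. The sketch below computes both for a handful of made-up labels; the data and the function name `precision_recall` are purely illustrative and have nothing to do with the actual model results.

```r
# Precision and recall for one tag, given true and predicted labels.
precision_recall <- function(actual, predicted, positive = "hotdog") {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Made-up example: 3 real hotdogs, 2 other foods.
actual    <- c("hotdog", "hotdog", "hotdog",    "nothotdog", "nothotdog")
predicted <- c("hotdog", "hotdog", "nothotdog", "hotdog",    "nothotdog")
precision_recall(actual, predicted)  # both values are 2/3 for this data
```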
Now we're ready to create our prediction function in R:
You could also improve the function (as I did) by checking for invalid URLs, and by converting the probability predictions for the classes into classifications based on a probability threshold. In our case, that's "Hot dog", "Other food" and "Not a hotdog". Let's try some examples:
> hotdog_predict("http://www.hot-dog.org/sites/default/files/pictures/hot-dogs-on-the-grill-sm.jpg")
http://www.hot-dog.org/sites/default/files/pictures/hot-dogs-on-the-grill-sm.jpg
"Hotdog"
Yep, that's a hotdog. Let's try something else:
> hotdog_predict("https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Burrito_with_rice.jpg/1200px-Burrito_with_rice.jpg")
https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Burrito_with_rice.jpg/1200px-Burrito_with_rice.jpg
"Not Hotdog (but it looks delicious!)"
Yup, that's a burrito. It's not perfect though: it does misclassify some things (at the 50% probability threshold, anyway):
> hotdog_predict("https://www.recipetineats.com/wp-content/uploads/2017/09/Spring-Rolls-6.jpg")
https://www.recipetineats.com/wp-content/uploads/2017/09/Spring-Rolls-6.jpg
"Hotdog"
Close, but no cigar-shaped food item. Those are spring rolls.
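To see why the threshold matters, here is a rough sketch of how per-tag probabilities could be turned into one of the three classes mentioned earlier. This is not the function from the repo: `classify_hotdog` is an illustrative name, the mapping from the "hotdog" and "nothotdog" tags to labels is my assumption, and the probabilities in the example are invented rather than what the service returned for the spring rolls.

```r
# `probs` is a named numeric vector of per-tag probabilities, using the
# tag names "hotdog" and "nothotdog" (assumed mapping, see above).
classify_hotdog <- function(probs, threshold = 0.5) {
  if (probs[["hotdog"]] > threshold) {
    "Hot dog"
  } else if (probs[["nothotdog"]] > threshold) {
    "Other food"      # looks like the taco/hamburger negative set
  } else {
    "Not a hotdog"    # neither tag is confident enough
  }
}

# Invented probabilities for a borderline image.
borderline <- c(hotdog = 0.62, nothotdog = 0.30)
classify_hotdog(borderline)                   # "Hot dog" at the 50% threshold
classify_hotdog(borderline, threshold = 0.7)  # "Not a hotdog" at 70%
```

A stricter threshold trades false positives like this one for more missed hotdogs, so the right value depends on which mistake bothers you more.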
We could improve the performance by providing more training images (including spring rolls in my negative set would probably have helped with the above misclassification) or by tuning the probability threshold. Let me know how it performs on other images you find. Try it yourself using the scripts and data files provided in the repo below.
GitHub (revodavid): nothotdog