Generate image captions with the Computer Vision API

The Azure Computer Vision API can extract all sorts of interesting information from images (tags describing objects found in the images, locations of detected faces, and more), but today I want to play around with just one: caption generation. I was inspired by @picdescbot on Twitter, which selects random images from Wikimedia Commons and generates a caption using the API. The results are sometimes impressive, sometimes funny, and sometimes bizarre. On the bizarre side, recurring motifs include "lush green field" and "a clock tower on a building". The bot doesn't discriminate when the confidence score returned by the API is low, though, so I wanted to take a look at that as well.

I'll use R to interface with the API, using the code provided below. You can use the code too, but you'll need an Azure login first. If you don't have one, you can sign up for a free Azure account (or Azure for Students, which doesn't require a credit card), and get some free credits to boot. We won't be using any credits in this case, though, as the Computer Vision API has a free pricing tier: it's limited to 5,000 calls a month and 20 calls per minute, but that's more than sufficient for our needs.

To generate a key for the Computer Vision API, visit the Azure Portal and click "Create a Resource". Select "AI + Cognitive Services" and then "Computer Vision API". Choose a name, your subscription (yours will be different), a data center (choose a region close to you), the pricing tier (F0 is the free tier), and create a new resource group to hold your keys.

It'll take just a moment to set things up, but once that's done, select "Overview" to display the API endpoint, and "Keys" to display the API keys. (Two keys are generated, but you will only need to use Key 1.)

Now launch R, load a couple of packages we'll need, and save the endpoint and API key into R objects as shown below:

library(tools)
library(httr)
vision_api_endpoint <- "https://westus.api.cognitive.microsoft.com/vision/v1.0"
vision_api_key <- "7f1f01ac24064abd80970f41a90237e7"

(Your endpoint may differ depending on the region you chose, and your API key will definitely be different: that one's invalid.) With those two pieces of information, we're all set to go.

Step 1 is to generate the URL of an image from Wikimedia Commons. This is possible using the Wikimedia API, and easy once you know how. (It took me quite a while and a lot of trial and error to figure out that API, though.) The simple R function below will query the Wikimedia Commons API and return the URL of a random image file. It also checks that the image meets the requirements of the Computer Vision API, and throws an error if not.
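Here is a minimal sketch of such a function, not necessarily the exact code behind the results in this post. It assumes the MediaWiki Action API's random generator restricted to the File: namespace, and it assumes the Computer Vision API input limits are JPEG/PNG/GIF/BMP files under 4 MB and larger than 50 x 50 pixels. The helper name meets_vision_requirements is hypothetical.

```r
# Hypothetical sketch; the httr:: and tools:: prefixes mean it does not
# depend on which packages are currently attached.

# Check an image against the assumed Computer Vision API input limits:
# JPEG/PNG/GIF/BMP, under 4 MB, larger than 50 x 50 pixels.
meets_vision_requirements <- function(url, width, height, bytes) {
  ext <- tolower(tools::file_ext(url))
  ext %in% c("jpg", "jpeg", "png", "gif", "bmp") &&
    bytes < 4 * 1024^2 &&
    width > 50 && height > 50
}

random_image <- function() {
  # Ask for one random page in the File: namespace (6), together with
  # its URL, size, and extended metadata.
  resp <- httr::GET(
    "https://commons.wikimedia.org/w/api.php",
    query = list(
      action = "query", format = "json",
      generator = "random", grnnamespace = 6, grnlimit = 1,
      prop = "imageinfo", iiprop = "url|size|extmetadata"
    )
  )
  httr::stop_for_status(resp)
  info <- httr::content(resp, as = "parsed")$query$pages[[1]]$imageinfo[[1]]

  if (!meets_vision_requirements(info$url, info$width, info$height, info$size)) {
    stop("Image does not meet Computer Vision API requirements: ", info$url)
  }

  # Commons descriptions often contain HTML, so strip tags crudely.
  desc <- info$extmetadata$ImageDescription$value
  desc <- if (is.null(desc)) NA_character_ else gsub("<[^>]+>", "", desc)

  # Return the URL, with dimensions and description attached as attributes.
  structure(info$url,
            dims = c(w = info$width, h = info$height),
            desc = desc)
}
```

Keeping the requirements check in its own small function makes it easy to adjust the limits if the service's documentation says otherwise.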

You can now call the function to generate the URL of a random image. (If it throws an error, just try again.) It also returns as attributes the dimensions of the image and the description from the Wikimedia Commons page, which will be interesting to compare to the caption generated by the Computer Vision API.

> random_image()
[1] "https://upload.wikimedia.org/wikipedia/commons/b/b4/Villa_Malva.jpg"
attr(,"dims")
   w    h 
3072 2304 
attr(,"desc")
[1] "Villa Malva i Ramlösa brunnspark i Helsingborg."

Step 2 is to use the URL generated by that function as input to the Computer Vision API with the second R function below. It uses the global vision_api_endpoint and vision_api_key objects you defined earlier to call the Computer Vision API, requesting the Description (caption). It will also try to identify celebrities and famous landmarks, if it finds them in the image (for example, the generated caption for this image is "TOM CRUISE wearing a suit and tie").
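Here is a minimal sketch of such a function, again not necessarily the exact code behind the results in this post. It assumes the v1.0 "analyze" operation with visualFeatures=Description and details=Celebrities,Landmarks, and a JSON response whose captions sit under description$captions. The helper name format_caption is hypothetical.

```r
# Hypothetical sketch; assumes the v1.0 "analyze" operation.

# Build the comparison printout from its three parts.
format_caption <- function(caption, confidence, desc) {
  sprintf(
    "Wikimedia description: %s\nComputer Vision API caption (confidence: %.1f%%): %s",
    desc, 100 * confidence, caption
  )
}

image_caption <- function(u,
                          endpoint = vision_api_endpoint,
                          key = vision_api_key) {
  # POST the image URL as JSON; the key travels in a request header.
  resp <- httr::POST(
    paste0(endpoint, "/analyze"),
    query = list(visualFeatures = "Description",
                 details = "Celebrities,Landmarks"),
    httr::add_headers(`Ocp-Apim-Subscription-Key` = key),
    body = list(url = as.character(u)),
    encode = "json"
  )
  httr::stop_for_status(resp)

  captions <- httr::content(resp, as = "parsed")$description$captions
  if (length(captions) == 0) stop("No caption returned for ", u)
  best <- captions[[1]]

  # Fall back to NA when the URL carries no "desc" attribute.
  desc <- attr(u, "desc")
  if (is.null(desc)) desc <- NA_character_

  cat(as.character(u), "\n")
  cat(format_caption(best$text, best$confidence, desc), "\n")
  invisible(best)
}
```

The function returns the first caption invisibly, so you can also capture the text and confidence score for later use rather than only printing them.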

That function prints the URL, the caption generated by the Computer Vision API, and its confidence score (a value between 0 and 1), along with the Wikimedia Commons description for comparison. Let's give it a go. In each case I simply ran the two lines below.

> u <- random_image()
> image_caption(u)

Here are a few results:

Wikimedia description: Gerfalke (Falco rusticolus) in Westgrönland
Computer Vision API caption (confidence: 92.9%): a bird that is standing in the grass

Wikimedia description: Vista de la casa de la Hacienda Mozanga en 1995.
Computer Vision API caption (confidence: 90.2%): a house with trees in the background

Wikimedia description: Dj Israeli, en un concierto.
Computer Vision API caption (confidence: 44.3%): a man riding on the back of a bicycle

That last example is a good lesson not to trust the captions when the confidence score is low! Let us know about any interesting, funny or strange captions you get in the comments below.
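If you'd rather not see low-confidence captions at all, one option is to filter on the score. This is a minimal sketch: the function name trusted_caption and the 0.5 cutoff are my own illustrative assumptions, not anything recommended by the API.

```r
# Hypothetical helper: keep a caption only if its confidence clears a cutoff.
# min_confidence = 0.5 is an arbitrary illustrative threshold.
trusted_caption <- function(caption, confidence, min_confidence = 0.5) {
  if (is.na(confidence) || confidence < min_confidence) {
    return(NA_character_)
  }
  caption
}

# The bicycle caption above (44.3%) is dropped, the house caption (90.2%) is kept.
trusted_caption("a man riding on the back of a bicycle", 0.443)
trusted_caption("a house with trees in the background", 0.902)
```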
