Mapillary is a lot more than geotagged images that have been uploaded to a server. We use computer vision to extract a vast amount of data from the images.
Our computer vision technology stack is made up of two main components: Structure from Motion and semantic segmentation.
Structure from Motion: creating smooth image transitions
Mapillary uses a technique called Structure from Motion (SfM) to reconstruct places in 3D. By matching the same points across different images, SfM can locate each point in three-dimensional space and therefore determine its position on the map.
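To illustrate the triangulation at the heart of SfM, here is a minimal sketch (not Mapillary's actual pipeline) that recovers a 3D point from a pair of matched pixel coordinates using OpenCV. The projection matrices and pixel coordinates below are made-up values for illustration; in a real SfM system they are estimated from the images themselves.

```python
import numpy as np
import cv2

# Known 3x4 camera projection matrices (intrinsics @ [R|t]).
# In a real SfM pipeline these are themselves estimated from the
# matched points; the values here are illustrative only.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera 1 at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # camera 2 shifted sideways

# The same point observed in both images, as (x, y) image coordinates.
pt1 = np.array([[0.5], [0.3]])   # observation in image 1
pt2 = np.array([[0.4], [0.3]])   # observation in image 2

# Triangulate: returns the point in homogeneous coordinates (4x1).
point_h = cv2.triangulatePoints(P1, P2, pt1, pt2)

# Dehomogenize to get the 3D position.
point_3d = (point_h[:3] / point_h[3]).ravel()
print("Reconstructed 3D point:", point_3d)  # ~[5. 3. 10.] for these values
```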
The more images there are available for a specific point, the more accurately it can be reconstructed. In addition, SfM creates the kind of smooth transitions between images that you can see in the Mapillary viewer—again, provided the images were taken close together, with enough overlap between them.
As long as you capture imagery every few meters, your images will have enough overlap and you will end up seeing something like the sequence below (press play!):
Additionally, in places where 360° coverage isn't available, the Mapillary image viewer features what we call "combined panning," which lets you pan between overlapping regular images, just like you would with a full panorama, simply by dragging your mouse. Try it out on the imagery below!
Semantic Segmentation: object extraction technology
We also run semantic segmentation on the images. Semantic segmentation is a computer vision technique in which a category label is assigned to every pixel in an image. A cluster of adjacent pixels sharing the same category is then predicted to belong to the same object. Using semantic segmentation, we can determine which part of an image is likely a road, which part is likely a streetlight, and so on.
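To make the per-pixel idea concrete, here is a minimal sketch using a publicly available pretrained segmentation model from torchvision. This is not Mapillary's production network (which is trained on street-level categories), and the input filename is hypothetical.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# A publicly available pretrained segmentation model (21 Pascal VOC
# classes). Stands in for a street-level model in this sketch.
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input file
batch = preprocess(image).unsqueeze(0)                 # shape: (1, 3, H, W)

with torch.no_grad():
    output = model(batch)["out"]                       # shape: (1, num_classes, H, W)

# Assign each pixel its most likely category: this per-pixel label map
# is the semantic segmentation.
label_map = output.argmax(dim=1).squeeze(0)            # shape: (H, W)
print(label_map.shape, label_map.unique())
```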
Semantic segmentation together with 3D reconstruction enables us to extract the 3D positions of objects such as traffic signs and display them on the map.
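As a toy illustration of the geometry involved, the sketch below estimates an object's position as the least-squares intersection of bearing rays from two geotagged cameras. It simplifies the full 3D reconstruction to a flat, local 2D plane measured in meters—a reasonable approximation over short distances—and all values are invented for the example.

```python
import numpy as np

def intersect_bearings(positions, bearings_deg):
    """Least-squares intersection of 2D rays.

    positions    -- camera locations as (x, y) in a local metric frame
    bearings_deg -- compass bearings (degrees clockwise from north)
                    from each camera toward the detected object
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for (x, y), bearing in zip(positions, bearings_deg):
        theta = np.radians(bearing)
        # Unit direction: north = +y, east = +x.
        d = np.array([np.sin(theta), np.cos(theta)])
        # Projector onto the ray's normal space: measures the
        # perpendicular offset of a candidate point from this ray.
        proj = np.eye(2) - np.outer(d, d)
        A += proj
        b += proj @ np.array([x, y])
    # The point minimizing the summed squared distance to all rays.
    return np.linalg.solve(A, b)

# Two cameras 10 m apart, both sighting the same traffic sign.
cameras = [(0.0, 0.0), (10.0, 0.0)]
bearings = [45.0, 315.0]  # degrees; the rays cross at (5, 5)
print(intersect_bearings(cameras, bearings))  # ~[5. 5.]
```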
Our semantic segmentation technology results in the following output data:
Object detections are instances of different objects that have been detected in images. Since Mapillary images are geotagged, you can get a dataset of image locations for a particular object that interests you and use it as a filter to quickly find and look at all images where that object is present (see the sketch after this list).

Map features are different objects positioned on the map. If the same object has been detected in multiple geotagged images, we can use triangulation to estimate its location and position it with a latitude and longitude.

We split map features into two categories: points and traffic signs. Mapillary currently generates map features for 42 point feature classes and 1,500+ classes of traffic signs. Because of the sheer number of traffic sign types, and the need to differentiate between them, we treat traffic signs as their own category when referring to map features.
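To make the object-detection filtering described above concrete, here is a small sketch. The record layout and class names are invented for illustration and are not Mapillary's actual API schema.

```python
# Hypothetical records: each detection pairs an object class with the
# geotagged image it was found in. Field names and class names are
# illustrative only, not Mapillary's API schema.
detections = [
    {"image_id": "img_001", "object": "traffic-sign--stop", "lat": 55.60, "lon": 13.00},
    {"image_id": "img_002", "object": "object--street-light", "lat": 55.61, "lon": 13.01},
    {"image_id": "img_003", "object": "traffic-sign--stop", "lat": 55.62, "lon": 13.02},
]

# Filter down to every image location where a particular object appears.
wanted = "traffic-sign--stop"
locations = [
    (d["image_id"], d["lat"], d["lon"])
    for d in detections
    if d["object"] == wanted
]
print(locations)  # [('img_001', 55.6, 13.0), ('img_003', 55.62, 13.02)]
```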