What would you say if, right after a good breakfast with your family, you looked up to see a shark swimming above their heads? Yesterday morning, we had several such visits in my house. Amazingly, we were able to accommodate quite a large tiger on our sofa, a curious turtle on the carpet and a friendly shark under our kitchen ceiling. Not to mention the cute pony that was waiting patiently on the terrace for my daughter.
In April 2020, as I sit and write this article, millions of children around the globe are not able to go to school or on outings because they are on lockdown. This has resulted in the need to find new ways to overcome our physical bounds and access knowledge to keep us both mentally healthy and entertained as we wait to be able to return to normal life.
The techniques used to blur the boundaries between the individual and the world around them and enhance the reach of their brains are called cognitive augmentation (CA). Human interfaces for augmented cognition have been studied for a long time as their usage leads to seamless knowledge transfer and improved learning and decision-making processes.
Let’s have a look at the submerged part of the 'iceberg' that brings the AR applications that run on your mobile phone to life.
What’s under Augmented Reality’s hood?
The augmented reality app from Google gave us the opportunity to meet animals up close on a life-sized scale. The rest of the animals (you can find the full list here) will certainly be 'visiting us' during the next couple of days.
The zoologic uncanny valley that I was expecting is successfully bridged by Google’s AR app through an exceptional combination of spatial stability and the quality of the models. The animals are able to ‘sit’ or ‘move’ on planar surfaces (the floor, table, or ceiling), with their 3D models presented without any glitches. The animations are natural and well done. Of course, there is more work to be done on the animals with long fur, such as the lion’s mane, but this is to be expected if you think about the complexity of modelling hair. On the other hand, the flat or smooth surfaces, such as the turtle’s back, look awesome. The real scale is also a great asset for understanding the size of the animal.
Another great AR application is Apple's virtual measuring tape, which lets you take physical measurements of the world using a smartphone.
Technically speaking, AR is based on classical computer vision algorithms with SLAM (Simultaneous Localisation And Mapping) at its core. Such an algorithm compares visual features of the scene from successive camera frames in order to calculate the relative movement between frames. There are already very good SLAM implementations from frameworks like ARKit (Apple), ARCore (Google), MRTK (Microsoft) or the cross-platform AR Foundation (Unity3D), so there is no need to implement SLAM from scratch.
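To make the idea of "relative movement between frames" a little more concrete, here is a minimal sketch in Python. It is not how ARKit or ARCore work internally; it only assumes that some feature detector has already matched the same points in two successive frames, and that the motion is a simple 2D rotation plus translation. The function name and the sample points are made up for illustration. A real SLAM pipeline works in 3D, rejects bad matches and builds a map at the same time.

```python
import numpy as np


def estimate_rigid_motion(prev_pts, curr_pts):
    """Least-squares 2D rotation R and translation t with curr ~= R @ prev + t.

    Hypothetical helper: prev_pts and curr_pts are (N, 2) arrays holding the
    same features, already matched, in the previous and the current frame.
    """
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)

    # Centre both point sets so that only the rotation remains to be solved.
    prev_centroid = prev_pts.mean(axis=0)
    curr_centroid = curr_pts.mean(axis=0)

    # Cross-covariance of the centred sets, then SVD (the Kabsch method).
    h = (prev_pts - prev_centroid).T @ (curr_pts - curr_centroid)
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, d]) @ u.T
    translation = curr_centroid - rotation @ prev_centroid
    return rotation, translation


if __name__ == "__main__":
    # Made-up matched features: the second frame is the first one rotated
    # by 10 degrees and shifted by (3, -2) pixels.
    theta = np.radians(10.0)
    true_rotation = np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta), np.cos(theta)]])
    previous = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
    current = previous @ true_rotation.T + np.array([3.0, -2.0])

    rotation, translation = estimate_rigid_motion(previous, current)
    angle = np.degrees(np.arctan2(rotation[1, 0], rotation[0, 0]))
    print("estimated rotation (degrees):", angle)
    print("estimated translation:", translation)
```

Chaining such frame-to-frame estimates gives a rough trajectory of the camera, which is the "localisation" half of SLAM; the frameworks above do this far more robustly than the sketch suggests.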
In spite of the advances offered by the available SLAM frameworks, more ingredients are needed to reach a high-quality AR experience. First, the application has to run at the edge, meaning on devices that are near the source of data (the camera, microphone, etc.) and are not necessarily connected to the internet. These devices must be able to process data and take decisions in real time without the lag introduced by the back-and-forth communication with remote servers. Then, there is an increasing need for intelligent, machine-learning-specific functionalities that we, as human beings, take for granted. For instance, the recognition of people, objects and even larger scenes, pose estimation, semantic segmentation, motion understanding, visual anomaly detection, text reading (OCR), audio recognition or text translation are just a few examples of areas in which humans expect high performance and reliability from machines. More complex applications rely on algorithms that provide human-like performance. For instance, semantic segmentation is crucial for placing virtual objects in a scene behind other existing objects.
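The occlusion example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a segmentation model has already produced a per-pixel mask of the real objects, and we somehow know that those objects are in front of the virtual one (in practice this also needs depth information). The function and parameter names are hypothetical and do not come from any AR framework.

```python
import numpy as np


def composite_with_occlusion(camera_frame, virtual_layer, virtual_alpha,
                             occluder_mask):
    """Draw a rendered virtual object into a camera frame, behind occluders.

    camera_frame:  (H, W, 3) float array, the real image, values in [0, 1]
    virtual_layer: (H, W, 3) float array, the rendered virtual object
    virtual_alpha: (H, W) float array, 1 where the virtual object covers
                   the pixel, 0 where it does not
    occluder_mask: (H, W) boolean array, True where a real object (taken
                   from the segmentation output) is in front of the
                   virtual object
    """
    occluder_mask = np.asarray(occluder_mask, dtype=bool)

    # The virtual object is only visible where no real object hides it.
    visible = np.asarray(virtual_alpha, dtype=float) * (~occluder_mask)
    visible = visible[..., None]  # broadcast over the colour channels

    return visible * virtual_layer + (1.0 - visible) * camera_frame


if __name__ == "__main__":
    # A tiny 2x2 example: black camera image, white virtual object covering
    # everything, and a real object hiding the top-left pixel.
    frame = np.zeros((2, 2, 3))
    tiger = np.ones((2, 2, 3))
    alpha = np.ones((2, 2))
    sofa_mask = np.array([[True, False], [False, False]])
    print(composite_with_occlusion(frame, tiger, alpha, sofa_mask)[..., 0])
```

Without the mask, the virtual animal is simply painted on top of everything, which is what breaks the illusion when it should be standing behind the sofa.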
AR applications are an ideal growth bed for machine learning
With the camera always on, the application produces a continuous image stream that provides structured data (objects reliably tracked, orientation, displacement, etc.) in a repeatable manner (the same object or features are seen from different angles). Therefore, ML models can be used to boost AR applications and make their use even more natural for us humans.
Just as my kids paid attention to the details of a 3D giant sea turtle and engaged with it for much longer than they would normally do by studying its (2D) picture in a book, AR can be applied in other domains as an enabler for cognitive augmentation. Let’s take a look at a few examples.
At Endava, we used AR for remote collaboration in factories and for creating a fully interactive and explorable virtual motor show. These experiences can be achieved using VR headsets or specific AR devices such as Microsoft’s HoloLens or smart contact lenses with embedded cameras. Similarly, field workers in the surveillance industry can use AR-enhanced glasses to quickly spot potential threats detected by ML.
The real estate industry can also benefit from the use of AR and VR by helping users understand the configuration of a space or various decoration options through an immersive experience. Pinterest’s tool for placing real objects in virtual scenes enables a visual discovery of items which, in turn, leads to higher engagement from users. Devices such as the Matterport camera can create a photorealistic 3D model of the environment that can be accessed remotely by visitors. Ikea uses a similar idea for virtually placing furniture in our homes.
All the above examples show useful applications of AR or VR built upon ML predictions. However, even more important is the powerful combination of these technologies, which enables double-loop learning. We can collect data about how we interact with our world when we have access to its augmented version and understand the 'why' of our actions based on the patterns driving our decisions.
The AR & ML technology stack
A successful AR application is one that is in the hands of the consumers and is heavily used. A new AR app can help a provider approach new customers, offer bespoke solutions or create new revenue opportunities through direct or cross-sales. However, in order to make it successful in the long run, it’s important to collect data, discover usage patterns and improve upon those. Let’s return to Google’s application that shows 3D animals in AR. What if the 'animal' were able to detect the emotion on the face of a child and interact accordingly? What if the app could recognise the objects in a room and the cat could jump up onto a table or the dog could try to play with a toy? The sky is the limit for the interactions that the AR assets can have with the objects and the living creatures in the scene.
Edge-deployed AR & ML apps are the result of a technology stack such as the one shown in Figure 2. The training and inference of the ML models use the computing power provided by the hardware placed at the bottom layer of the stack. The AR applications rely on the edge device’s own processing power. The AR assets are placed in the scene as (static or animated) 3D objects whose appearance is boosted by ML models. With training data provided by the initial AR applications, the ML models are either created using available Platforms as a Service (PaaS) or accessed through Software as a Service (SaaS) APIs. New utilisation patterns, discovered during the use of the applications, are fed into the platform to create or improve the ML models.
Figure 2. AR & ML technology stack
One of the main challenges in running ML on the edge is the need for low latency. In the case of AR apps, any lag is literally visible. Usually, a graphics card or a neural compute stick (NCS), such as Movidius, is used to increase the inference speed by running massively parallel algorithms.
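A quick way to reason about this is a per-frame time budget. As an illustration, assume the app renders at 30 frames per second (a figure I am picking for the example, not one taken from any specific app): each frame then leaves roughly 33 ms for everything, tracking, inference and rendering included. The Python sketch below times an arbitrary inference function and checks it against a share of that budget. The function names and the 50% share are assumptions for illustration only.

```python
import statistics
import time


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def measure_latency_ms(infer, sample, runs=50, warmup=5):
    """Return (median, worst) latency in milliseconds of infer(sample).

    infer is any callable standing in for a model's forward pass.
    """
    for _ in range(warmup):  # let caches and lazy initialisation settle
        infer(sample)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def fits_budget(latency_ms, fps, share=0.5):
    """True if the latency uses at most the given share of the frame budget."""
    return latency_ms <= share * frame_budget_ms(fps)


if __name__ == "__main__":
    # Placeholder "model": summing a list stands in for real inference.
    median_ms, worst_ms = measure_latency_ms(sum, list(range(10_000)))
    print("frame budget at 30 fps (ms):", frame_budget_ms(30))
    print("median / worst latency (ms):", median_ms, worst_ms)
    print("fits half of the budget:", fits_budget(worst_ms, fps=30))
```

Looking at the worst case rather than only the median matters here, since a single slow frame is enough for the user to see the virtual object stutter.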
What’s the optimal team shape?
This brief overview reveals the complexity of an ecosystem that lives at our fingertips. As data scientists or researchers, we often tend to see a technology or an ML model as a goal in itself. Once our own intricate mix of algorithms and hardware works, in perfectly controlled conditions, we tend to claim victory. Unfortunately, this huge effort, from our perspective, represents only a tiny share of the effort required for production-level deployment of a mature application that works out there, in the hands of the merciless users. Nobody will appreciate the robustness of the SLAM algorithm if it fails, without warning, in low-light conditions. The users will leave the application if the animals turn upside down because of a gimbal lock error. And no one cares why the app crashed, even if it was only because the load balancer didn’t work properly.
My point here is that, for a successful product, several teams must cooperate along its whole data science lifecycle. Let’s take a quick glance at this process, which involves at least three teams: a Data Science team (DS), a Development and Testing team (DT) and an MLOps team. While the teams may have overlapping roles, there are a few quite unique profiles that are needed for a successful project.
First, the business need must be clearly understood with the help of the Business Analyst from the DS team. If you aim at delivering real value in the market, make sure that the problem you pick is structured, repeatable and predictable.
The next step is data acquisition. As a rule of thumb, the available data needs to satisfy the 4 V's: Volume, Variety, Velocity and Veracity. The DS team must include a Data Analyst whose role is to provide inspiration from data and help the decision makers avoid confirmation bias. Since data is the new oil, plan from day 1 how you’ll ramp up the volumes of real data handled by the application and involve a Software Architect from the DT team early in the discussions.
The model’s proof of concept (PoC) is usually built by the Applied ML engineer (part of the DS team) by experimenting with existing models. Although transfer learning often works just fine, in some cases the models for solving narrow problems have to be built from scratch by another member of the DS team: the Researcher.
Once the PoC provides satisfying results, it has to leave the Jupyter notebook and move down the pipeline, towards the end users. The engineering skills needed for the productisation, integration, end-to-end testing, monitoring and quality assurance are provided by the collaboration between the DT and DS teams.
In software engineering, deployment at scale, infrastructure and the monitoring of the live operation are ensured by the DevOps team. MLOps, the equivalent of DevOps for ML, is the set of practices for a healthy development lifecycle that leads to systems that are operable, manageable and maintainable. Therefore, for successfully deploying an ML-enhanced AR application in the wild, we’ll need a cross-functional team with an eye on the end goal and with a tight connection with the end users.
Depending on the complexity of the project, the technical team must also involve roles such as a UX designer, a Data Product Manager, a Graphic Designer, a Project Manager and a Product Owner.
Besides the technical considerations, a successful team also needs to consider the major challenge of integrating the product in the 'production line' of the end user. Even for an entertainment application, the users must adopt it and be able to enjoy its benefits. Therefore, a few other roles should be involved: a Domain Expert, an Ethicist, a Philosopher or a Musician, among others.
The users of the final applications, as in my kids’ example, are implicit supporters of Moravec’s paradox. They are not shocked by a lion in the living room and expect the animal to navigate between the objects without difficulty. As digital natives, they take for granted that the technology understands them and thinks like humans do. We are truly on a path in which multi-experience replaces technology-literate people with people-literate technology.
The fact that, after a while, my kids left me alone admiring the functionality of the app and imagining the wonderful improvements that could be made proves that it needs ML to be added for a more engaging experience! Or that I’m a hopeless geek.