MLOps Blog

Create a Face Recognition Application Using Swift, Core ML, and TuriCreate

8 min
Aymane Hachcham
25th April, 2023

Facial recognition technologies have been around for a while now, and they are involved in more and more applications that are reshaping our lives. Applications relying on such technologies must assure the end customer of advanced trustworthiness regarding data privacy and security. Although recent ethical controversies like Clearview AI have highlighted the possible threats of public facial identification, people are constantly eager to learn and understand how the technology works.

Nowadays, leading tech companies like Google, Facebook, or Apple provide third-party software to help developers quickly build and iterate over products that use these technologies to disrupt the market and help shape a futuristic era. A clear example is Apple, which in recent months published a major update to its Vision API, the company's main framework for all things related to computer vision.

The Vision API includes features such as:

  • Native face detection API
  • Face tracking with ARKit
  • Text and barcode recognition
  • Image registration
  • Support for custom Core ML models for all sorts of imaging tasks

In this piece, we'll try to understand these technologies a bit better by walking through Apple's Core ML framework, the Vision API, and a hands-on face recognition project built with Swift and Turi Create.

Note: You can find the code for the whole project in my GitHub repo

Read also

Top Tools to Run a Computer Vision Project

A tour of Apple’s Core ML framework

Core ML is Apple’s machine learning framework that enables developers to deploy powerful ML models on-device by taking full advantage of a unified representation for all models. 

More specifically, Core ML is designed to deliver optimized on-device performance, allowing developers to pick from a wide variety of ML models and deploy them on Apple hardware, which already comes with dedicated Neural Engines and ML accelerators.
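As a quick illustration of how an app opts into that hardware, here is a minimal, hedged sketch showing how MLModelConfiguration lets you choose the compute units a model may use; the FaceClassifier class name is hypothetical, standing in for any Xcode-generated model class.

import CoreML

// Minimal sketch: picking the compute units a Core ML model may run on.
let config = MLModelConfiguration()
config.computeUnits = .all        // CPU, GPU, and the Neural Engine when available
// config.computeUnits = .cpuOnly // e.g. for debugging or reproducibility

// `FaceClassifier` is a hypothetical Xcode-generated model class:
// let model = try FaceClassifier(configuration: config)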

Core ML unified interface | Source: coremltools official documentation

How to carry out ML deployment on-device

Before exploring the new features in Core ML 3.0, I want to explain the different steps for exporting a trained model from PyTorch or TensorFlow to Core ML and eventually deploying it in an iOS app.

Core ML pipeline | Source: Apple Developer documentation

The Core ML documentation recommends using a Python package that eases the migration from third-party training libraries such as TensorFlow and PyTorch to the Core ML format.

With the coremltools package you can:

  • Easily convert the weights and the structure of trained models from third-party libraries
  • Optimize and hyper-tune Core ML models
  • Verify macOS conversion leveraging Catalyst and Core ML

Clearly, not all models are supported, but with each update they add support for more neural architectures, linear models, ensemble algorithms, etc. The currently supported libraries and frameworks, as listed on the official documentation website, are as follows:

Model category | Supported packages
Neural Networks | TensorFlow 1, TensorFlow 2, PyTorch (1.4.0+), Keras (2.0.4+), ONNX (1.6.0), Caffe
Ensemble Algorithms | XGBoost, scikit-learn
Generalized Linear Models | scikit-learn
SVMs | LIBSVM
Data pipelines (pre- and post-processing) | scikit-learn

Conversion example using Pytorch

To illustrate how easily you can take advantage of coremltools and convert a trained PyTorch model to the Core ML format, I'll present a simple hands-on example: converting a MobileNetV2 model from the torchvision library, traced with TorchScript via torch.jit.trace.

Note: The code for this example can be found on the coremltools official documentation page

Steps for model conversion:

  1. Load a pre-trained version of MobileNetV2 and set it to evaluation mode
  2. Generate the TorchScript object using the torch.jit.trace module
  3. Convert the TorchScript object to Core ML using coremltools

First, you’ll need to install the coremltools python package:

Use Anaconda as recommended by the official documentation

  • Create a conda virtual environment:
conda create --name coreml-env python=3.6
  • Activate your conda virtual environment:
conda activate coreml-env
  • Install coremltools from conda-forge
conda install -c conda-forge coremltools 

 Or use pip and virtualenv package:

  • Install virtualenv package:

sudo pip install virtualenv 

  • Create your environment:
virtualenv coreml-env
  • Activate the virtual env and install coremltools:
source coreml-env/bin/activate
pip install -U coremltools

Load a pre-trained version of MobileNetV2

Use the torchvision library to import a MobileNetV2 version trained on ImageNet.

import torch
import torchvision

mobile_net  = torchvision.models.mobilenet_v2(pretrained=True)

Set the model to evaluation mode:

mobile_net.eval()

Generate the TorchScript object using torch.jit.trace

The torch.jit.trace module takes an example input with the exact same tensor dimensions the model expects. Tracing correctly records only those functions and modules that are not dependent on the data (e.g., no conditionals on the data in tensors) and that have no untracked external dependencies (e.g., performing I/O or accessing global variables).

Trace the model with random data:

import torch

# Random input with tensor dimensions that match the input model 
input = torch.randn(1, 3, 224, 224)
mobile_net_traced = torch.jit.trace(mobile_net, input)

Download class labels from a separate file:

import urllib
label_url = 'https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt'
class_labels = urllib.request.urlopen(label_url).read().decode("utf-8").splitlines()

class_labels = class_labels[1:] # remove the first class which is background
assert len(class_labels) == 1000

Convert the TorchScript object to Core ML format using coremltools

The conversion to Core ML format is made possible thanks to the Unified Conversion API.

import coremltools as ct
# Convert to Core ML using the Unified Conversion API
model = ct.convert(
    mobile_net_traced,
    inputs=[ct.ImageType(name="traced_input", shape=input.shape)],
    classifier_config=ct.ClassifierConfig(class_labels)
)

model.save("MobileNetV2.mlmodel")

An MLModel encapsulates a Core ML model's prediction methods, configuration, and model description. As you saw, the coremltools package helps you convert trained models from a variety of training tools into Core ML models.
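On the app side, the exported .mlmodel file is handled through the same MLModel abstraction in the Core ML framework. Here is a minimal, hedged sketch that compiles the model and inspects its description; the file path is hypothetical, and in a real app the model is usually bundled so that Xcode generates a wrapper class for it.

import CoreML
import Foundation

// Minimal sketch: compile an .mlmodel and inspect it through the MLModel API.
let modelURL = URL(fileURLWithPath: "MobileNetV2.mlmodel")
let compiledURL = try MLModel.compileModel(at: modelURL)   // produces a .mlmodelc
let model = try MLModel(contentsOf: compiledURL)

// The model description lists the input/output features declared at conversion time
print(model.modelDescription.inputDescriptionsByName)   // e.g. "traced_input"
print(model.modelDescription.outputDescriptionsByName)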

Core ML internal ML tools

So far, we've seen how Core ML works and how easy it is to convert models from third-party libraries to the Core ML format. Now, let's take a look at how you can build, train, and deploy an ML model using Apple's internal AI ecosystem.

The two main tools that Apple integrates into its ML framework are:

  1. Turi Create 
  2. Create ML

Turi Create 

This should be your go-to if you want to quickly iterate on model implementations for tasks such as recommender systems, object detection, image segmentation, image similarity, or activity classification.

What is incredibly useful about Turi Create is that it already provides pre-trained models for each task that you can fine-tune with your custom datasets. Turi Create enables you to build and train your model using Python and then simply export it to Core ML for use in iOS, macOS, watchOS, and tvOS apps.

Machine Learning Task | Description
Recommender Systems | Personalize and customize user choices
Image Classification | Label and classify images
Drawing Classification | Recognize drawings and gestures
Sound Classification | Recognize and classify sounds
Object Detection | Classify and detect objects in a scene
Style Transfer | Stylize images and videos
Activity Classification | Detect and categorize an activity using sensors
Image Similarity | Find similarities between images
Classifiers | Predict labels
Regression | Predict numeric values
Clustering | Group similar data points in an unsupervised manner
Text Classification | Analyze sentiment

Create ML

Unlike Turi Create, Create ML enables users to build and train ML models without writing much code. Create ML, available on macOS, provides a graphical interface where you can drag and drop your training data and select the kind of model you want to train (speech recognition, image classification, object detection, etc.).

Create ML interface | Source: Introduction to Apple’s Core ML 3.0
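For completeness, the same functionality is also exposed programmatically through the CreateML framework on macOS (for example, in a Swift playground). The sketch below is a minimal, hedged example; the folder paths are hypothetical, and each subfolder name is used as a class label.

import CreateML
import Foundation

// Minimal sketch: train an image classifier with the CreateML framework (macOS only).
let trainingDir = URL(fileURLWithPath: "/Users/me/TrainingData")
let data = MLImageClassifier.DataSource.labeledDirectories(at: trainingDir)

let classifier = try MLImageClassifier(trainingData: data)

// Export the trained model so it can be used with Core ML in an iOS app
try classifier.write(to: URL(fileURLWithPath: "/Users/me/FaceClassifier.mlmodel"))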

New Features in Core ML 3.0

At the 2019 WWDC conference, Apple made a few interesting announcements about Core ML and the new features on board. I'll give you a quick summary of the new enhancements in case you missed them.

  1. On-device training 

The most exciting feature introduced in Core ML 3.0 by far is the possibility to train the deployed models directly on-device. Before that, we only had on-device inference, which basically means that we train the models on other machines and then utilize the trained model to make predictions on the device.  

With on-device training you can perform transfer learning or online learning where you get to tweak your existing model for improved performance and sustainability over time. 
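To give a rough idea of what this looks like in code, here is a minimal, hedged sketch built around MLUpdateTask (iOS 13+). It assumes the model was exported as updatable and that trainingBatch is an MLBatchProvider built elsewhere from freshly collected samples.

import CoreML

// Minimal sketch: retrain an updatable Core ML model on-device.
// `modelURL` must point to a compiled, updatable model (.mlmodelc).
func updateModel(at modelURL: URL, with trainingBatch: MLBatchProvider) throws {
    let task = try MLUpdateTask(
        forModelAt: modelURL,
        trainingData: trainingBatch,
        configuration: nil,
        completionHandler: { context in
            // The retrained model is available in context.model; persist it for later use
            let updatedURL = FileManager.default.temporaryDirectory
                .appendingPathComponent("UpdatedModel.mlmodelc")
            try? context.model.write(to: updatedURL)
        }
    )
    task.resume()
}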

  2. New types of neural network layers

These primarily focus on layers for intermediate operations such as masking, tensor manipulation, control flow, and Boolean logic.

New layers added to Core ML 3.0 | Source: Apple official documentation

Feel free to watch the WWDC 2019 video if you are curious about all updates.

Face recognition and the Apple Vision API

Apple's Vision framework aims to provide a high-level API that encompasses ready-to-use complex computer vision models. Its latest release in 2019 includes exciting features and improvements, showcasing once again that on-device machine learning models are a huge part of Apple's mobile arsenal and something the company clearly values a lot.

In Apple’s own words:

Vision is a new powerful and easy-to-use framework that provides solutions to computer vision challenges through a consistent interface. Understand how to use Vision to detect faces, compute facial landmarks, track objects and more.

The Vision API can be approached by looking at three main sections (a minimal end-to-end sketch follows these lists):

1. Request: You ask the framework to analyze the actual scene, and it gives you back any detected objects it finds. This is referred to as a request for analysis. Different kinds of requests are handled by multiple API classes:

  • VNDetectFaceRectanglesRequest: Human face detection
  • VNDetectBarcodesRequest: Barcode detection 
  • VNDetectTextRectanglesRequest: Visible text region within an image
  • VNCoreMLRequest: Request to use Core ML capabilities for image analysis
  • VNClassifyImageRequest: Request for image classification
  • VNDetectFaceLandmarksRequest: Request to analyze a human face and detect specific topological regions like nose, mouth, lips etc. Based on models trained with data containing computed face landmarks
  • VNTrackObjectRequest: Real-time object tracking inside a video scene.

2. Request Handler: Analyzes and performs the request you have triggered. It handles all the intermediate work that occurs between the moment you send a request and the moment it gets performed.

  • VNImageRequestHandler: Handles requests for image analysis
  • VNSequenceRequestHandler: Handles requests for real-time object tracking; it focuses on keeping track of the image sequences or frames generated when shooting a video, for example.

3. Observations: The results yielded back by the request are wrapped into Observation classes, each referring to the corresponding request type.

  • VNClassificationObservation: Classification information resulting from image analysis
  • VNFaceObservation: Information about a face detected in an image.
  • VNDetectedObjectObservation: For object detection.
  • VNCoreMLFeatureValueObservation: A collection of key-value information resulting from prediction of image analysis with a Core ML model.
  • VNHorizonObservation: Determines the horizon angle of the scene in an image.
  • VNImageAlignmentObservation: Detect the transformations needed to align the content of two images.
  • VNPixelBufferObservation: An output image resulting from the processing of embedded Core ML models.
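To make the request → handler → observation flow concrete, here is a minimal, hedged sketch that runs a face detection request on a still image; the cgImage parameter is assumed to come from elsewhere in your app (e.g., a UIImage's cgImage).

import Vision

// Minimal sketch: detect faces in a still image and print their bounding boxes.
func detectFaces(in cgImage: CGImage) {
    // 1. Request: what we want Vision to find
    let request = VNDetectFaceRectanglesRequest { request, _ in
        // 3. Observations: results come back wrapped in VNFaceObservation objects
        guard let faces = request.results as? [VNFaceObservation] else { return }
        for face in faces {
            // boundingBox is in normalized coordinates (origin at the bottom-left)
            print("Found a face at \(face.boundingBox)")
        }
    }

    // 2. Request handler: performs the request on the given image
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}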

Train a face recognition model with Turi Create

We'll train an image classifier to detect and recognize our own face, leveraging Turi Create's pre-trained version of ResNet-50. The idea is to perform some transfer learning with a curated facial dataset and then export the model to Core ML for on-device deployment.

Setup

To follow along, you will need Python 3.6 and Anaconda installed on your system.

Then we'll create a conda virtual environment and install Turi Create 5.0.

  • Create a conda virtual environment:
conda create --name face_recog python=3.6
  • Activate your conda environment:
conda activate face_recog
  • Install turicreate 5.0
pip install turicreate==5.0

Collect and segregate the training data

In order to train our classifier, we'll need some samples of our face and other samples that do not correspond to human faces, like images of animals, physical objects, etc. Ultimately, we'll need to create two data folders: one containing images of our face and one containing the remaining images.

To collect images of our face, we can just take photos of ourselves using the phone's front camera. We can get the other images from ImageNet or any similar provider.

Image collection | Source: Author

Data augmentation 

Data augmentation is helpful because the photo taken by the front-facing camera during scanning may have different lighting, exposure, orientation, cropping, etc., and we want to account for all these scenarios.

To augment our data, we'll rely on a very useful Python package called Augmentor, freely available on GitHub.

With Augmentor we can apply a wide range of random data augmentations such as rotations, zooms, shears, or crops. We'll create a data-processing function that will take care of all the transformations.

import Augmentor as augment

def data_processing(root_dir: str):
    data = augment.Pipeline(root_dir)
    data.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    data.zoom(probability=0.5, min_factor=1.1, max_factor=1.5)
    data.skew(probability=0.5, magnitude=0.5)
    data.shear(probability=0.5, max_shear_left=10, max_shear_right=10)
    data.crop_random(probability=0.5, percentage_area=0.9, randomise_percentage_area=True)
    data.sample(1500)

Augmentor will generate 1500 additional samples of our transformed facial data. 

Model training

We'll create a simple Python script in our virtual environment where we call Turi Create's pre-trained ResNet-50 model and train it with the data we've collected.

  1. Load the images from the training folder
  2. Create target labels from the folder names: aymane-face / not-aymane-face
  3. Fine-tune the model with the new data
  4. Export the trained model to a Core ML format.
import turicreate as tc
import os

data = tc.image_analysis.load_images('Training Data', with_path=True)

data['label'] = data['path'].apply(lambda path: os.path.basename(os.path.dirname(path)))

model = tc.image_classifier.create(data, target='label', model='resnet-50', max_iterations=100)

model.export_coreml('face_recognition.mlmodel')

The model will start training and will display the epoch results along the way.

Displaying training stats on the terminal

Might be useful

Check how you can monitor your model training live in Neptune.

Build the iOS application

We will be building a small iOS application that detects and recognizes my face in a front camera stream. The application will trigger the iPhone's front-facing camera and perform real-time face recognition using the Turi Create model we trained previously.

Open Xcode and create a single view application. The general UX for the application is fairly simple, with two ViewControllers:

  • Entry point ViewController that defines a minimalistic layout with a custom button to activate the front camera
  • A CameraViewController that manages the camera stream and performs real-time inference to recognize my face.

Set up the layout

Let’s get rid of the Main storyboard file, as I always prefer to code all my applications programmatically without relying on any XML at all.

  • Delete the main storyboard file, change the Info.plist file to remove the Storyboard Name, and edit the SceneDelegate file:
var window: UIWindow?
func scene(_ scene: UIScene, willConnectTo session: UISceneSession, options connectionOptions: UIScene.ConnectionOptions) {
    guard let windowScene = (scene as? UIWindowScene) else { return }
    window = UIWindow(frame: windowScene.coordinateSpace.bounds)
    window?.windowScene = windowScene

    window?.rootViewController = LayoutViewController()
    window?.makeKeyAndVisible()
}

Design the layout for the entry point LayoutViewController, centering the application's logo image at the top and placing the button that navigates to the CameraViewController slightly beneath it.

Application mockup | Source: Author
  • The logo image:
let logo: UIImageView = {
    let image = UIImageView(image: #imageLiteral(resourceName: "faceRecognition").resized(newSize: CGSize(width: screenWidth - 20, height: screenWidth - 20)))
    image.translatesAutoresizingMaskIntoConstraints = false
   return image
}()
  • Tap to recognize button:
let faceRecognitionButton: CustomButton = {
        let button = CustomButton()
        button.translatesAutoresizingMaskIntoConstraints = false
        button.addTarget(self, action: #selector(handleFaceRecognition), for: .touchUpInside)
        button.setTitle("Face recognition", for: .normal)
        let icon = UIImage(systemName: "crop")?.resized(newSize: CGSize(width: 50, height: 50))
        button.addRightImage(image: icon!, offset: 30)
        button.backgroundColor = .systemPurple
        button.layer.borderColor = UIColor.systemPurple.cgColor
        button.layer.shadowOpacity = 0.3
        button.layer.shadowColor = UIColor.systemPurple.cgColor

        return button
    }()
  • The ViewController Layout:
override func viewDidLoad() {
        super.viewDidLoad()
        view.backgroundColor = .systemBackground
        addButtonsToSubview()
        setupView()
    }

fileprivate func addButtonsToSubview() {
    view.addSubview(logo)
    view.addSubview(faceRecognitionButton)
}

fileprivate func setupView() {
    logo.centerXAnchor.constraint(equalTo:  self.view.centerXAnchor).isActive = true
    logo.topAnchor.constraint(equalTo: self.view.safeAreaLayoutGuide.topAnchor, constant: 20).isActive = true

    faceRecognitionButton.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
    faceRecognitionButton.widthAnchor.constraint(equalToConstant: view.frame.width - 40).isActive = true
    faceRecognitionButton.heightAnchor.constraint(equalToConstant: 60).isActive = true
    faceRecognitionButton.bottomAnchor.constraint(equalTo: openToUploadBtn.topAnchor, constant: -40).isActive = true
}
  • Handle Face Recognition method:
@objc func handleFaceRecognition() {

       let controller = FaceRecognitionViewController()

       let navController = UINavigationController(rootViewController: controller)

       self.present(navController, animated: true, completion: nil)
    }

The Face Recognition ViewController

This ViewController takes a live camera preview and triggers the model to perform real-time inference on each and every frame the camera stream yields. We should be extra careful when manipulating each video frame, because real-time inference can quickly overload the available resources and make the application crash due to memory pressure.

  • We’ll set up the camera:
var videoCapture: VideoCapture!
    let semaphore = DispatchSemaphore(value: 1)

    let videoPreview: UIView = {
       let view = UIView()
        view.translatesAutoresizingMaskIntoConstraints = false
        return view
    }()

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        self.videoCapture.start()
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        self.videoCapture.stop()
    }

    // MARK: - SetUp Camera preview
    func setUpCamera() {
        videoCapture = VideoCapture()
        videoCapture.delegate = self
        videoCapture.fps = 30
        videoCapture.setUp(sessionPreset: .vga640x480) { success in

            if success {
                if let previewLayer = self.videoCapture.previewLayer {
                    self.videoPreview.layer.addSublayer(previewLayer)
                    self.resizePreviewLayer()
                }
                self.videoCapture.start()
            }
        }
    }

In order to keep the number of frames per second steady, it is recommended during camera setup to lower the resolution and video quality to 30 FPS and 640×480.
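For reference, here is a minimal, hedged sketch of what a helper class like VideoCapture typically configures under the hood with AVFoundation; the actual implementation in the repo may differ.

import AVFoundation

// Minimal sketch: a capture session with a low resolution preset and a capped frame rate.
func makeCaptureSession() -> AVCaptureSession {
    let session = AVCaptureSession()
    session.sessionPreset = .vga640x480   // 640x480 keeps per-frame work small

    if let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                            for: .video,
                                            position: .front),
       let input = try? AVCaptureDeviceInput(device: device),
       session.canAddInput(input) {
        session.addInput(input)

        // Cap the camera at 30 FPS (valid only if the active format supports it)
        try? device.lockForConfiguration()
        device.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
        device.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
        device.unlockForConfiguration()
    }
    return session
}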

Instantiating the model 

We need to instantiate the Core ML model (face_recognition.mlmodel) we obtained previously and start making predictions. The idea is to trigger the model by feeding it the frames. The model is expected to return a MultiArray object that encapsulates the bounding box. The final steps will be to predict, parse the object, and draw a box around the face.

func initModel() {
    if let faceRecognitionModel = try? VNCoreMLModel(for: face_recognition().model) {
        self.visionModel = faceRecognitionModel
        request = VNCoreMLRequest(model: faceRecognitionModel, completionHandler: visionRequestDidComplete)
        request?.imageCropAndScaleOption = .scaleFill
    } else {
        fatalError("Failed to create the Vision model")
    }
}
  • Implement the VideoCaptureDelegate to launch model inference.
extension FaceRecognitionViewController: VideoCaptureDelegate {
    func videoCapture(_ capture: VideoCapture, didCaptureVideoFrame pixelBuffer: CVPixelBuffer?, timestamp: CMTime) {
        // the captured image from camera is contained on pixelBuffer
        if !self.isInferencing, let pixelBuffer = pixelBuffer {
            self.isInferencing = true
            // make predictions
            self.predictFaces(pixelBuffer: pixelBuffer)
        }
    }
}
  • Define the prediction function that performs inference on each frame.
extension FaceRecognitionViewController {
    func predictFaces(pixelBuffer: CVPixelBuffer) {
        guard let request = request else { fatalError() }
        // The Vision framework automatically resizes the input image
        // to match the model's expected input size
        self.semaphore.wait()
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
        try? handler.perform([request])
    }
}
  • Finally, in the post-processing phase, draw a box around each prediction.
extension FaceRecognitionViewController {
    func visionRequestDidComplete(request: VNRequest, error: Error?) {
        if let predictions = request.results as? [VNRecognizedObjectObservation] {
            DispatchQueue.main.async {
                self.BoundingBoxView.predictedObjects = predictions
                self.isInferencing = false
            }
        } else {
            self.isInferencing = false
        }
        self.semaphore.signal()
    }
}

Final output

Face recognition app

Conclusion

Apple's Vision API has opened up new possibilities for mobile developers wishing to integrate ML models into their applications. The whole library is designed to be very intuitive and easy to understand. You don't need a strong machine learning background to have fun with Core ML, and the variety of tools and features that come right out of the box is very encouraging.

Apple constantly improves its ML libraries, adding support for ever more architectures and ensuring seamless integration with its hardware.

You can always improve these models by either improving your dataset or creating your own network and converting it using coremltools.

References