With over 3 years of experience in designing, building, and deploying computer vision (CV) models, I’ve realized people don’t focus enough on crucial aspects of building and deploying such complex systems.
In this blog post, I’ll share my own experiences and the hard-won insights I’ve gained from designing, building, and deploying cutting-edge CV models across various platforms like cloud, on-premise, and edge devices. We’ll dive deep into the essential lessons, tried-and-tested techniques, and real-world examples that will help you tackle the unique challenges you can expect to face as a Computer Vision Engineer.
Hopefully, by the end of this blog post, you will know a bit more about finding your way around computer vision projects.
Practical considerations for building CV models
Data pre-processing and augmentation
Data pre-processing and augmentation are essential steps to achieving high performance.
Preparing the data is a crucial step in the CV pipeline, as it can significantly impact your model’s performance. While resizing images, normalizing pixel values, and converting images to different formats are essential tasks, there are other, more nuanced considerations to keep in mind based on the specific problem at hand.
- Handling varying aspect ratios: resizing images to a fixed size might distort the aspect ratio and affect the model’s ability to recognize objects. In such cases, consider padding images or using techniques like random cropping during data augmentation to maintain the original aspect ratio while still providing input of consistent dimensions to the network.
- Domain-specific preprocessing: for certain tasks, domain-specific preprocessing can lead to better model performance. For example, in medical imaging, techniques like skull stripping and intensity normalization are often used to remove irrelevant background information and normalize tissue intensities across different scans, respectively.
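To make the aspect-ratio point concrete, here is a minimal letterbox-style resize sketch in NumPy. It uses nearest-neighbour resampling for brevity (in practice you would resize with OpenCV or Pillow), and the 114 padding value is just a grey-fill convention borrowed from YOLO-style pipelines:

```python
import numpy as np

def letterbox(img: np.ndarray, size: int, pad_value: int = 114) -> np.ndarray:
    """Resize an (H, W, C) image to fit inside a size x size square while
    preserving its aspect ratio, then pad the borders with `pad_value`."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize via index mapping (use cv2/PIL in real code).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Centre the resized image on a constant-colour canvas.
    out = np.full((size, size, img.shape[2]), pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```

If you letterbox like this for detection tasks, remember to apply the same scale and offset to the bounding-box coordinates.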
Data augmentation is essential for boosting the size and diversity of your dataset.
Over the years, I’ve refined my approach to augmentation, and here’s what I typically consider as my go-to strategy.
- Basic augmentations: I always start with simple techniques like rotation, flipping, and brightness/contrast adjustments. These methods are computationally inexpensive and often provide significant improvements in model generalization.
- Advanced augmentations: depending on the complexity of the task and the dataset’s diversity, I may opt for more advanced augmentation methods like MixUp and CutMix. These techniques combine multiple images or labels, encouraging the model to learn more robust features. I usually reserve these methods for cases where the dataset is limited or when basic augmentations don’t yield the desired improvements in performance.
While advanced augmentations can help improve model performance, obtaining a more diverse dataset is often the best approach. A diverse dataset better represents real-world conditions and provides a broader range of examples for the model to learn from. I usually prioritize acquiring diverse data, and if that’s not feasible, I then explore advanced augmentation techniques to make the most of the available data.
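As a concrete illustration of MixUp, here is a minimal NumPy sketch of the core idea: blending pairs of images and their one-hot labels with a Beta-sampled weight. Real implementations apply this per batch inside the training loop; the `alpha` value is just a commonly used default:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.2, rng=None):
    """Blend two images and their one-hot labels with a Beta-sampled weight."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

The soft labels produced this way are what encourages the model to behave linearly between training examples, which tends to improve robustness.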
Building accurate and efficient computer vision models
Building an accurate and efficient CV model involves several key considerations:
Selecting the right architecture
It is crucial to choose the appropriate model architecture for your specific task. Popular architectures include convolutional neural networks (CNNs), region-based convolutional networks (R-CNN), and YOLO (You Only Look Once). For instance, YOLO is an excellent choice for real-time object detection due to its speed and efficiency. It works well when you require a balance between detection accuracy and computational resources.
However, it may not always be the best choice when dealing with small objects or when high precision is required. In such cases, models like Faster R-CNN or RetinaNet may be more suitable, despite the slower processing time.
When starting a new object detection project, my usual baseline is to begin with a pre-trained model and fine-tune it on the target dataset. I typically consider YOLOv4 or YOLOv5 for their balance of speed and accuracy (I highly recommend Ultralytics’s repository for its quick setup and ease of use).
Fine-tuning allows for faster convergence and better performance, especially when the new dataset is similar to the one used for pre-training.
Optimizing hyperparameters is crucial for achieving optimal model performance. However, not everyone has access to large-scale infrastructure for conducting extensive hyperparameter searches. In such cases, you can still optimize hyperparameters effectively by combining practical experience, intuition, and a more hands-on approach.
When working with vision models, you typically need to optimize hyperparameters like learning rate, batch size, number of layers, and architecture-specific parameters. Here are some practical tips for optimizing these hyperparameters without relying on extensive searches:
- Learning rate: start with a common value, such as 1e-3 or 1e-4, and monitor the learning curve during training. If the model converges too slowly or exhibits erratic behavior, adjust the learning rate accordingly. I often employ learning rate schedulers like reducing the learning rate on plateau to improve convergence.
- Batch size: choose a batch size that maximizes GPU memory utilization without causing out-of-memory errors. Larger batch sizes can help with generalization but may require longer training times. If you encounter memory limitations, consider using gradient accumulation to simulate larger batch sizes.
- Number of layers and architecture-specific parameters: begin with a well-established architecture, like ResNet or EfficientNet, and fine-tune the model on your dataset. If you observe overfitting or underfitting, adjust the number of layers or other architecture-specific parameters. Keep in mind that adding more layers increases the model’s complexity and computational requirements.
- Regularization techniques: experiment with weight decay, dropout, and data augmentation to improve model generalization. These techniques can help prevent overfitting and improve the model’s performance on the validation set.
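To make the reduce-on-plateau idea above concrete, here is the scheduling logic in a few lines of plain Python. Frameworks ship this ready-made (e.g. PyTorch’s `torch.optim.lr_scheduler.ReduceLROnPlateau`), and the hyperparameter values below are only illustrative defaults:

```python
class ReduceLROnPlateau:
    """Minimal sketch of the reduce-on-plateau schedule: if the monitored
    validation loss fails to improve for `patience` epochs in a row,
    multiply the learning rate by `factor` (down to `min_lr`)."""

    def __init__(self, lr: float = 1e-3, factor: float = 0.1,
                 patience: int = 3, min_lr: float = 1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

Calling `step(val_loss)` once per epoch gives you the learning rate to use next; the same pattern extends naturally to warm-up or cosine schedules.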
- Managing data quality and quantity: this is crucial for training reliable CV models. In my experience, a systematic approach to curating, maintaining, and expanding datasets has been indispensable. Here’s an overview of my process and some of the tools I use:
- Data preprocessing and cleaning: begin by carefully examining your dataset to identify issues like duplicate images, mislabeled samples, and low-quality images. I highly recommend checking out fastdup to help you identify and manage wrong labels, outliers, bad quality/corrupted images, and more.
- Annotation and labeling: accurate annotations and labels are essential for supervised learning. I prefer using annotation tools like LabelMe, labelImg, or Roboflow for creating bounding boxes, masks, or keypoints. These tools offer a user-friendly interface and support various annotation formats that you can export.
- Data augmentation: to increase the diversity of the dataset and improve model generalization, I apply data augmentation techniques like rotation, flipping, scaling, and color jittering. Libraries like imgaug, albumentations, and torchvision.transforms provide a wide range of augmentation methods to choose from, making it easier to experiment and find the best set of augmentations for your specific task.
Model fine-tuning and Transfer Learning have become essential techniques in my workflow when working with CV models. Leveraging pre-trained models can save significant training time and improve performance, particularly when dealing with limited data.
Over the years, I’ve refined my approach to fine-tuning, and here are some key learnings:
- Layer freezing and learning rate scheduling: when fine-tuning, I often freeze the initial layers of the pre-trained model and only update the later layers to adapt the model to the specific task. However, depending on the similarity between the pre-trained model’s task and the target task, I may also employ differential learning rates, where the earlier layers have a smaller learning rate and the later layers have a higher one. This allows for fine-grained control over how much each layer updates during fine-tuning.
- Choosing a robust backbone: over time, I’ve found that ResNet and EfficientNet architectures have proven to be the most robust and adaptable backbones for various computer vision tasks. These architectures balance accuracy and computational efficiency, making them suitable for a wide range of applications.
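Both fine-tuning ideas above can be sketched in PyTorch. The model below is a small stand-in `Sequential` rather than a real pre-trained backbone, and the layer indices and learning rates are purely illustrative:

```python
import torch
from torch import nn

# Stand-in for a pre-trained backbone plus a freshly initialised task head.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),   # "early" pre-trained layers
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1),  # "later" pre-trained layers
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 5),                 # new head for the target task
)

# Option 1: freeze the early layers entirely.
for p in model[0].parameters():
    p.requires_grad = False

# Option 2: differential learning rates - a small LR for pre-trained
# layers, a larger one for the new head (other layers omitted for brevity).
optimizer = torch.optim.AdamW([
    {"params": model[2].parameters(), "lr": 1e-4},
    {"params": model[5].parameters(), "lr": 1e-3},
])
```

With a real backbone (e.g. a torchvision ResNet), you would freeze or group parameters by named children instead of positional indices, but the mechanics are the same.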
Choosing the right computer vision model
Throughout my experience, I have worked on a wide range of applications for CV models. Some of the most notable ones include the following.
Facial recognition and analysis
Used in security systems and smartphone unlocking, facial recognition models have come a long way in terms of accuracy and efficiency. While convolutional neural networks (CNNs) are commonly used in smaller-scale facial recognition systems, scaling to a larger number of faces requires a more sophisticated approach.
Instead of using a standard classification CNN, I found that employing deep metric learning techniques, such as triplet loss, enables models to learn more discriminative feature representations of faces. These embeddings are often combined with vector databases (e.g., Elasticsearch, Pinecone) to enable more efficient indexing and retrieval.
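For intuition, the triplet loss at the heart of this approach can be sketched in a few lines of NumPy. Production systems would use a framework implementation (such as PyTorch’s `TripletMarginLoss`) together with hard-negative mining; the margin value below is just a common default:

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 1.0) -> float:
    """mean over the batch of max(0, ||a - p|| - ||a - n|| + margin):
    pull same-identity embeddings together, push different ones apart."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

The loss is zero once each negative is at least `margin` farther from the anchor than the positive, which is exactly the geometry you want for nearest-neighbour retrieval in a vector database.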
Object detection models are commonly used in retail, manufacturing, and transportation industries to identify and track objects within images and videos. Examples include detecting products on store shelves, identifying defects in manufacturing, and tracking vehicles on the road.
Recent advances in real-time object detection, such as single-shot multi-box detectors (SSD) and YOLO (You Only Look Once), have made it possible to deploy these models in time-sensitive applications, such as robotics and autonomous vehicles.
Here are a few knowledge nuggets from my side on this topic:
- In certain scenarios, it may be beneficial to reformat the problem as a classification or segmentation task. For instance, cropping regions of interest from images and processing them separately can lead to better results and computational efficiency, especially when dealing with high-resolution images or complex scenes. Here’s a real-world example:
- You’re working on a quality control process for a manufacturing assembly line that assembles printed circuit boards. The goal is to inspect the assembled PCBs for any defects or misplaced components automatically. A high-resolution camera captures images of the PCBs, resulting in large images with small components scattered across the board.
- Using an object detection model on the entire high-resolution image may be computationally expensive and less accurate due to the small size of the components relative to the entire image. In this scenario, reformatting the problem can lead to better results and computational efficiency, for example by first segmenting the regions of interest and then running detection on the crops.
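A tiling approach along these lines might be sketched as follows. The tile size and overlap are hypothetical; per-tile detections would then be shifted back using the stored offsets and merged (e.g. with non-maximum suppression):

```python
import numpy as np

def tile_image(img: np.ndarray, tile_size: int, overlap: int = 0):
    """Split an (H, W, C) image into square tiles plus their top-left
    offsets, so a detector can run on each crop at native resolution.
    Edge tiles may be smaller than `tile_size`."""
    assert 0 <= overlap < tile_size
    step = tile_size - overlap
    h, w = img.shape[:2]
    tiles = []
    for top in range(0, max(h - overlap, 1), step):
        for left in range(0, max(w - overlap, 1), step):
            crop = img[top:top + tile_size, left:left + tile_size]
            tiles.append(((top, left), crop))
    return tiles
```

An overlap of roughly one object diameter avoids losing components that straddle tile boundaries.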
Practical considerations for CV model deployment
Deployment options: cloud, on-premise, and edge
Each deployment option has its benefits and drawbacks, and the choice will highly depend on your project requirements. Here are the most popular ones.
Cloud deployment has been a game-changer for deploying computer vision models, offering flexibility, scalability, and ease of maintenance.
Over the past three years, I’ve learned valuable lessons and refined my approach to cloud deployment:
- Default stack: my go-to stack for cloud deployment typically consists of TensorFlow or PyTorch for model development, Docker for containerization, and sometimes Kubernetes for orchestration. I also leverage built-in cloud services to handle infrastructure, automatic scaling, monitoring, and more.
- Common pitfalls and how to avoid them:
- Underestimating resource usage: when deploying to the cloud, it’s crucial to properly estimate the required resources (CPU, GPU, memory, etc.) to prevent performance bottlenecks. Monitor your application and use auto-scaling features provided by cloud platforms to adjust resources as needed.
- Cost management: keeping track of cloud expenses is crucial to avoid unexpected costs. Set up cost monitoring and alerts, use spot instances when possible, and optimize resource allocation to minimize costs.
But here’s my biggest learning: embrace the managed services provided by cloud platforms. They can save a significant amount of time and effort by handling tasks such as model deployment, scaling, monitoring, and updating. This allows you to focus on improving your model and application rather than managing infrastructure.
On-premise solutions provide increased control over data security and reduced latency but may require more resources for setup and maintenance.
This option is ideal for organizations with strict security policies or those dealing with sensitive data (like medical imaging or records) that cannot be stored or processed in the cloud. So if you have such prerequisites around your data, on-premise deployment may be the way to go for you.
Deploying models on edge devices, such as smartphones or IoT devices, allows for low-latency processing and reduced data transmission costs. Edge deployment can be particularly useful in scenarios where real-time processing is essential, such as autonomous vehicles or robotics.
However, edge deployment may impose limitations on available computational resources and model size, necessitating the use of model optimization techniques to fit within these constraints.
In my experience, moving from a cloud-trained model to an edge-ready model often involves several optimization steps:
- Model pruning: this technique involves removing less important neurons or weights from the neural network to reduce its size and complexity. Pruning can significantly improve inference speed and reduce memory requirements without compromising performance.
- Quantization: quantizing the model’s weights and activations can reduce memory usage and computational requirements by converting floating-point weights to lower-precision formats, such as int8 or int16. Techniques like post-training quantization or quantization-aware training can help maintain model accuracy while reducing its size and computational complexity.
- Knowledge distillation: a compression technique that makes it possible to train a small model by transferring knowledge from a bigger, more complex model. In this regard, make sure to check out my hands-on guide.
- Model architecture: selecting an efficient model architecture specifically designed for edge devices, such as MobileNet or SqueezeNet, can improve performance while minimizing resource consumption.
- Hardware-specific optimization: optimize your model for the specific hardware it will be deployed on, such as using libraries like TensorFlow Lite or Core ML, which are designed for edge devices like smartphones and IoT devices.
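To demystify the quantization step above, here is a NumPy sketch of symmetric per-tensor int8 quantization, the same basic arithmetic that TensorFlow Lite-style dynamic-range quantization applies to weights. Real toolchains also handle zero-points, per-channel scales, and calibration data:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale
```

The round trip loses at most about half a quantization step per weight, which is why accuracy usually survives quantization; quantization-aware training helps when it does not.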
Ensuring scalability, security, and performance
When deploying computer vision models, it is essential to consider the following factors.
Ensuring that your deployment solution can handle increasing workloads and user demands is crucial for maintaining system performance and reliability.
Throughout my experience, I have identified several key factors that contribute to successful scalability in CV model deployment.
- Load balancing: distributing the workload across multiple servers or instances can help prevent bottlenecks and maintain system responsiveness. In one of my computer vision projects, implementing a load balancer to distribute incoming requests to multiple instances of the deployed model significantly improved performance during peak usage times.
- Auto-scaling: cloud providers often offer auto-scaling features that automatically adjust resources based on demand. By configuring auto-scaling rules, you can ensure optimal performance and cost efficiency. In one of my cloud deployments, setting up auto-scaling based on predefined metrics helped maintain smooth performance during periods of fluctuating demand without the need for manual intervention.
Safeguarding sensitive data and complying with industry regulations is a top priority when deploying computer vision models.
Based on my experience, I have developed a default stack and checklist to ensure the security of the deployed systems.
- Encryption: implement encryption both at rest and in transit to protect sensitive data. My go-to solution for encryption at rest is using AES-256, while for data in transit, I typically rely on HTTPS/TLS.
- Access controls: set up role-based access controls (RBAC) to restrict access to your system based on user roles and permissions. This ensures that only authorized personnel can access, modify, or manage the deployed models and associated data.
- Federated learning (when applicable): in situations where data privacy is of utmost concern, I consider implementing federated learning. This approach enables models to learn from decentralized data without transferring it to a central server, protecting user privacy.
- Secure model storage: store your trained models securely, using a private container registry or encrypted storage, to prevent unauthorized access or tampering.
Optimizing model performance is crucial to ensure that your computer vision models deliver efficient and accurate results. To achieve this, I’ve learned to focus on several key aspects, including reducing latency, increasing throughput, and minimizing resource usage.
Besides the learnings I’ve shared above, here are some performance-related learnings I’ve gathered over the years:
- Hardware acceleration: utilize hardware-specific optimizations to maximize performance. For instance, TensorRT can be used to optimize TensorFlow models for deployment on NVIDIA GPUs, while OpenVINO can be employed for Intel hardware. Additionally, consider using dedicated AI accelerators like Google’s Edge TPU or Apple’s Neural Engine for edge deployments.
- Batch processing: increase throughput by processing multiple inputs simultaneously, leveraging the parallel processing capabilities of modern GPUs. However, keep in mind that larger batch sizes may require more memory, so find a balance that works best for your hardware and application requirements.
- Profiling and monitoring: continuously profile and monitor your model’s performance to identify bottlenecks and optimize the system accordingly. Use profiling tools like TensorFlow Profiler to gain insights into your model’s execution and identify areas for improvement.
Model conversion, deployment setup, testing, and maintenance
Successfully deploying a computer vision model involves several key steps.
Converting your trained model into a format suitable for your chosen deployment platform is essential for ensuring compatibility and efficiency. Over the years, I’ve worked with various formats, such as TensorFlow Lite, ONNX, and Core ML. My preferred format depends on the target hardware and deployment scenario.
Here’s a brief overview of when I choose each format:
- TensorFlow Lite: this is my go-to format when deploying models on edge devices, especially Android smartphones or IoT devices. TensorFlow Lite is optimized for resource-constrained environments and offers good compatibility with a wide range of hardware, including CPUs, GPUs, and TPUs.
- ONNX: when working with different deep learning frameworks like PyTorch or TensorFlow, I often choose the Open Neural Network Exchange (ONNX) format. ONNX provides a seamless way to transfer models between frameworks and is supported by various runtime libraries like ONNX Runtime, which ensures efficient execution across multiple platforms.
- Core ML: for deploying models on Apple devices like iPhones, iPads, or Macs, I prefer using the Core ML format. Core ML is specifically designed for Apple hardware and leverages the power of the Apple Neural Engine.
Ultimately, my choice of model format depends on the target hardware, the deployment scenario and the specific requirements of the application.
Configuring your deployment environment is crucial for smooth operation, and it includes setting up the necessary hardware, software, and network settings.
Over the years, I’ve experimented with various tools and technologies to streamline the process, and here’s the stack I currently prefer:
- Docker: I rely on Docker for containerization, as it helps me package my model and its dependencies into a portable, self-contained unit. This simplifies deployment, reduces potential conflicts, and ensures consistent performance across different platforms.
- FastAPI: for creating a lightweight, high-performance REST API to serve my models, I use FastAPI. It’s easy to work with, supports asynchronous programming, and offers built-in validation and documentation features.
- Built-in cloud tools: I use these for things like monitoring and CI/CD. Depending on the specific requirements of the CV project, I also consider more specialized tools like Seldon or BentoML for model serving and management. However, the stack mentioned above has proven to be robust and flexible.
Thorough testing in the deployment environment is crucial to ensure your model performs as expected under various conditions, such as varying loads and data inputs.
Over the years, I’ve developed a systematic approach to computer vision testing and managing my models in production:
- Test suites: I create comprehensive test suites that cover different aspects of the deployment, including functionality, performance, and stress tests. These test suites are designed to verify the model’s behavior with diverse data inputs, validate its response times, and ensure it can handle high-load scenarios. I use tools like pytest for writing and managing my test cases, and I integrate them into my Continuous Integration (CI) pipeline to have them run automatically.
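For a flavour of what such tests look like, here is a pytest-style check for a hypothetical postprocessing step (the function name, threshold, and cases are illustrative, not from a real project):

```python
import numpy as np

def postprocess(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hypothetical postprocessing: sigmoid, then threshold to 0/1 labels."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs >= threshold).astype(int)

# pytest discovers functions named test_*; run with `pytest`.
def test_postprocess_handles_extreme_logits():
    out = postprocess(np.array([-100.0, 0.0, 100.0]))
    assert out.tolist() == [0, 1, 1]  # logit 0.0 sits exactly on the threshold

def test_postprocess_empty_input():
    assert postprocess(np.array([])).size == 0  # edge case: no detections
```

Keeping such tests fast and dependency-light means they can run on every commit, while heavier load and stress tests run on a schedule.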
Some errors to avoid, which I learned from past experiences, include:
- Insufficient testing coverage: make sure to cover all relevant test scenarios, including edge cases, to catch potential issues before they affect users.
- Ignoring performance metrics: track and analyze key performance metrics to identify bottlenecks and optimize your deployment. Monitor anything you think might help you identify issues.
- Deploying changes without a rollback strategy: always have a rollback strategy in place to quickly revert to the previous version in case of unexpected issues.
- Tip: when rolling out updates or changes to my models, I employ canary deployments to gradually introduce the new version to a small percentage of users.
Regularly monitor your model’s performance, update it with new data, and address any emerging issues or bugs. Establish a monitoring and logging system to track model performance metrics, such as accuracy, latency, and resource utilization. Additionally, implement a robust alerting mechanism to notify relevant stakeholders in case of performance degradation or unexpected issues.
Here are some of the tools I often use:
- TensorBoard: a tool specifically designed for TensorFlow, TensorBoard enables you to visualize and monitor various aspects of your models during training and deployment. TensorBoard can help you analyze model performance, visualize network architecture, and track custom metrics related to your CV tasks.
- ELK Stack (Elasticsearch, Logstash, Kibana): the ELK Stack is a popular log management and analytics solution that can be used to collect, store, and analyze logs from your CV models and deployment environment. Kibana, the visualization component of the stack, allows you to create custom dashboards for monitoring and troubleshooting.
- Built-in cloud tools: for example, AWS CloudWatch, a monitoring service provided by Amazon that allows you to collect, visualize, and analyze metrics and logs from your applications and infrastructure.
Continuous learning and improvement
Your job is not finished once your CV model is deployed; in fact, in many ways, it has just begun.
Staying current and continuously improving your models requires a commitment to the following practices:
- Monitoring for model drift: continuously monitor your model’s performance and retrain it with fresh data to account for changes in the underlying data distribution. Employ techniques like online learning, which allows the model to learn incrementally from new data without retraining from scratch, or ensemble learning, where multiple models are combined to increase robustness against drift.
- Testing and validation: rigorously test your models using various validation techniques, such as cross-validation and holdout sets, to ensure their reliability and robustness. Employ model explainability tools, like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), to gain insights into model predictions and identify potential biases or weaknesses.
- Keeping up with the latest research: stay informed about the latest developments in computer vision research and incorporate relevant findings into your models. Regularly attend conferences, read research papers, and engage with the computer vision community to stay abreast of new techniques and best practices. Here are some of my favorite resources:
- neptune.ai’s blog: full of very valuable resources, both for theoretical and hands-on concepts.
- neptune.ai’s case studies: knowledge base of practical use cases.
- towardsdatascience.com: always full of comprehensive how-to guides.
- lastly, big tech blogs: whether it’s Meta, Google, DeepMind, or NVIDIA, it’s always good to know what’s going on at these companies.
As computer vision continues to advance and impact various industries and applications, staying up to date with best practices, research, and industry standards is essential for success. Sharing our experiences helps us all contribute to the growth and development of this exciting field.
In this blog post, I delved deeper into the practical knowledge and lessons learned from building and deploying CV models over the years. By evaluating the pros and cons of different architectures and deployment options, understanding trade-offs, and applying the best practices discussed here, I hope you will be able to successfully navigate the challenges and maximize the rewards of this technology.