How to Structure and Manage Natural Language Processing (NLP) Projects

30th August, 2023

If there is one thing I have learned working in the ML industry, it is this: machine learning projects are messy.

It is not that people don’t want to have things organized; it is just that many things are hard to structure and manage over the course of a project. 

You may start clean, but things get in the way. 

Some typical reasons are:

  • quick data explorations in notebooks, 
  • model code taken from a research repo on GitHub, 
  • new datasets added when everything was already set,
  • data quality issues are discovered and re-labeling of the data is needed,
  • someone on the team “just tried something quickly” and changed training parameters (passed via argparse) without telling anyone about it,
  • a push from the top to turn prototypes into production “just this once”.

Over the years working as a machine learning engineer I’ve learned a bunch of things that can help you stay on top of things and keep your NLP projects in check (as much as you can really have ML projects in check:)). 

In this post I will share key pointers, guidelines, tips and tricks that I learned while working on various data science projects. Many things can be valuable in any ML project but some are specific to NLP. 

Key points covered: 

  • Creating a good project directory structure
  • Dealing with changing data: Data Versioning
  • Keeping track of ML experiments
  • Proper evaluation and managing metrics and KPIs
  • Model Deployment: how to get it right

Let’s jump in.

Directory structure

A data science workflow consists of multiple elements:

  • Data, 
  • Models, 
  • Report, 
  • Training scripts, 
  • Hyperparameters, 
  • and so on. 

It is often beneficial to have a common framework that is consistent across teams, since you will most likely have multiple team members working on the same project. 

There are many ways to get started with structuring your Data Science project. You can even create a custom template with some specific requirements of your team. 

However, one of the easiest and quickest ways is to use a cookiecutter template. It automatically generates a comprehensive project directory for you.

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├──          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├──           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├──    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └──
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └──
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├──
│   │   └──
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └──
└── tox.ini            <- tox file with settings for running tox; see

As you can see, it covers almost every important component in your workflow – data, docs, models, reports, visualizations.
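If you would rather not depend on a template tool, the directory part of such a layout can be created with a few lines of Python. This is only a minimal sketch, not the cookiecutter tool itself: the function name `create_skeleton`, the `PROJECT_DIRS` list, and the use of `.gitkeep` placeholder files are my own assumptions, and it creates folders only, not the files listed in the tree.

```python
from pathlib import Path

# Directories taken from the tree above (folders only, no files).
PROJECT_DIRS = [
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "models",
    "notebooks",
    "references",
    "reports/figures",
    "src/data",
    "src/features",
    "src/models",
    "src/visualization",
]


def create_skeleton(root):
    """Create the project directories under `root` and return their paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        # An empty .gitkeep file lets Git track otherwise empty directories.
        (path / ".gitkeep").touch()
        created.append(path)
    return created
```

Calling `create_skeleton("my-nlp-project")` a second time does nothing harmful, so it can also be used to add missing folders to an existing project.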


Data versioning

Machine learning is an iterative process. If you have worked professionally as a data scientist, the biggest difference you notice is that the data is not as well-defined as it is in competition or research benchmark datasets. 

Research datasets are meant to be clean. In research, the objective is to build a better architecture. Better results in a research setting should be attributed to novel architectures and not clever data cleaning hacks.

When it comes to data used in production, you need to do much more than simply preprocess the data and remove non-Unicode characters. There are more serious issues like: 

  • Wrong or inaccurate annotations – A professional data scientist spends a huge amount of time understanding the data generation process, as it affects almost every further decision they make. One should know the answers to questions like:
    • Who annotates/annotated the data? 
    • Is there a separate team or is it annotated by the users while using the product? 
    • Do you need to have deep domain knowledge to successfully annotate the data? (as is the case with, for example, healthcare-related data)
  • Timeline of the data – Having 1 million rows of data is not useful if 900,000 of them were generated ages ago. In consumer products, user behavior changes constantly with trends or product modifications. A data scientist should ask questions like:
    • How frequently is the data generated? 
    • Are there any gaps in the data generation process (maybe the product feature that generated the data was taken down for a while)? 
    • How do I know that I am not modeling on data that reflects an old trend (for example, in fashion – apparel recommendation)?
  • Any biases in the data – Biases in the data can be of all types. A lot of them arise due to ill-defined data collection processes. A few of them are:
    • Sampling bias – The data collected does not represent the population. If the data has an ‘Age’ feature, bias may lead to overrepresentation of young people.
    • Measurement bias – One part of the data is measured using one instrument and the other part with a different instrument. This can happen in heavy industries where machinery is frequently replaced and repaired.
    • Biases in labels – Labels in a sentiment analysis task can be highly subjective. This also depends on whether the label is assigned by a dedicated annotation team or by the end user.

Consider a text classification task in NLP and suppose your product works on a global scale. You might collect user comments from all around the world. It would not be realistic to assume that user comments from India have a word distribution similar to that of comments from the United States or the UK, where the primary language is English. Here, you might want to keep a separate, region-wise version history.

How is any of this related to your Data Science workflow?

Quite often the data you start with is drastically different from the data you build your final model on. Every change you make to the data should be versioned, just like you version control your code with Git. You may want to take a look at Data Version Control (DVC) for that.
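To make the idea concrete, here is a minimal sketch of what "versioning a change in the data" can mean: fingerprint the file by its content and record that fingerprint in a small manifest that you commit alongside your code. This is not DVC's API or file format; the function names `file_fingerprint` and `record_version` and the JSON manifest layout are assumptions I made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_version(data_path, manifest_path, note=""):
    """Append the current fingerprint of `data_path` to a JSON manifest.

    Returns True if a new version was recorded, False if the content
    is identical to the latest recorded version.
    """
    manifest_path = Path(manifest_path)
    versions = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    fingerprint = file_fingerprint(data_path)
    if versions and versions[-1]["sha256"] == fingerprint:
        return False
    versions.append({"file": str(data_path), "sha256": fingerprint, "note": note})
    manifest_path.write_text(json.dumps(versions, indent=2))
    return True
```

The `note` field is a good place for the kind of context discussed above, for example "re-labeled neutral class" or "added comments from a new region". A dedicated tool also stores the data itself and lets you check out old versions, which this sketch does not attempt.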

Experiment tracking

Building models is sometimes interesting, but often it is actually pretty boring. Consider building an LSTM (Long Short-Term Memory network) for classification. There’s the learning rate, number of stacked layers, hidden dimension, embedding dimension, optimizer hyper-parameters, and many more to tune. Keeping track of everything can be overwhelming.

To save time, a good data scientist will try to form an intuition of which hyperparameter values work and which don’t. It is important to keep in mind the metric goals you have set. What key values might you want to track?

  • Hyper-parameters
  • Model size (for memory constraints)
  • Inference time
  • Gains over baseline
  • Pros and cons (for example, whether the model supports out-of-vocabulary words (like fastText) or not (like word2vec))
  • Any useful comment (for example: used a scheduler with a high initial learning rate; it worked better than using a constant learning rate)
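Even before you adopt a tracking tool, the values listed above can be captured in a simple structured record. The sketch below uses only the standard library and appends one JSON line per experiment; the class `ExperimentRecord`, its field names, and the JSON Lines file are hypothetical choices of mine, not the format of any particular tracker.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    name: str
    hyperparameters: dict            # e.g. {"lr": 0.005, "hidden_dim": 256}
    metrics: dict                    # e.g. {"f1": 0.81}
    baseline_metrics: dict           # same keys as `metrics`
    model_size_mb: float = 0.0
    inference_ms: float = 0.0
    notes: str = ""                  # pros, cons, and any useful comment
    timestamp: float = field(default_factory=time.time)

    def gains_over_baseline(self):
        """Absolute gain for every metric that the baseline also reports."""
        return {
            key: self.metrics[key] - self.baseline_metrics[key]
            for key in self.metrics
            if key in self.baseline_metrics
        }


def log_experiment(record, path):
    """Append one experiment to a JSON Lines file."""
    row = asdict(record)
    row["gains_over_baseline"] = record.gains_over_baseline()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def load_experiments(path):
    """Read all logged experiments back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Once experiments are logged this way, filtering them is a list comprehension, for example `[e for e in load_experiments(path) if e["hyperparameters"]["lr"] < 0.01]`.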

Often, it is tempting to try more and more experiments to squeeze every ounce of accuracy from the model. But in a business setting (as opposed to a Kaggle competition or a research paper), once the metrics are met, the experimentation should pause. 

Neptune’s simple API lets you track every detail of your experiment which can be analyzed effectively through its UI. I used Neptune for the first time and it took me a few minutes to start tracking my experiments. 

You can view your experiments, filter them by hyperparameter values and metrics, and even query them: “momentum = 0.9 and lr < 0.01”.

Neptune logs your .py scripts, makes interactive visualizations of your loss curves (or any curve in general), and even measures your CPU/GPU utilization.

Another great part is all of this becomes even more useful when you are working in teams. Sharing results and collaborating on ideas is surprisingly simple with Neptune.

And the best part: it has a free individual plan that allows users to store up to 100 GB with unlimited experiments (public or private) and unlimited notebook checkpoints. 

Examining model predictions (error analysis)

The next step includes a deep-dive error analysis. For example, in a sentiment analysis task (with three sentiments – positive, negative, neutral), asking the following questions would help:

  • Create a baseline: Creating a baseline before diving into experimentation is always a good idea. You don’t want your BERT model to perform only marginally better than a TF-IDF + logistic regression classifier. You want it to blow your baseline out of the water. Always compare your model with the baseline. Where does my baseline perform better than the complex model? Since baselines are generally interpretable, you might get insights into your black box model too.
  • Metrics analysis: What is the precision and recall for each class? Where are my misclassifications ‘leaking’ towards? If the majority of misclassifications for negative sentiment are predicted as neutral, your model is having trouble differentiating these two classes. An easy way to analyze this is to make a confusion matrix.
  • Low confidence predictions analysis: What do examples look like where the model is correct but the confidence of the classification is low? With three classes, the probability of the predicted class can be as low as about 0.33 (⅓):
    • If the model predicts correctly with 0.35, check those examples and see if they are really hard to identify. 
    • If the model predicts an obviously positive remark like ‘I am so happy for the good work I have done’ correctly with probability 0.35, something is fishy. 
  • Explanation frameworks: You may also look into frameworks like LIME or SHAP for explaining your model predictions.
  • Look at length vs metric score: If the sentences in your training data vary a lot in length, check if there is a correlation between the misclassification rate and length.
  • Check for biases: Are there any biases in the model? For example, if training on tweets, does the model behave differently towards racial remarks? A thorough inspection of the training data is needed in this case. Information on the internet contains hate speech, and the model shouldn’t learn such patterns. A real-life example is Tay, a Twitter bot developed by Microsoft that learned the patterns of tweets and started making racist remarks within just 24 hours.
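The metrics analysis and the low-confidence check above are easy to script. Here is a minimal, dependency-free sketch, assuming you have parallel lists of true labels, predicted labels, and the probability the model assigned to its predicted class. The function names and the 0.5 default threshold are assumptions of mine; in practice you would likely use a library such as scikit-learn for the confusion matrix.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall; 0.0 when the denominator is zero."""
    counts = confusion_counts(y_true, y_pred)
    true_positives = counts[(label, label)]
    predicted = sum(n for (t, p), n in counts.items() if p == label)
    actual = sum(n for (t, p), n in counts.items() if t == label)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def low_confidence_correct(texts, y_true, y_pred, confidences, threshold=0.5):
    """Examples the model classified correctly, but with low confidence."""
    return [
        (text, conf)
        for text, t, p, conf in zip(texts, y_true, y_pred, confidences)
        if t == p and conf < threshold
    ]
```

Looking at `confusion_counts(y_true, y_pred).most_common()` and skipping the pairs where both labels match shows directly where the misclassifications are leaking. The output of `low_confidence_correct` is the list of examples to read by hand.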

If your model is not performing well over the baseline, try to identify what could be the issue:

  • Is it because the quality or the quantity of labeled data is low? 
  • Do you have more labeled data or unlabeled data that you can annotate? There are a lot of open-source annotation tools available for text data annotation, like Doccano.
  • If you don’t have any data, can you use an off-the-shelf model or transfer learning?

Answering these critical questions requires you to analyze your experiments really carefully.

Evaluating an unsupervised NLP model

As a special case, let’s discuss how you would evaluate an unsupervised NLP model. Let’s consider a domain-specific language model. 

You already have a few metrics to measure the performance of language models. One of them is perplexity. However, most of the time, the purpose of language models is to learn a high-quality representation of the domain vocabulary. How do you measure if the quality of the representations is good?

One way is to use the embeddings for a downstream task like classification or Named Entity Recognition (NER). Check whether, with limited labeled data, a model built on these embeddings reaches the same level of performance as your LSTM trained from scratch.
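For reference, the perplexity mentioned above is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have the natural-log probability your language model assigned to each token of a held-out text (the function name `perplexity` and its input format are my own choices):

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    perplexity = exp(-(1/N) * sum(log p(token_i | context_i)))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)
```

Lower is better: a model that assigns probability 1 to every token has a perplexity of 1, and a model that spreads its probability uniformly over a vocabulary of V tokens has a perplexity of V.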

Model deployment

Although model deployment comes after the model is trained, there are a few points you need to think about right from the start. For example:

  • Do I need near-real-time inference? In some applications like ad targeting, ads need to be shown as soon as the user lands on the page. So the targeting and ranking algorithms need to work in real time.
  • Where would the model be hosted: cloud, on-premise, edge device, or browser? If you are hosting on-premise, a large chunk of building the infrastructure is on you. Cloud providers offer multiple services you can leverage for deployment. For example, AWS provides Elastic Kubernetes Service (EKS), serverless functions (Lambda), and SageMaker for creating model endpoints. It is also possible to add an auto-scaling policy to plain EC2 instances so that appropriate resources are provided when needed.
  • Is the model too big? If the model is large, you might want to look into post-training quantization. This reduces the numerical precision of the model parameters (for example, from 32-bit floats to 8-bit integers) to save computation time and reduce model size.
  • Do you want to deploy the model on a CPU or GPU server?
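To give a feel for what post-training quantization does to the parameters, here is the bare arithmetic of symmetric 8-bit quantization on a plain list of weights. It is an illustration only: real frameworks quantize per layer or per channel, often with calibration data, and the function names below are my own.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for all-zero weights.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]
```

Each weight is stored as a small integer plus one shared `scale`, which is where the size reduction comes from. The price is a rounding error of at most half a `scale` step per weight, so it is worth re-running your evaluation on the quantized model.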

Generally, it is not good practice to directly replace the existing model with the new one. You should perform A/B testing to sanity-check the new model against the existing one on real traffic. You may also want to take a look at other approaches like canary deployment or champion-challenger setups.
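A common building block for A/B tests and champion-challenger setups is a deterministic traffic split. The sketch below hashes a user id into a bucket so that a fixed share of users is served by the new model; the function name `assign_variant`, the `salt` value, and the 10% default share are assumptions for illustration, not a prescription.

```python
import hashlib


def assign_variant(user_id, challenger_share=0.1, salt="model-ab-test"):
    """Deterministically route a user to 'champion' or 'challenger'.

    Hashing the user id (instead of drawing a random number per request)
    keeps each user on the same model across requests.
    """
    if not 0.0 <= challenger_share <= 1.0:
        raise ValueError("challenger_share must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "challenger" if bucket < challenger_share else "champion"
```

Starting with a small `challenger_share` and increasing it as the metrics hold up is essentially a canary rollout. Changing the `salt` reshuffles users between experiments so the same people are not always in the test group.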


I hope you found new ideas for acing your next NLP project. 

To summarize, we started with why a good project structure matters and then walked through the main parts of a data science project: data versioning, experiment tracking, error analysis, and managing metrics. Lastly, we concluded with ideas around successful model deployment.

If you enjoyed this post, a great next step would be to start building your own NLP project structure with all the relevant tools.

Thanks and happy training!
