MLOps Blog

Knowledge Graphs With Machine Learning [Guide]

11 min
Aravind CR
1st December, 2022

You need to get some information online. For example, a few paragraphs about Usain Bolt. You can copy and paste the information from Wikipedia; it won’t be much work.

But what if you needed to get information about all competitions that Usain Bolt had taken part in, and all related stats about him and his competitors? And then what if you wanted to do that for all sports, not just running?

Machine learning engineers often need to build complex datasets like the example above to train their models. Web scraping is a very useful method to collect the necessary data, but it comes with some challenges.

In this article, I’m going to explain how to scrape publicly available data and build knowledge graphs from scraped data, along with some key concepts from Natural Language Processing (NLP).

What is web scraping?

Web scraping (or web harvesting) is the automated extraction of data from websites. The term typically refers to collecting data with a bot or web crawler. It’s a form of copying in which specific data is gathered and copied from the web, typically into a local database or spreadsheet, for later use or analysis.

You can do web scraping with online services, APIs, or you can write your own code that will do it. 

There are two key elements to web scraping:

  • Crawler: The crawler is an algorithm that browses the web to search for particular data by exploring links across the internet.
  • Scraper: The scraper extracts data from websites. The design of scrapers can vary a lot. It depends on the complexity and scope of the project. Ultimately it has to quickly and accurately extract the data.

A good example of a ready-made library is the Wikipedia scraper library. It does a lot of the heavy lifting for you: you provide URLs pointing to the required data, it loads all the HTML from those pages, and the scraper extracts the data you need from this HTML code and outputs it in your chosen format, such as an Excel spreadsheet, CSV, or JSON.
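To make the scraper side concrete, here is a minimal sketch using only Python’s standard library (the HTML snippet and URLs are made up for illustration): it parses HTML and pulls out link URLs, the kind of step a crawler performs before following links across pages.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = ('<p><a href="https://en.wikipedia.org/wiki/Usain_Bolt">Usain Bolt</a> '
        '<a href="https://en.wikipedia.org/wiki/Sprint_running">Sprinting</a></p>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

A real crawler would fetch the HTML over the network and recurse over the collected links; libraries like the Wikipedia wrapper used below hide all of this behind a simple API.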

Knowledge graph

The amount of content available on the web is incredible already, and it’s expanding at an increasingly fast rate. Billions of websites are linked with the World Wide Web, and search engines can go through those links and serve useful information with great precision and speed. This is in part thanks to knowledge graphs.

Different organizations have different knowledge graphs. For example, the Google Knowledge Graph is a knowledge base used by Google and its services to enhance search engine results with information gathered from a variety of sources. Similar techniques are used by Facebook and Amazon to improve user experience, and to store and retrieve useful information.

There’s no formal definition of a knowledge graph (KG). Broadly speaking, a KG is a kind of semantic network with added constraints. Its scope, structure, characteristics, and even its uses are still evolving as the field develops.

Bringing knowledge graphs and machine learning (ML) together can systematically improve the accuracy of systems and extend the range of machine learning capabilities. Thanks to knowledge graphs, results inferred from machine learning models will have better explainability and trustworthiness.

Bringing knowledge graphs and ML together creates some interesting opportunities. In cases where we might have insufficient data, KGs can be used to augment training data. One of the major challenges in ML models is explaining predictions made by ML systems. Knowledge graphs can help overcome this issue by mapping explanations to proper nodes in the graph and summarizing the decision-making process.


Another way to look at it is that a knowledge graph stores data that resulted from an information extraction task. Many implementations of KG make use of a concept called triplet — a set of three items (a subject, a predicate, and an object) that we can use to store information about something. 
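To make the triple idea concrete, here is a small plain-Python sketch (the facts about Usain Bolt are illustrative) that stores knowledge as (subject, predicate, object) triples and answers a simple query by scanning them:

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("Usain Bolt", "competed_in", "100 m sprint"),
    ("Usain Bolt", "holds_record", "9.58 s"),
    ("100 m sprint", "is_a", "athletics event"),
]

def objects_of(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("Usain Bolt", "competed_in"))  # ['100 m sprint']
```

Real triple stores index subjects, predicates, and objects for fast lookup, but the data model is exactly this.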

Yet another explanation: knowledge graphs are a data science tool that deals with interconnected entities (organizations, people, events, places). Entities are nodes connected via edges. KGs have entity pairs that can be traversed to uncover meaningful connections in unstructured data.

[Figure: Node A connected to Node B by a labeled edge]

Node A and Node B are 2 different entities. These nodes are connected by an edge that represents their relationship. This is the smallest KG we can build – also known as a triple. Knowledge graphs come in a variety of shapes and sizes.

Web scraping, computational linguistics, NLP algorithms, and graph theory (with Python code)

Phew, that’s a wordy heading. Anyway, to build knowledge graphs from text, it’s important to help our machine understand natural language. We do this with NLP techniques such as sentence segmentation, dependency parsing, parts-of-speech (POS) tagging, and entity recognition.

The first step to build a KG is to collect your sources — let’s crawl the web for some information. Wikipedia will be our source (always check the sources of data, a lot of information online is false).

For this blog, we’ll be using the Wikipedia API through a direct Python wrapper, together with Neptune to manage model-building metadata in a single place: log, store, display, organize, compare, and query all your MLOps metadata.

Installation and setup

Install dependencies and scrape data

!pip install wikipedia-api neptune-client neptune-notebooks pandas spacy networkx scipy

Might be useful

Follow these links to install and set up Neptune in your notebook:
  • Getting started with Neptune
  • Neptune Jupyter extension guide

The below function searches Wikipedia for a given topic and extracts information from the target page and its internal links.

import wikipediaapi  # pip install wikipedia-api
import pandas as pd
import concurrent.futures
from tqdm import tqdm

The below function lets you fetch the articles based on the topic you provide as an input to the function.

def scrape_wikipedia(name_topic, verbose=True):
    def link_to_wikipedia(link):
        page = api_wikipedia.page(link)
        if page.exists():
            return {'page': link, 'text': page.text, 'link': page.fullurl, 'categories': list(page.categories.keys())}
        return None

    api_wikipedia = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
    name_of_page = api_wikipedia.page(name_topic)
    if not name_of_page.exists():
        print('Page {} is not present'.format(name_topic))
        return

    links_to_page = list(name_of_page.links.keys())
    progress = tqdm(desc='Scraped links', unit='', total=len(links_to_page)) if verbose else None
    origin = [{'page': name_topic, 'text': name_of_page.text, 'link': name_of_page.fullurl, 'categories': list(name_of_page.categories.keys())}]

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        links_future = {executor.submit(link_to_wikipedia, link): link for link in links_to_page}
        for future in concurrent.futures.as_completed(links_future):
            info = future.result()
            if info:
                origin.append(info)
            if verbose:
                progress.update(1)
    if verbose:
        progress.close()

    # drop pages from non-article namespaces and very short stubs
    namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki',
                  'Template', 'Help', 'User', 'Category talk', 'Portal talk')
    origin = pd.DataFrame(origin)
    origin = origin[(origin['text'].str.len() > 20)
                    & ~(origin['page'].str.startswith(namespaces, na=True))]
    # strip the 'Category:' prefix from category names
    origin['categories'] = origin.categories.apply(lambda a: [b[9:] for b in a])

    origin['topic'] = name_topic
    print('Scraped pages:', len(origin))

    return origin

Let’s test the function on the topic “COVID 19”.

wiki_data = scrape_wikipedia('COVID 19')

Output:
Scraped links: 100%|██████████| 1965/1965 [04:30<00:00, 7.25it/s]
Scraped pages: 1749

Save the data to CSV:

wiki_data.to_csv('scraped_data.csv', index=False)
Import libraries:

import spacy
import pandas as pd
import requests
from spacy import displacy
# import en_core_web_sm

nlp = spacy.load('en_core_web_sm')

from spacy.tokens import Span
from spacy.matcher import Matcher

import matplotlib.pyplot as plot
from tqdm import tqdm
import networkx as ntx
import neptune.new as neptune

%matplotlib inline
run = neptune.init(api_token="your API key",
                   project="your-workspace/your-project")

Upload the data to Neptune:

run["data/scraped_data"].upload('scraped_data.csv')

Download the data here. Also available on Neptune:

data = pd.read_csv('scraped_data.csv')

View the text at row 10:

data['text'][10]

The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure
developed by the UK Rapid Test Consortium and manufactured by Abingdon
Health. It uses a lateral flow test to determine whether a person has IgG
antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
drop of blood obtained from a finger prick and yields results in 20 minutes.

Sentence segmentation

The first step of building a knowledge graph is to split the text document or article into sentences. Then we limit our examples to simple sentences with one subject and one object.
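spaCy handles segmentation robustly; just to illustrate the idea (and why it’s harder than it looks), here is a naive stdlib sketch that splits on sentence-ending punctuation, using a shortened version of the article text:

```python
import re

text = ("The AbC-19 rapid antibody test is an immunological test for COVID-19 "
        "exposure. It uses a lateral flow test. The test yields results in 20 minutes.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text.strip())
print(sentences)
```

Abbreviations like “Dr.” or decimal numbers break this naive rule, which is exactly why NLP libraries use trained models for sentence boundaries.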

# Let's take part of the above extracted article
docu = nlp('''The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by
the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine
whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
drop of blood obtained from a finger prick and yields results in 20 minutes.\n\nSee also\nCOVID-19 rapid
antigen test''')

for tokn in docu:
    print(tokn.text, "---", tokn.dep_)

Download the pre-trained SpaCy model as shown below:

python -m spacy download en_core_web_sm

The SpaCy pipeline assigns word vectors, context-specific token vectors, part-of-speech tags, a dependency parse, and named entities. By extending SpaCy’s pipeline of annotations, you can also resolve coreferences (explained in the coreference resolution section below).

Knowledge graphs can be automatically constructed from parts-of-speech and dependency parsing. Extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library SpaCy.

The following function defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other approximations can be used to produce different types of connections. This kind of connection can be referred to as subject-predicate-object triple.

The main idea is to go through a sentence and extract the subject and the object as and when they are encountered. The function below implements these steps.

Entity extraction

You can extract a single word entity from a sentence with the help of parts-of-speech (POS) tags. The nouns and proper nouns will be the entities. 

However, when an entity spans multiple words, POS tags alone aren’t sufficient. We need to parse the dependency tree of the sentence. To build a knowledge graph, the most important things are the nodes and edges between them. 

These nodes are going to be entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities. We will extract these elements in an unsupervised manner, i.e. we’ll use the grammar of the sentences.

The idea is to go through a sentence and extract the subject and the object as and when they are encountered.

def extract_entities(sents):
    # chunk one
    enti_one = ""
    enti_two = ""

    dep_prev_token = ""  # dependency tag of previous token in sentence
    txt_prev_token = ""  # previous token in sentence

    prefix = ""
    modifier = ""

    for tokn in nlp(sents):
        # chunk two
        # move to next token if token is punctuation
        if tokn.dep_ != "punct":
            # check if token is a compound word or not
            if tokn.dep_ == "compound":
                prefix = tokn.text
                # add the current word to it if the previous word is also 'compound'
                if dep_prev_token == "compound":
                    prefix = txt_prev_token + " " + tokn.text

            # verify if token is a modifier or not
            if tokn.dep_.endswith("mod"):
                modifier = tokn.text
                # add the current word to it if the previous word is 'compound'
                if dep_prev_token == "compound":
                    modifier = txt_prev_token + " " + tokn.text

            # chunk three: token is the subject
            if "subj" in tokn.dep_:
                enti_one = modifier + " " + prefix + " " + tokn.text
                prefix = ""
                modifier = ""
                dep_prev_token = ""
                txt_prev_token = ""

            # chunk four: token is the object
            if "obj" in tokn.dep_:
                enti_two = modifier + " " + prefix + " " + tokn.text

            # chunk five: update variables
            dep_prev_token = tokn.dep_
            txt_prev_token = tokn.text

    return [enti_one.strip(), enti_two.strip()]

extract_entities("The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test")
['AbC-19 rapid antibody test', 'COVID-19 UK Rapid Test']

Now let’s use the function to extract entity pairs for the first 800 text snippets.

pairs_of_entities = []
for i in tqdm(data['text'][:800]):
    pairs_of_entities.append(extract_entities(i))

Subject object pairs from sentences:



[['where aluminium powder', 'such explosives manufacturing'],
 ['310  people', 'Cancer Research UK'],
 ['Structural External links', '2 PDBe KB'],
 ['which', '1 Medical Subject Headings'],
 ['Structural External links', '2 PDBe KB'],
 ['users', 'permanently  taste']]

Relations extraction

With entity extraction, half the job is done. To build a knowledge graph, we need to connect the nodes (entities). These edges are relations between pairs of nodes. The function below is capable of capturing such predicates from these sentences. I used spaCy’s rule-based matching. The pattern defined in the function tries to find the ROOT word or the main verb in the sentence. 

def obtain_relation(sent):
    doc = nlp(sent)

    matcher = Matcher(nlp.vocab)

    # the ROOT verb, optionally followed by a preposition or an agent word
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"}]

    matcher.add("matching_1", [pattern])

    matches = matcher(doc)
    h = len(matches) - 1

    span = doc[matches[h][1]:matches[h][2]]

    return span.text

The pattern above tries to find the ROOT word of the sentence. Once it’s recognized, the matcher checks whether it’s followed by a preposition or an agent word; if so, that word is included alongside the ROOT.

relations = [obtain_relation(j) for j in tqdm(data['text'][:800])]

Most frequent relations extracted:

pd.Series(relations).value_counts()[:50]
Let’s build a knowledge graph

Now we can finally create a knowledge graph from the extracted entities.

Let’s draw the network using the networkX library. We’ll create a directed multigraph network with node size in proportion to degree centrality. In other words, the relations between any connected node pair are not two-way. They’re only from one node to another.
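Degree centrality is simply a node’s degree divided by the number of other nodes. Before plotting, it can be computed by hand; a quick stdlib sketch on a toy edge list (the pairs are invented for illustration) shows where the node sizes would come from:

```python
from collections import Counter

# Hypothetical directed edges (source, target) from extracted entity pairs.
edges = [("test", "exposure"), ("test", "antibodies"), ("virus", "exposure")]

degree = Counter()
for s, t in edges:
    degree[s] += 1   # out-degree
    degree[t] += 1   # in-degree

n = len(set(x for e in edges for x in e))  # number of distinct nodes
centrality = {node: d / (n - 1) for node, d in degree.items()}
print(centrality)  # node -> normalized degree
```

networkx provides the same computation as `ntx.degree_centrality(graph)`, and the resulting values can be scaled into the `node_size` argument of `ntx.draw`.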

# subject extraction
source = [j[0] for j in pairs_of_entities]

# object extraction
target = [k[1] for k in pairs_of_entities]

data_kgf = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

  • We are using the networkx library to create a network from the data frame.
  • Here nodes are represented as entities, and edges represent the relationships between nodes.

# create a directed graph from the dataframe
graph = ntx.from_pandas_edgelist(data_kgf, "source", "target",
                         edge_attr=True, create_using=ntx.MultiDiGraph())

# plotting the network
plot.figure(figsize=(14, 14))
posn = ntx.spring_layout(graph)
ntx.draw(graph, with_labels=True, node_color='green', pos=posn)

  • From the above graph, it’s hard to get a sense of which relations were captured.
  • Let’s visualize the graph one relation at a time. Here I am choosing the relation “Information from”:

graph = ntx.from_pandas_edgelist(data_kgf[data_kgf['edge'] == "Information from"], "source", "target",
                         edge_attr=True, create_using=ntx.MultiDiGraph())

posn = ntx.spring_layout(graph, k=0.5)  # k regulates the distance between nodes
ntx.draw(graph, with_labels=True, node_color='green', node_size=1400, pos=posn)

  • One more graph filtered with the relation name “links” can be found here.

Logging metadata

I have logged the above networkx graph to Neptune as an image. Log each image to a different path, depending on the output obtained:

plot.savefig('knowledge_graph.png')
run["graphs/knowledge_graph"].upload('knowledge_graph.png')

All graphs can be found here.

Coreference resolution

To obtain more refined graphs, you can also use coreference resolution.

Coreference resolution is the NLP equivalent of endophoric awareness, used in information retrieval systems, conversational agents, and virtual assistants like Alexa. It is the task of clustering mentions in a text that refer to the same underlying entities.

For example, in a given passage, “I”, “my”, and “she” might belong to one cluster, while “Joe” and “he” belong to another.

Algorithms that resolve coreferences commonly look for the nearest preceding mention that’s compatible with the referring expression. Instead of using rule-based dependency parse trees, neural networks can also be trained, which take into account word embeddings and distance between mentions as features.

This significantly improves entity pair extraction by normalizing text, removing redundancies, and assigning entity pronouns.
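As a toy illustration of the nearest-preceding-mention heuristic (a deliberate simplification, not a real coreference system), the sketch below links each pronoun to the most recent capitalized name seen so far:

```python
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def resolve_naively(tokens):
    """Map each pronoun to the most recently seen capitalized word."""
    last_name = None
    resolved = []
    for tok in tokens:
        if tok.lower() in PRONOUNS and last_name:
            resolved.append((tok, last_name))
        elif tok[:1].isupper():
            last_name = tok
    return resolved

tokens = "Margaret invited Susan and she cooked".split()
print(resolve_naively(tokens))  # [('she', 'Susan')]
```

Note how easily the heuristic is fooled: depending on context, “she” may actually refer to Margaret, which is exactly why trained models that use embeddings and mention distance as features do better.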

If your use case is domain-specific, it would be worth your while to train a custom entity recognition model.

Knowledge graphs can be built automatically and explored to reveal new insights about the domain. 

Notebook uploaded to Neptune.

Notebook on GitHub.

Knowledge graphs at scale

To effectively use the entire corpus of 1749 pages for our topic, use the columns created in the scrape_wikipedia function to add properties to each node. Then you can track the page and category of each node. You can use multiprocessing and parallel processing to reduce execution time.

Some of the use cases of KGs are:

  • enhancing search engine results and question answering,
  • product and content recommendation,
  • fraud detection and risk analysis,
  • organizing domain knowledge in areas like healthcare.

Challenges ahead

Entity disambiguation and managing identity

In its simplest form, the challenge is assigning a unique normalized identity and a type to an utterance or a mention of an entity. 

Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be incorrectly associated with the wrong facts and result in incorrect inference downstream.
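A first, crude line of defense against near-duplicate surface forms is fuzzy string matching. The sketch below uses the stdlib `difflib` module with an invented entity list; real disambiguation also needs entity types and surrounding context:

```python
import difflib

# Hypothetical registry of already-known entity surface forms.
known_entities = ["AbC-19 rapid antibody test", "SARS-CoV-2", "Abingdon Health"]

def candidate_matches(mention, cutoff=0.8):
    """Return known entities whose surface form is close to the mention."""
    return difflib.get_close_matches(mention, known_entities, n=3, cutoff=cutoff)

print(candidate_matches("AbC19 rapid antibody test"))
```

A mention with a small typo maps back to its canonical form; but two genuinely different entities with similar names would also match, which is where type information and context have to break the tie.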

Type membership and resolution

Most knowledge-graph systems today allow each entity to have multiple types, with specific types for different circumstances. Cuba can be a country or it can refer to the Cuban government. In some cases, knowledge-graph systems defer the type assignment to runtime. Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.


Managing changing knowledge

An effective entity-linking system needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break a single existing entity into multiple. 

When a company acquires another company, does the acquiring company change its identity? Does identity follow the acquisition of the rights to a name? For example, in the case of KGs constructed in the healthcare industry, patient data will change over a period of time.

Knowledge extraction from multiple structured and unstructured sources

The extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing graphs at scale requires moving beyond manual approaches to unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

Managing operations at scale

Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs.

Note: for more details on how different tech giants implement industry-scale knowledge graphs in their products, and the related challenges, check this article.

Natural Language Processing

Natural Language Processing (NLP) is a subfield of computer science concerned with enabling computers to process and understand human language. Technically, the main task of NLP is to program computers to analyze and process huge amounts of natural language data.

Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of solutions to address them.

Ambiguity in language

Ambiguity, in NLP, refers to the ability of a piece of language to be understood in more than one way. Natural language is inherently ambiguous. NLP deals with the following kinds of ambiguity:

  • Lexical ambiguity is the ambiguity of a single word. For example, the word well can be an adverb, noun, or verb.
  • Syntactic ambiguity is the presence of 2 or more possible meanings within a single sentence or sequence of words. For example “the chicken is ready for consumption”. This sentence either means the chicken is cooked and can be eaten now, or the chicken is ready to be fed.
  • Anaphoric ambiguity arises from referencing backward in a text: a phrase or word refers to something previously mentioned, but there’s more than one possibility. For example, “Margaret invited Susan for a visit, and she gave her a good meal.” (she = Margaret; her = Susan). “Margaret invited Susan for a visit, but she told her she had to go to work” (she = Susan; her = Margaret).
  • Pragmatic ambiguity arises when the context of a sentence allows multiple interpretations, i.e. when the wording is not specific enough to settle on a single meaning.

Text similarity metrics in NLP

Text similarity is used to determine how similar two text documents are in terms of their context or meaning. There are various similarity metrics, such as: 

  • Cosine similarity,
  • Euclidean distance,
  • Jaccard Similarity.

All these metrics have their own specification to measure the similarity between two queries.

Cosine similarity

Cosine similarity is a metric that measures the similarity between two text documents in NLP, irrespective of their size. Words are represented in vector form, and text documents are represented as vectors in n-dimensional space.

Cosine similarity measures the cosine of the angle between two n-dimensional vectors projected in multidimensional space. The cosine similarity of two documents will range from 0 to 1. If the cosine similarity score is 1, the two vectors have the same orientation. A value closer to 0 indicates that the two documents are less similar.

The mathematical equation of cosine similarity between two non-zero vectors A and B is:

similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
Cosine similarity is often a better metric than Euclidean distance because two text documents can be far apart in Euclidean distance (for example, due to different lengths) yet still close in orientation, and therefore in context.
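Cosine similarity can be implemented directly on word-count vectors with the standard library; this is a bag-of-words sketch, not how production systems embed text:

```python
import math
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine of the angle between two word-count vectors."""
    v1, v2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

print(cosine_similarity("work from home is the new normal",
                        "work from home is normal"))
```

Because both vectors here have non-negative counts, the score lands between 0 and 1, matching the range described above.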

Jaccard similarity

Jaccard Similarity is also known as the Jaccard index and Intersection Over Union.

Jaccard similarity is used to determine the similarity between two text documents, i.e. what proportion of all their words they have in common.

Jaccard similarity is defined as the size of the intersection of the two documents' word sets divided by the size of their union: the number of common words over the total number of unique words.

The mathematical representation of the Jaccard similarity is:

J(doc1, doc2) = |doc1 ∩ doc2| / |doc1 ∪ doc2|

The Jaccard similarity score is in a range of 0 to 1. If two documents are identical, Jaccard similarity is 1. The Jaccard similarity score is zero if there are no common words between the two documents.

Python Code to find Jaccard similarity

def jaccard_similarity(doc1, doc2):
    # list the unique words in each document
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())

    # find the intersection of the word sets of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)

    # find the union of the word sets of doc1 & doc2
    union = words_doc1.union(words_doc2)

    # Jaccard similarity = size of intersection / size of union
    return float(len(intersection)) / len(union)

docu_1 = "Work from home is the new normal in digital world"
docu_2 = "Work from home is normal"

jaccard_similarity(docu_1, docu_2)

Output: 0.5

The Jaccard similarity between docu_1 and docu_2 is 0.5.

The three methods above share the same assumption: documents (or sentences) are similar if they have words in common. This idea is very straightforward and fits basic cases, such as comparing two sentences with heavy word overlap.

However, the scores can be relatively low for two sentences that convey the same meaning with different words, even when both describe the same news (try the Python function above on such a pair).

Another limitation is that the above methods don’t handle synonyms. For example ‘buy’ and ‘purchase’ should have the same meaning (in some cases), but the above methods will treat both words differently.

So what’s the workaround? You can use word embeddings (Word2vec, GloVe, FastText).
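To see why embeddings help, here is a toy sketch with hand-made 3-dimensional vectors (the numbers are invented, standing in for real Word2vec or GloVe vectors), where “buy” and “purchase” end up close even though they share no characters:

```python
import math

# Hand-crafted toy vectors; real embeddings would come from Word2vec/GloVe.
vectors = {
    "buy":      [0.9, 0.1, 0.2],
    "purchase": [0.85, 0.15, 0.25],
    "banana":   [0.1, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(vectors["buy"], vectors["purchase"]))  # close to 1
print(cosine(vectors["buy"], vectors["banana"]))    # much lower
```

With real embeddings, this is exactly how synonym pairs score high even when string-overlap metrics like Jaccard give them 0.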

For some of the basic concepts and use cases of NLP, I’ve written some articles on Medium, and one on Neptune’s blog, for reference.


I hope you’ve learned something new here, and this article helped you understand web scraping, knowledge graphs, and a few useful NLP concepts. 

Thanks for reading, and keep on learning!