
Knowledge Graphs With Machine Learning [Guide]

27th July, 2023

Suppose you need to get some information online, for example a few paragraphs about Usain Bolt. You can copy and paste it from Wikipedia; it won't be much work.

But what if you needed to get information about all competitions that Usain Bolt had taken part in and all related stats about him and his competitors? And then what if you wanted to do that for all sports, not just running? Additionally, what if you wanted to infer some relationship among this plethora of unstructured data?

Machine learning engineers often need to build complex datasets like the example above to train their models. Web scraping is a very useful method to collect the necessary data, but it comes with some challenges.

In this article, I’m going to explain how to build knowledge graphs by scraping publicly available data.

What is web scraping?

Web scraping (or web harvesting) is the automated extraction of data from websites. The term typically refers to collecting data with a bot or web crawler. It's a form of copying in which specific data is gathered from the web, typically into a local database or spreadsheet for later use or analysis.

Diagram of a web scraping pipeline | Source

You can do web scraping with online services, APIs, or your own code.

There are two key elements to web scraping:

  • Crawler: The crawler is an algorithm that browses the web to search for particular data by exploring links across the internet.
  • Scraper: The scraper extracts data from websites. The design of scrapers can vary a lot. It depends on the complexity and scope of the project. Ultimately it has to quickly and accurately extract the data.

A good example of a ready-made library is the Wikipedia scraper library. It does a lot of the heavy lifting for you: you provide URLs containing the required data, and it loads the HTML from those pages. The scraper then extracts the data you need from this HTML and outputs it in your chosen format, such as an Excel spreadsheet, CSV, or JSON.
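To make the crawler/scraper split concrete, here is a minimal sketch with the requests and BeautifulSoup libraries (my own choice for illustration; the rest of this article uses a Wikipedia-specific wrapper instead). The first function plays the role of the crawler and collects links from a page, the second plays the role of the scraper and pulls out the visible text:

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def crawl_links(url):
    # "crawler": discover the pages a given page links to
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def scrape_text(url):
    # "scraper": extract the visible text of a single page
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)

links = crawl_links("https://en.wikipedia.org/wiki/Usain_Bolt")
text = scrape_text("https://en.wikipedia.org/wiki/Usain_Bolt")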

Knowledge graph

The amount of content available on the web is incredible already, and it’s expanding at an increasingly fast rate. Billions of websites are linked with the World Wide Web, and search engines can go through those links and serve useful information with great precision and speed. This is in part thanks to knowledge graphs.

Different organizations have different knowledge graphs. For example, the Google Knowledge Graph is a knowledge base used by Google and its services to enhance search engine results with information gathered from a variety of sources. Similar techniques are used in Facebook or Amazon products for a better user experience and to store and retrieve useful information. 

There’s no formal definition of a knowledge graph (KG). Broadly speaking, a KG is a kind of semantic network with added constraints; its scope, structure, characteristics, and even its uses often only become clear during development.

Bringing knowledge graphs and machine learning (ML) together can systematically improve the accuracy of systems and extend the range of machine learning capabilities. Thanks to knowledge graphs, results inferred from machine learning models will have better explainability and trustworthiness.

Bringing knowledge graphs and ML together creates some interesting opportunities. In cases where we might have insufficient data, KGs can be used to augment training data. One of the major challenges in ML models is explaining predictions made by ML systems. Knowledge graphs can help overcome this issue by mapping explanations to proper nodes in the graph and summarizing the decision-making process.


Another way to look at it is that a knowledge graph stores data that resulted from an information extraction task. Many implementations of KG make use of a concept called triplet — a set of three items (a subject, a predicate, and an object) that we can use to store information about something. 

A knowledge graph

Node A and Node B are 2 different entities. These nodes are connected by an edge that represents their relationship. This is the smallest KG we can build – also known as a triple. Knowledge graphs come in a variety of shapes and sizes.
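To make the triple idea concrete, here is a minimal sketch of how a single subject-predicate-object triple can be stored as an edge in a networkx graph, the same library we'll use later (the entities and the relation are made up for illustration):

import networkx as ntx

# the smallest possible knowledge graph: one triple
kg = ntx.MultiDiGraph()
kg.add_edge("Usain Bolt", "100 m sprint", relation="competed in")  # subject -> object, predicate as an edge attribute

for subj, obj, rel in kg.edges(data="relation"):
    print(subj, "--", rel, "-->", obj)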

Web scraping, computational linguistics, NLP algorithms, and graph theory (with Python code)

Phew, that’s a wordy heading. Anyway, to build knowledge graphs from text, it’s important to help our machine understand natural language. We do this with NLP techniques such as sentence segmentation, dependency parsing, parts-of-speech (POS) tagging, and entity recognition.

The first step in building a KG is to collect your sources; let's crawl the web for some information. Wikipedia will be our source (always check the sources of your data, as a lot of information online is false).

For this blog, we’ll use the Wikipedia API, a direct Python wrapper, and neptune.ai to manage, log, store, display, and organize the metadata. If you expect your project to produce a number of artifacts (like this one), it is really helpful to use a tool for tracking and versioning them.

Installation and setup

Install dependencies and scrape data:

!pip install wikipedia-api neptune neptune-notebooks pandas spacy networkx scipy

Check the documentation of these libraries for installation and setup instructions.

First, import the libraries we need:

import wikipediaapi  # pip install wikipedia-api
import pandas as pd
import concurrent.futures
from tqdm import tqdm

The function below searches Wikipedia for the topic you provide and extracts text from the target page and its internal links.

def scrape_wikipedia(name_topic, verbose=True):
    def link_to_wikipedia(link):
        try:
            page = api_wikipedia.page(link)
            if page.exists():
                return {'page': link, 'text': page.text, 'link': page.fullurl,
                        'categories': list(page.categories.keys())}
        except Exception:
            return None

    # Note: recent versions of wikipedia-api may also require a user_agent argument
    api_wikipedia = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
    name_of_page = api_wikipedia.page(name_topic)
    if not name_of_page.exists():
        print('Page {} is not present'.format(name_topic))
        return

    # scrape the target page plus every page it links to, in parallel
    links_to_page = list(name_of_page.links.keys())
    progress = tqdm(desc='Scraped links', unit='', total=len(links_to_page)) if verbose else None
    origin = [{'page': name_topic, 'text': name_of_page.text, 'link': name_of_page.fullurl,
               'categories': list(name_of_page.categories.keys())}]

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        links_future = {executor.submit(link_to_wikipedia, link): link for link in links_to_page}
        for future in concurrent.futures.as_completed(links_future):
            info = future.result()
            if info:
                origin.append(info)
            if verbose:
                progress.update(1)
    if verbose:
        progress.close()

    # drop administrative namespaces and near-empty pages
    namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki',
                  'Template', 'Help', 'User', 'Category talk', 'Portal talk')
    origin = pd.DataFrame(origin)
    origin = origin[(origin['text'].str.len() > 20)
                    & ~(origin['page'].str.startswith(namespaces, na=True))]
    # strip the 'Category:' prefix from the category names
    origin['categories'] = origin.categories.apply(lambda a: [b[9:] for b in a])

    origin['topic'] = name_topic
    print('Scraped pages', len(origin))

    return origin

Let’s test the function on the topic “COVID-19”.

data_wikipedia = scrape_wikipedia('COVID 19')
Output:

Scraped links: 100%|██████████| 1965/1965 [04:30<00:00, 7.25it/s]
Scraped pages 1749

Save the data to a CSV file:

data_wikipedia.to_csv('scraped_data.csv')

Download the spaCy language model:

python -m spacy download en_core_web_sm

Import libraries 

import spacy
import pandas as pd
import requests
from spacy import displacy
# import en_core_web_sm
 
nlp = spacy.load('en_core_web_sm')
 
from spacy.tokens import Span
from spacy.matcher import Matcher
 
import matplotlib.pyplot as plot
from tqdm import tqdm
import networkx as ntx
import neptune
 
%matplotlib inline
run = neptune.init_run(api_token="your API key",
                   project="aravindcr/KnowledgeGraphs")

Upload data to Neptune.

run["data"].upload("scraped_data.csv")

Download the data here. Also available in Neptune.

data = pd.read_csv('scraped_data.csv')

Output (an excerpt of the text of one scraped page):

The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single drop of blood obtained from a finger prick and yields results in 20 minutes.

Sentence segmentation

The first step of building a knowledge graph is to split the text document or article into sentences. Then we limit our examples to simple sentences with one subject and one object.

# Let's take part of the article extracted above
docu = nlp('''The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by
the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine
whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
drop of blood obtained from a finger prick and yields results in 20 minutes.\n\nSee also\nCOVID-19 rapid
antigen test''')
 
for tokn in docu:
   print(tokn.text, "---", tokn.dep_)
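The dep_ values printed above are the dependency tags we'll rely on later. The same docu object also gives us the sentence split and the recognized named entities directly, which is a quick way to check what spaCy found:

# sentence segmentation: spaCy splits the doc into sentences for us
for sent in docu.sents:
    print(sent.text)

# named entities recognized in the text
for ent in docu.ents:
    print(ent.text, "---", ent.label_)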

Make sure you've downloaded the pre-trained spaCy model as shown earlier:

python -m spacy download en_core_web_sm

The spaCy pipeline assigns word vectors, context-specific token vectors, part-of-speech tags, dependency parses, and named entities. By extending spaCy's pipeline of annotations, you can also resolve coreferences (explained below).

Knowledge graphs can be constructed automatically from part-of-speech tags and dependency parses. Extracting entity pairs from grammatical patterns is fast and scales to large amounts of text with the NLP library spaCy.
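To inspect the grammatical structure this extraction relies on, you can render the dependency parse of the docu object from above with displacy (imported earlier). In a notebook, something like this displays the parse tree inline:

# visualize the dependency parse of the first sentence (notebook only)
sentence = list(docu.sents)[0]
displacy.render(sentence, style='dep', jupyter=True)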

The following function defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other approximations can be used to produce different types of connections. This kind of connection can be referred to as a subject-predicate-object triple.

The main idea is to go through a sentence and extract the subject and the object as and when they are encountered. The function below implements these steps.

Entity extraction

You can extract a single-word entity from a sentence with the help of parts-of-speech (POS) tags. The nouns and proper nouns will be the entities. 

However, when an entity spans multiple words, POS tags alone aren’t sufficient. We need to parse the dependency tree of the sentence. To build a knowledge graph, the most important things are the nodes and edges between them. 

These nodes are going to be entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities. We will extract these elements in an unsupervised manner, i.e. we’ll use the grammar of the sentences.

The idea is to go through a sentence and extract the subject and the object as and when they are encountered.

def extract_entities(sents):
   # chunk one
   enti_one = ""
   enti_two = ""
  
   dep_prev_token = "" # dependency tag of previous token in sentence
  
   txt_prev_token = "" # previous token in sentence
  
   prefix = ""
   modifier = ""
  
  
  
   for tokn in nlp(sents):
       # chunk two
       ## move to next token if token is punctuation
      
       if tokn.dep_ != "punct":
           #  check if token is compound word or not
           if tokn.dep_ == "compound":
               prefix = tokn.text
                # add the current word to it if the previous word is 'compound'
               if dep_prev_token == "compound":
                   prefix = txt_prev_token + " "+ tokn.text
                  
           # verify if token is modifier or not
            if tokn.dep_.endswith("mod"):
               modifier = tokn.text
               # add it to the current word if the previous word is 'compound'
               if dep_prev_token == "compound":
                   modifier = txt_prev_token + " "+ tokn.text
                  
           # chunk3
            if "subj" in tokn.dep_:  # the token is a subject (e.g. nsubj, nsubjpass)
               enti_one = modifier +" "+ prefix + " "+ tokn.text
               prefix = ""
               modifier = ""
               dep_prev_token = ""
               txt_prev_token = ""
              
           # chunk4
            if "obj" in tokn.dep_:  # the token is an object (e.g. dobj, pobj)
               enti_two = modifier +" "+ prefix +" "+ tokn.text
              
           # chunk 5
           # update variable
           dep_prev_token = tokn.dep_
           txt_prev_token = tokn.text
          
   return [enti_one.strip(), enti_two.strip()]
extract_entities("The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test")
['AbC-19 rapid antibody test', 'COVID-19 UK Rapid Test']

Now let's use the function to extract entity pairs from the first 800 documents in our dataset.

pairs_of_entities = []
for i in tqdm(data['text'][:800]):
   pairs_of_entities.append(extract_entities(i))

Subject-object pairs extracted from the text:

pairs_of_entities[36:42]

Output:

[['where aluminium powder', 'such explosives manufacturing'],
 ['310  people', 'Cancer Research UK'],
 ['Structural External links', '2 PDBe KB'],
 ['which', '1 Medical Subject Headings'],
 ['Structural External links', '2 PDBe KB'],
 ['users', 'permanently  taste']]

Relation extraction

With entity extraction, half the job is done. To build a knowledge graph, we need to connect the nodes (entities). These edges are relations between pairs of nodes. The function below is capable of capturing such predicates from these sentences. I used spaCy’s rule-based matching. The pattern defined in the function tries to find the ROOT word or the main verb in the sentence. 

def obtain_relation(sent):

    doc = nlp(sent)

    matcher = Matcher(nlp.vocab)

    # look for the ROOT word, optionally followed by a preposition, an agent word, or an adjective
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"},
               {'POS': 'ADJ', 'OP': "?"}]

    # spaCy v3 API; in spaCy v2 this was matcher.add("matching_1", None, pattern)
    matcher.add("matching_1", [pattern])

    matches = matcher(doc)
    h = len(matches) - 1

    span = doc[matches[h][1]:matches[h][2]]

    return span.text

The pattern above tries to find the ROOT word of the sentence. Once it's recognized, the matcher checks whether a preposition or an agent word follows it; if so, it's added to the root word.
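As a quick sanity check, you can run the function on a single sentence from the excerpt above (the exact span returned depends on the spaCy model version):

obtain_relation("The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test Consortium.")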

relations = [obtain_relation(j) for j in tqdm(data['text'][:800])]

Building a knowledge graph

Now we can finally create a knowledge graph from the extracted entities and relations.

Let's draw the network using the networkx library. We'll create a directed multigraph, which means the relation between any connected node pair runs only from one node to the other, not both ways.

# subject extraction
source = [j[0] for j in pairs_of_entities]

#object extraction
target = [k[1] for k in pairs_of_entities]

data_kgf = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
  • We use the networkx library to create a network from the data frame.
  • Nodes represent entities, and edges represent the relationships between them.
# Create DG from the dataframe
graph = ntx.from_pandas_edgelist(data_kgf, "source", "target",
                         edge_attr=True, create_using=ntx.MultiDiGraph())
# plotting the network
plot.figure(figsize=(14, 14))
posn = ntx.spring_layout(graph)
ntx.draw(graph, with_labels=True, node_color='green', edge_cmap=plot.cm.Blues, pos = posn)
plot.show()
Graph logged to neptune.ai
  • From the graph above, it's hard to get a sense of which relations were captured.
  • Let's filter on a single relation to visualize the graph. Here I'm choosing the relation "Information from":
graph = ntx.from_pandas_edgelist(data_kgf[data_kgf['edge']=="Information from"], "source", "target",
                         edge_attr=True, create_using=ntx.MultiDiGraph())
 
plot.figure(figsize=(14,14))
pos = ntx.spring_layout(graph, k = 0.5) # k regulates the distance between nodes
ntx.draw(graph, with_labels=True, node_color='green', node_size=1400, edge_cmap=plot.cm.Blues, pos = pos)
plot.show()
Filtered knowledge graph logged to neptune.ai

One more graph, filtered with the relation name “links”, can be found here.

Logging metadata

I have logged the networkx graphs above to Neptune under the paths shown below. Log your images under different paths depending on the outputs you obtain.

run['graphs/all_in_graph'].upload('graph.png')
run['graphs/filtered_relations'].upload('info.png')
run['graphs/filtered_relations2'].upload('links.png')

All graphs can be found here.

Coreference resolution

To obtain more refined graphs, you can also use coreference resolution.

Coreference resolution is the NLP equivalent of endophoric awareness; it's used in information retrieval systems, conversational agents, and virtual assistants like Alexa. It's the task of clustering mentions in text that refer to the same underlying entities.

For example, mentions like “I”, “my”, and “she” might belong to one cluster, while “Joe” and “he” belong to another.

Algorithms that resolve coreferences commonly look for the nearest preceding mention that’s compatible with the referring expression. Instead of using rule-based dependency parse trees, neural networks can also be trained, which take into account word embeddings and distance between mentions as features.
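As an illustration of that "nearest preceding mention" idea, here is a deliberately naive, rule-based sketch that uses only spaCy and simply swaps each pronoun for the nearest named entity seen before it (a toy heuristic of my own, not a real coreference model):

def naive_resolve(text):
    # toy heuristic: replace each pronoun with the nearest preceding named entity
    doc = nlp(text)  # assumes the en_core_web_sm pipeline loaded earlier
    ents_by_start = {ent.start: ent for ent in doc.ents}
    last_entity, out, skip_until = None, [], -1
    for tokn in doc:
        if tokn.i < skip_until:
            continue  # token is inside an entity span we already emitted
        if tokn.i in ents_by_start:
            ent = ents_by_start[tokn.i]
            last_entity = ent.text
            out.append(ent.text)
            skip_until = ent.end
        elif tokn.pos_ == 'PRON' and last_entity:
            out.append(last_entity)  # the "resolution" step
        else:
            out.append(tokn.text)
    return ' '.join(out)

naive_resolve("Usain Bolt set the 100 m world record in Berlin. He also won eight Olympic gold medals.")

Because it ignores the compatibility checks mentioned above (gender, number, entity type), this toy version will happily pick the wrong antecedent; that gap is exactly what trained coreference models address.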

This significantly improves entity pair extraction by normalizing text, removing redundancies, and resolving pronouns to the entities they refer to.

If your use case is domain-specific, it would be worth your while to train a custom entity recognition model.

Knowledge graphs can be built automatically and explored to reveal new insights about the domain. 

Notebook uploaded to neptune.ai.

Notebook on GitHub.

Knowledge graphs at scale

To effectively use the entire corpus of 1,749 pages for our topic, use the columns created in the scrape_wikipedia function to add properties to each node; that way, you can track the page and category of each node. You can use multiprocessing and parallel processing to reduce execution time.
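For example, you can carry the source page along with every extracted entity pair and attach it as a node property in the graph. This is only a sketch built on the data, data_kgf, and ntx objects from earlier; adjust the columns to whatever properties you want to track:

# carry the source page along with each extracted entity pair (the first 800 rows were used above)
data_kgf['page'] = data['page'][:800].values

graph = ntx.from_pandas_edgelist(data_kgf, "source", "target",
                                 edge_attr=True, create_using=ntx.MultiDiGraph())

# attach the page of each source entity as a node property (categories can be added the same way)
page_of_node = dict(zip(data_kgf['source'], data_kgf['page']))
ntx.set_node_attributes(graph, page_of_node, name='page')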

Some of the use cases of KGs are:

  • Question answering
  • Storing information
  • Recommendation systems
  • Supply chain management

Challenges ahead

Entity disambiguation and managing identity

In its simplest form, the challenge is assigning a unique normalized identity and a type to an utterance or a mention of an entity. 

Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be incorrectly associated with the wrong facts and result in incorrect inference downstream.
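A first step toward disambiguation can be an alias table that maps surface forms to canonical entities (a toy sketch with made-up entries; production systems combine this with context, types, and embeddings):

# toy alias table: surface form -> canonical entity
aliases = {
    "usain bolt": "Usain Bolt (sprinter)",
    "bolt": "Usain Bolt (sprinter)",  # ambiguous in general, acceptable in a sports-only corpus
    "abc-19": "AbC-19 rapid antibody test",
}

def link_entity(surface_form):
    # returns None when the mention still needs disambiguation
    return aliases.get(surface_form.strip().lower())

print(link_entity("Usain Bolt"))  # Usain Bolt (sprinter)
print(link_entity("AbC-19"))      # AbC-19 rapid antibody test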

Type membership and resolution

Most knowledge-graph systems today allow each entity to have multiple types, with specific types for different circumstances. Cuba can be a country or it can refer to the Cuban government. In some cases, knowledge-graph systems defer the type assignment to runtime. Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.
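As a tiny illustration (a hypothetical sketch, not tied to any particular KG system), an entity can carry several types as a node attribute, and the application picks the relevant one at runtime:

import networkx as ntx

kg = ntx.MultiDiGraph()
kg.add_node("Cuba", types=["Country", "Government"])

def type_for_task(graph, node, task):
    # defer the type decision to runtime, based on what the application needs
    preferred = "Government" if task == "policy_lookup" else "Country"
    types = graph.nodes[node]["types"]
    return preferred if preferred in types else types[0]

print(type_for_task(kg, "Cuba", "policy_lookup"))  # Government
print(type_for_task(kg, "Cuba", "geography"))      # Country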


Managing changing knowledge

An effective entity-linking system needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break a single existing entity into multiple. 

When a company acquires another company, does the acquiring company change its identity? Does identity follow the acquisition of the rights to a name? For example, in the case of KGs constructed in the healthcare industry, patient data will change over a period of time.

Knowledge extraction from multiple structured and unstructured sources

The extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing graphs at scale requires going beyond manual approaches, towards unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

Managing operations at scale

Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs.

Note: for more details on how different tech giants implement industry-scale knowledge graphs in their products, and the related challenges, check this article.

Conclusion 

I hope you’ve learned something new here, and this article helped you understand web scraping, knowledge graphs, and a few useful NLP concepts. 

Thanks for reading, and keep on learning!
