Say you need to get some information online. For example, a few paragraphs about Usain Bolt. You can copy and paste the information from Wikipedia; it won’t be much work.
But what if you needed to get information about all competitions that Usain Bolt had taken part in and all related stats about him and his competitors? And then what if you wanted to do that for all sports, not just running? Additionally, what if you wanted to infer some relationship among this plethora of unstructured data?
Machine learning engineers often need to build complex datasets like the example above to train their models. Web scraping is a very useful method to collect the necessary data, but it comes with some challenges.
In this article, I’m going to explain how to build knowledge graphs by scraping publicly available data.
What is web scraping?
Web scraping (or web harvesting) is the automated extraction of data from websites. The term typically refers to collecting data with a bot or web crawler. It’s a form of copying in which specific data is gathered and copied from the web, typically into a local database or spreadsheet, for later use or analysis.
You can do web scraping with online services, APIs, or you can write your own code that will do it.
There are two key elements to web scraping:
- Crawler: The crawler is an algorithm that browses the web to search for particular data by exploring links across the internet.
- Scraper: The scraper extracts data from websites. The design of scrapers varies a lot, depending on the complexity and scope of the project. Ultimately, a scraper has to extract the data quickly and accurately.
A good example of a ready-made library is the Wikipedia scraper library. It does a lot of the heavy lifting for you: you provide URLs containing the required data, and it loads the HTML from those pages. The scraper then takes the data you need from this HTML and outputs it in your chosen format, such as an Excel spreadsheet, CSV, or JSON.
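As a sketch of that workflow, the third-party `wikipedia` package (one such ready-made wrapper) can fetch a page which we then flatten into a CSV row. The row layout here is my own choice, not a fixed format:

```python
import csv

def page_to_row(title, summary, url):
    """Flatten the scraped page fields into a single CSV row."""
    return {"title": title, "summary": summary.replace("\n", " "), "url": url}

if __name__ == "__main__":
    import wikipedia  # pip install wikipedia -- a ready-made scraper wrapper

    page = wikipedia.page("Usain Bolt", auto_suggest=False)
    with open("bolt.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "summary", "url"])
        writer.writeheader()
        writer.writerow(page_to_row(page.title, page.summary, page.url))
```

The network call only runs under the `__main__` guard; `page_to_row` itself is pure and reusable.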
The amount of content available on the web is incredible already, and it’s expanding at an increasingly fast rate. Billions of websites are linked with the World Wide Web, and search engines can go through those links and serve useful information with great precision and speed. This is in part thanks to knowledge graphs.
Different organizations have different knowledge graphs. For example, the Google Knowledge Graph is a knowledge base used by Google and its services to enhance search engine results with information gathered from a variety of sources. Similar techniques are used in Facebook or Amazon products for a better user experience and to store and retrieve useful information.
There’s no formal definition of a knowledge graph (KG). Broadly speaking, a KG is a kind of semantic network with added constraints; its scope, structure, characteristics, and even its uses take shape during development.
Bringing knowledge graphs and machine learning (ML) together can systematically improve the accuracy of systems and extend the range of machine learning capabilities. Thanks to knowledge graphs, results inferred from machine learning models will have better explainability and trustworthiness.
Bringing knowledge graphs and ML together creates some interesting opportunities. In cases where we might have insufficient data, KGs can be used to augment training data. One of the major challenges in ML models is explaining predictions made by ML systems. Knowledge graphs can help overcome this issue by mapping explanations to proper nodes in the graph and summarizing the decision-making process.
Another way to look at it is that a knowledge graph stores data that resulted from an information extraction task. Many implementations of KG make use of a concept called triplet — a set of three items (a subject, a predicate, and an object) that we can use to store information about something.
Node A and Node B are two different entities. These nodes are connected by an edge that represents their relationship. This is the smallest KG we can build, also known as a triple. Knowledge graphs come in a variety of shapes and sizes.
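The triple idea takes only a few lines of plain Python to capture; here the KG is just a set of (subject, predicate, object) tuples, and the example facts are illustrative:

```python
# A minimal knowledge graph: a set of (subject, predicate, object) triples.
kg = set()

def add_fact(subject, predicate, obj):
    """Store one triple: subject --predicate--> object."""
    kg.add((subject, predicate, obj))

def facts_about(subject):
    """All (predicate, object) pairs stored for a subject node."""
    return {(p, o) for s, p, o in kg if s == subject}

# Illustrative facts -- Node A, edge, Node B in triple form.
add_fact("Usain Bolt", "born in", "Jamaica")
add_fact("Usain Bolt", "competes in", "100 m sprint")
```

Each element of the set is exactly one "Node A, edge, Node B" unit from the description above.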
Web scraping, computational linguistics, NLP algorithms, and graph theory (with Python code)
Phew, that’s a wordy heading. Anyway, to build knowledge graphs from text, it’s important to help our machine understand natural language. We do this with NLP techniques such as sentence segmentation, dependency parsing, parts-of-speech (POS) tagging, and entity recognition.
The first step to build a KG is to collect your sources — let’s crawl the web for some information. Wikipedia will be our source (always check your data sources; a lot of information online is false).
For this blog, we’ll use the Wikipedia API, a direct Python wrapper, and neptune.ai to manage, log, store, display, and organize the metadata. If you expect your project to produce a number of artifacts (like this one), it is really helpful to use a tool for tracking and versioning them.
Installation and setup
Install dependencies and scrape data:
Check these resources for installation and setup.
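The original install commands aren’t reproduced here, so the following is a plausible dependency set for this walkthrough (the `wikipedia` wrapper, pandas, spaCy, networkx, and the Neptune client):

```shell
pip install wikipedia pandas spacy networkx neptune
```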
The function below searches Wikipedia for the topic you provide as input, and extracts information from the target page and its internal links.
Let’s test the function on the topic “COVID-19”.
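Since the original helper isn’t preserved here, below is a hedged reconstruction of a `scrape_wikipedia`-style function built on the `wikipedia` package. The column names (`page`, `text`, `category`) and the link limit are my own assumptions:

```python
def limit_links(links, max_links=10):
    """Deduplicate a page's internal links and cap how many we follow."""
    return sorted(set(links))[:max_links]

def scrape_wikipedia(topic, max_links=10):
    """Fetch the topic page plus a few of its internal links as rows."""
    import wikipedia  # network access happens here

    rows = []
    for title in [topic] + limit_links(wikipedia.page(topic).links, max_links):
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue  # skip disambiguation pages and missing pages
        rows.append({"page": page.title, "text": page.content, "category": topic})
    return rows

if __name__ == "__main__":
    articles = scrape_wikipedia("COVID-19", max_links=5)
    print(len(articles), "pages scraped")
```

The network-bound part is kept behind the `__main__` guard so the pure helper can be reused on its own.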
Save the data to csv
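The scraped rows can be written out with the standard `csv` module; the file name and the sample row here are arbitrary:

```python
import csv

def save_rows(rows, path):
    """Write a list of dicts sharing the same keys to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Sample row standing in for the scraped articles.
save_rows([{"page": "COVID-19", "text": "sample text", "category": "COVID-19"}],
          "articles.csv")
```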
Download the spaCy package.
Upload the data to Neptune.
The first step of building a knowledge graph is to split the text document or article into sentences. Then we limit our examples to simple sentences with one subject and one object.
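Sentence splitting itself doesn’t need a pretrained model; spaCy’s rule-based `sentencizer` component is enough, as a quick illustration:

```python
import spacy

nlp = spacy.blank("en")       # blank English pipeline, no model download needed
nlp.add_pipe("sentencizer")   # rule-based sentence boundary detection

doc = nlp("Usain Bolt is a Jamaican sprinter. He holds the 100 m world record.")
sentences = [sent.text for sent in doc.sents]
```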
Download the pre-trained SpaCy model as shown below:
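The standard command for fetching spaCy’s small English model is:

```shell
pip install -U spacy
python -m spacy download en_core_web_sm
```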
The SpaCy pipeline assigns word vectors, context-specific token vectors, part-of-speech tags, dependency parsing, and named entities. By extending SpaCy’s pipeline of annotations you can resolve coreferences (explained below).
Knowledge graphs can be automatically constructed from parts-of-speech and dependency parsing. Extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library SpaCy.
The following function defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other approximations can be used to produce different types of connections; this kind of connection can be referred to as a subject-predicate-object triple. The main idea is to go through a sentence and extract the subject and the object as they are encountered. The function below implements some of these steps.
You can extract a single-word entity from a sentence with the help of parts-of-speech (POS) tags. The nouns and proper nouns will be the entities.
However, when an entity spans multiple words, POS tags alone aren’t sufficient. We need to parse the dependency tree of the sentence. To build a knowledge graph, the most important things are the nodes and edges between them.
These nodes are going to be entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities. We will extract these elements in an unsupervised manner, i.e. we’ll use the grammar of the sentences.
The idea is to go through a sentence and pick out the subject and the object as and when they are encountered.
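A simplified sketch of that traversal (not the exact original function): keep the last subject and last object seen, prefixing each head noun with any compound modifiers from the dependency tree.

```python
def get_entity_pair(doc):
    """Return (subject, object) from a dependency-parsed spaCy Doc,
    expanding each head noun with its compound children."""
    subj, obj = "", ""
    for tok in doc:
        # collect compound words attached to the left of this token
        prefix = " ".join(c.text for c in tok.lefts if c.dep_ == "compound")
        if "subj" in tok.dep_:
            subj = f"{prefix} {tok.text}".strip()
        elif "obj" in tok.dep_:
            obj = f"{prefix} {tok.text}".strip()
    return subj, obj

if __name__ == "__main__":
    import spacy
    nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
    print(get_entity_pair(nlp("Usain Bolt won the gold medal")))
```

Because the function only reads token attributes, it works on any parsed `Doc`, whichever pipeline produced it.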
Now let’s use the function to extract entity pairs for 800 sentences.
Subject-object pairs extracted from the sentences:
With entity extraction, half the job is done. To build a knowledge graph, we need to connect the nodes (entities) with edges, which are the relations between pairs of nodes. The function below captures such predicates using spaCy’s rule-based matching: the pattern tries to find the ROOT word, i.e. the main verb of the sentence, and once the root is recognized, it checks whether a preposition or an agent word follows it. If so, that word is appended to the root.
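That rule can be sketched with spaCy’s `Matcher`: match the ROOT token, optionally followed by a preposition or agent, and keep the longest match. This mirrors the description above but is an approximation, not the exact original pattern.

```python
from spacy.matcher import Matcher

def get_relation(doc):
    """Extract the predicate: the ROOT verb plus an optional prep/agent."""
    matcher = Matcher(doc.vocab)
    pattern = [{"DEP": "ROOT"},
               {"DEP": "prep", "OP": "?"},
               {"DEP": "agent", "OP": "?"}]
    matcher.add("relation", [pattern])
    matches = matcher(doc)
    if not matches:
        return ""
    # with optional tokens the matcher yields several spans; keep the longest
    _, start, end = max(matches, key=lambda m: m[2] - m[1])
    return doc[start:end].text
```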
Building a knowledge graph
Now we can finally create a knowledge graph from the extracted entities
Let’s draw the network using the networkx library. We’ll create a directed multigraph with node size proportional to degree centrality. In other words, the relations between any connected node pair are not two-way; they only go from one node to the other.
- We use the networkx library to create a network from the data frame.
- Nodes represent entities, and edges represent the relationships between them.
- From the full graph, it’s hard to get a sense of which relations were captured.
- So let’s filter the graph by a single relation and visualize that subgraph.
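The steps above can be sketched as follows; the triples are made-up stand-ins for the extracted (subject, relation, object) rows:

```python
import networkx as nx

# Illustrative triples, not real extracted data.
triples = [
    ("covid-19", "caused by", "sars-cov-2"),
    ("covid-19", "links", "pandemic"),
    ("vaccine", "links", "immunity"),
]

G = nx.MultiDiGraph()                  # directed: edges go one way only
for subj, rel, obj in triples:
    G.add_edge(subj, obj, label=rel)

# Filter the graph down to a single relation to make it readable.
links_only = nx.MultiDiGraph(
    (s, o, {"label": r}) for s, r, o in triples if r == "links"
)

if __name__ == "__main__":
    import matplotlib.pyplot as plt
    # node size proportional to degree centrality
    sizes = [3000 * nx.degree_centrality(G)[n] for n in G]
    nx.draw(G, with_labels=True, node_size=sizes)
    plt.show()
```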
One more graph filtered with the relation name “links” can be found here.
All graphs can be found here.
To obtain more refined graphs you can also use the coreference resolution.
Coreference resolution is the NLP equivalent of endophoric awareness, and it’s used in information retrieval systems, conversational agents, and virtual assistants like Alexa. It’s the task of clustering mentions in a text that refer to the same underlying entities.
For example, in a sentence like “‘I lost my phone,’ she told Joe before he called it”, the mentions “I”, “my”, and “she” belong to one cluster (the speaker), and “Joe” and “he” belong to another.
Algorithms that resolve coreferences commonly look for the nearest preceding mention that’s compatible with the referring expression. Instead of rule-based systems over dependency parse trees, neural networks can also be trained, taking word embeddings and the distance between mentions as features.
This significantly improves entity pair extraction by normalizing text, removing redundancies, and assigning entity pronouns.
If your use case is domain-specific, it would be worth your while to train a custom entity recognition model.
Knowledge graphs can be built automatically and explored to reveal new insights about the domain.
Notebook uploaded to neptune.ai.
Notebook on GitHub.
Knowledge graphs at scale
To effectively use the entire corpus of 1749 pages for our topic, use the columns created in the scrape_wikipedia function to add properties to each node. Then you can track the page and category of each node. You can use multiprocessing and parallel processing to reduce execution time.
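A minimal sketch of that parallelization with the standard library’s `multiprocessing.Pool`; the per-page work here is a placeholder for the real NLP steps:

```python
from multiprocessing import Pool

def extract_sentences(text):
    """Placeholder for the per-page work (sentence splitting, pair extraction)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def process_corpus(texts, workers=4):
    """Fan the per-page work out across a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(extract_sentences, texts)

if __name__ == "__main__":
    print(process_corpus(["Bolt won gold. He set a record.", "Graphs store facts."]))
```

`pool.map` preserves the input order, so results line up with the page list they came from.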
Some of the use cases of KGs are:
1. Question answering
2. Storing information
3. Recommendation systems
4. Supply chain management
Entity disambiguation and managing identity
In its simplest form, the challenge is assigning a unique normalized identity and a type to an utterance or a mention of an entity.
Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be associated with the wrong facts, resulting in incorrect inference downstream.
Type membership and resolution
Most knowledge-graph systems today allow each entity to have multiple types, with specific types for different circumstances. Cuba can be a country or it can refer to the Cuban government. In some cases, knowledge-graph systems defer the type assignment to runtime. Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.
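As a toy illustration of runtime type resolution (the record schema and attribute names are invented for this sketch):

```python
# One entity record carrying several types; the application picks the
# attribute view relevant to its task at runtime.
entity = {
    "name": "Cuba",
    "types": {
        "country": {"capital": "Havana"},
        "government": {"head_of_state": "President of Cuba"},
    },
}

def resolve(entity, task_type):
    """Return the attribute view of an entity for the type a task needs."""
    return entity["types"].get(task_type, {})
```

A travel application would call `resolve(entity, "country")`, while a news application about policy might call `resolve(entity, "government")`.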
Managing changing knowledge
An effective entity-linking system needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break a single existing entity into multiple.
When a company acquires another company, does the acquiring company change its identity? Does identity follow the acquisition of the rights to a name? For example, in the case of KGs constructed in the healthcare industry, patient data will change over a period of time.
Knowledge extraction from multiple structured and unstructured sources
The extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing graphs at scale requires both manual approaches and unsupervised or semi-supervised knowledge extraction from unstructured data in open domains.
Managing operations at scale
Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs.
Note: for more details on how different tech giants implement industry-scale knowledge graphs in their products, and the related challenges, check this article.
I hope you’ve learned something new here, and this article helped you understand web scraping, knowledge graphs, and a few useful NLP concepts.
Thanks for reading, and keep on learning!