In this post I walk through building a knowledge graph from the South Park wiki. I use the position of hyperlinks on each page to dynamically assign node labels, relationship types, and properties. This approach should carry over easily to other wikis and websites, and lets you skip a lot of manual work defining the model.
Example above: a cluster made of only characters with black hair. You can find a data dump of the complete graph here.
Designing a data model
The South Park Official Wiki is a great example of a wiki that can be turned into a graph. The page for Stan Marsh has over 200 links to other pages on the same wiki, all of which can be modeled as relationships. Since simply describing these links as generic relationships is not that interesting, I describe a method to assign relationship types based on a link’s location on a page. As an added bonus, I extract a unique icon from every page, which could be used for cool graph visualizations.
There are some great online tools for drawing graph data models. Check out arrows, a web app for simple and quick graph modeling.
Browsing through the wiki, I found a lot of detail on the personal relationships between the characters. On a character’s page, relationships to other characters look like this:
This always has the same format:
- A header with the type of relationship. Here: Close friends.
- Character icons with hyperlinks to other wiki pages.
I dynamically generate relationship types based on this convention: I parse the section header, prefix it with HAS_, and convert the last word to its singular form:
Next, I use the URL of the webpage as a node ID, leaving out the prefix
https://wiki.southpark.cc.com/wiki/. The parsed relationships, ready to be put into Neo4j, then look like this:
Obviously this simple method doesn’t produce good names for every section header, but for most of the relationships it does the trick.
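A minimal sketch of this naming convention in Python (the function names are my own, and the real script may singularize more carefully than stripping a trailing ‘s’):

```python
WIKI_PREFIX = "https://wiki.southpark.cc.com/wiki/"

def relationship_type(header: str) -> str:
    """Turn a section header like 'Close Friends' into 'HAS_CLOSE_FRIEND'."""
    words = header.strip().upper().split()
    # Naive singularization: strip a trailing 'S' from the last word.
    if words and words[-1].endswith("S"):
        words[-1] = words[-1][:-1]
    return "HAS_" + "_".join(words)

def node_id(url: str) -> str:
    """Use the wiki URL minus the shared prefix as a node ID (Python 3.9+)."""
    return url.removeprefix(WIKI_PREFIX)

relationship_type("Close Friends")  # 'HAS_CLOSE_FRIEND'
node_id(WIKI_PREFIX + "Stan_Marsh")  # 'Stan_Marsh'
```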
Adding dynamic node labels
To make our graph more interesting and easier to use, we assign labels to the nodes. We can do this easily by looking at a node’s incoming relationships. For example, given a relationship:
Stan_Marsh gets the additional label Classmate. This will make querying for nodes a lot easier later on.
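One way to sketch this label derivation, using the same convention as the relationship types (the helper name is hypothetical):

```python
def label_from_relationship(rel_type: str) -> str:
    """Derive a node label from an incoming relationship type,
    e.g. 'HAS_CLASSMATE' -> 'Classmate', 'HAS_CLOSE_FRIEND' -> 'CloseFriend'."""
    words = rel_type.removeprefix("HAS_").split("_")
    return "".join(word.capitalize() for word in words)

label_from_relationship("HAS_CLASSMATE")  # 'Classmate'
```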
Adding node properties
From the page body, I extract five basic properties:
id, name, wiki_url, image_url, group_name. Next, you’ll find that the wiki page for every character has an ‘info box’ with some character properties:
Parsing these provides good properties for a node. In some cases, these might even be better modeled as relationships:
For now, we take
age, gender, full_name, hair_color from the info box as properties, and model the rest as relationships.
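A sketch of this split, assuming the info box has already been scraped into a flat dict (the function name is my own; the property fields follow the list above):

```python
# Info-box fields kept as node properties; everything else (e.g. relatives)
# becomes a relationship to another node.
PROPERTY_FIELDS = {"age", "gender", "full_name", "hair_color"}

def split_infobox(infobox: dict) -> tuple[dict, dict]:
    """Split scraped info-box fields into node properties and relationships."""
    properties = {k: v for k, v in infobox.items() if k in PROPERTY_FIELDS}
    relationships = {k: v for k, v in infobox.items() if k not in PROPERTY_FIELDS}
    return properties, relationships
```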
Importing into Neo4j
Importing data into Neo4j is easy. I opted for the data importer command as opposed to using
LOAD CSV, but both should work fine for small datasets.
Given that the nodes and relationships are in the right format, all it takes is a single command:
After import, the graph has the following properties:
To check if everything works, I took a look at the Marsh family tree:
It works! There’s still some cleaning to do, though: you’ll notice that Shelly has two
HAS_BROTHER relationships going to Stan, an artifact of scraping both the info box and the wiki sections.
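One way to clean this up before import is to deduplicate on (source, type, target) triples; a sketch, assuming relationships are stored as dicts with those keys:

```python
def dedupe_relationships(relationships: list[dict]) -> list[dict]:
    """Drop duplicate (source, type, target) triples, e.g. a HAS_BROTHER
    edge scraped from both the info box and a page section."""
    seen = set()
    unique = []
    for rel in relationships:
        key = (rel["source"], rel["type"], rel["target"])
        if key not in seen:
            seen.add(key)
            unique.append(rel)
    return unique
```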
Let’s see the four main boys and their relationships to their fellow 4th graders:
You’ll quickly see this gets a lot busier, with almost all relationships shared amongst the four boys. If you look closely, you’ll already find small inconsistencies in the wiki: Kenny considers Stan his best friend, but the feeling is not mutual (poor Kenny).
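One-way relationships like this can also be found programmatically; a sketch, again assuming relationships as dicts with source/type/target keys (function name hypothetical):

```python
def one_way(relationships: list[dict], rel_type: str) -> list[tuple[str, str]]:
    """Find edges of the given type with no reciprocal edge,
    e.g. a HAS_BEST_FRIEND that isn't returned by the other character."""
    edges = {(r["source"], r["target"])
             for r in relationships if r["type"] == rel_type}
    return sorted((s, t) for s, t in edges if (t, s) not in edges)
```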
To make more sense of the graph, we’ll need some better visualization tools and smarter clustering of the data. More about this in a future post.
There are a lot of roads to go down from here:
- First of all, there’s a lot to improve in the dynamic naming of node labels and relationship types. Better synonyms, filtering out rare relationship types, and grouping similar relationship types together could make the graph a lot cleaner.
- Next, there are many pages on the wiki not yet embedded in the graph. There are a ton of location pages with properties, as well as other interesting pages to extract.
- Third, we need better visualization tools. Since we have icons for all of the characters, it would be cool if we could visualize the entire graph with character icons.
Download and Source
The script I used for scraping can be found on GitHub, along with the resulting CSVs. As the layout of the wiki is subject to change, using the script might require some tweaking.