Building the South Park Knowledge Graph

November 1, 2019 - Neo4j-projects - 5 minute read

In this post I discuss the process of building a knowledge graph based on the South Park wiki. I use the position of hyperlinks on the webpage to dynamically assign node labels and relationship types, as well as properties. This type of approach should be easily applicable to other wikis or websites, and allows you to skip a lot of manual work defining the model.

south-park-wiki-page-layout

Example above: a cluster made of only characters with black hair. You can find a data dump of the complete graph here.

Designing a data model

The South Park Official Wiki is a great example of a wiki that can be turned into a graph. The page for Stan Marsh has over 200 links to pages on the same wiki, which can all be modeled as relationships. As simply describing these links as relationships is not that interesting, I describe a method to assign relationship types based on a link’s location on a page. As an added bonus, I extract a unique icon from every page, which could be used for cool graph visualizations.

Even though I’m aiming for dynamically generated node labels and relationship types, I made a quick sketch of the data model to get an idea of what is happening:

south-park-wiki-page-layout

There are some great online tools for drawing graph data models. Check out arrows, a web app for simple and quick graph modeling.

Extracting relationships

Browsing through the wiki, I found a lot of details on the personal relationships between the characters. On a character’s page, relationships to other characters look like this:

south-park-wiki-page-layout

This always has the same format:

  • A header with the type of relationship. Here: Close friends.
  • Character icons with hyperlinks to other wiki pages.

I dynamically generate relationships based on this convention. I parse the section header, put it after HAS_ and convert the last word to its singular form:

convert_to_rel_syntax('Close Friends') = 'HAS_CLOSE_FRIEND'
convert_to_rel_syntax('Family') = 'HAS_FAMILY'
convert_to_rel_syntax('Residence') = 'HAS_RESIDENCE'
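
A minimal sketch of what such a function could look like (this is not the actual scraper code). The singularization rule here is an assumption: it simply drops a trailing "s", which is enough for the three examples above but will mangle words like "Status".

```python
import re


def singularize(word: str) -> str:
    """Naive singular form: drop one trailing 's', but leave 'ss' endings alone."""
    lowered = word.lower()
    if lowered.endswith("s") and not lowered.endswith("ss"):
        return word[:-1]
    return word


def convert_to_rel_syntax(header: str) -> str:
    """Turn a section header such as 'Close Friends' into 'HAS_CLOSE_FRIEND'."""
    # Keep only alphanumeric words so punctuation does not leak into the type.
    words = re.findall(r"[A-Za-z0-9]+", header)
    if not words:
        raise ValueError(f"no usable words in header: {header!r}")
    # Only the last word is converted to its singular form.
    words[-1] = singularize(words[-1])
    return "HAS_" + "_".join(word.upper() for word in words)


if __name__ == "__main__":
    for header in ("Close Friends", "Family", "Residence"):
        print(header, "->", convert_to_rel_syntax(header))
```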

Next, I use the URL of the webpage as a node ID, leaving out the prefix https://wiki.southpark.cc.com/wiki/. The parsed relationships to be put into Neo4j then look like this:

Stan_Marsh,HAS_CLOSE_FRIEND,Eric_Cartman
Stan_Marsh,HAS_FAMILY,Randy_Marsh
Stan_Marsh,HAS_RESIDENCE,Marsh_Residence

Obviously this simple method doesn’t produce good names from all section headers, but for most of the relationships it does the trick.
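
The node ID and triple format described above can be sketched as follows. The helper names url_to_node_id and make_triple are hypothetical, and the sketch assumes every link is an absolute URL starting with the wiki prefix.

```python
WIKI_PREFIX = "https://wiki.southpark.cc.com/wiki/"


def url_to_node_id(url: str) -> str:
    """Strip the wiki prefix so that only the page name remains as the node ID."""
    if not url.startswith(WIKI_PREFIX):
        raise ValueError(f"not a wiki page URL: {url!r}")
    return url[len(WIKI_PREFIX):]


def make_triple(source_url: str, rel_type: str, target_url: str) -> str:
    """Build one 'source,TYPE,target' line from two page URLs."""
    return ",".join(
        [url_to_node_id(source_url), rel_type, url_to_node_id(target_url)]
    )


if __name__ == "__main__":
    print(
        make_triple(
            WIKI_PREFIX + "Stan_Marsh",
            "HAS_CLOSE_FRIEND",
            WIKI_PREFIX + "Eric_Cartman",
        )
    )
```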

Adding dynamic node labels

To make our graph more interesting and easier to use, we assign some labels to the nodes. By looking at incoming relationships into a node, we can do this easily. For example, given a relationship:

Scott_Malkinson,HAS_CLASSMATE,Stan_Marsh

The node Stan_Marsh then gets the additional label Classmate. This will make querying for nodes a lot easier in the future.
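
As a rough sketch of this step, labels can be collected by walking over the parsed triples and looking at the target of each one. The function names are hypothetical, and writing multi-word types in CamelCase (HAS_CLOSE_FRIEND becomes CloseFriend) is an assumption of this example.

```python
from collections import defaultdict


def rel_type_to_label(rel_type: str) -> str:
    """Convert 'HAS_CLASSMATE' to 'Classmate' (CamelCase for multi-word types)."""
    name = rel_type[len("HAS_"):] if rel_type.startswith("HAS_") else rel_type
    return "".join(part.capitalize() for part in name.split("_") if part)


def labels_from_incoming(triples):
    """Map every target node to the set of labels implied by its incoming relationships."""
    labels = defaultdict(set)
    for _source, rel_type, target in triples:
        labels[target].add(rel_type_to_label(rel_type))
    return dict(labels)


if __name__ == "__main__":
    triples = [("Scott_Malkinson", "HAS_CLASSMATE", "Stan_Marsh")]
    print(labels_from_incoming(triples))
```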

Adding node properties

From the page body, I extract five basic properties: id, name, wiki_url, image_url, group_name. Next, you’ll find that the wiki page for every character has an ‘info box’ with some character properties:

south-park-wiki-page-layout

Parsing these provides good properties for a node. In some cases, these might even be better modeled as relationships:

Stan_Marsh,HAS_FATHER,Randy_Marsh

For now, we take age, gender, full_name, hair_color from the info box as properties, and model the rest as relationships.
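
The split between properties and relationships could look roughly like the sketch below. It assumes the info box has already been parsed into a dictionary, and that the value of every non-property row is already the node ID of the linked page. The function name split_info_box and the sample values are only illustrative.

```python
# Info-box keys that stay on the node as plain properties.
PROPERTY_KEYS = {"age", "gender", "full_name", "hair_color"}


def normalize_key(key: str) -> str:
    """'Hair Color' -> 'hair_color'."""
    return "_".join(key.strip().lower().split())


def split_info_box(node_id: str, info_box: dict):
    """Return (properties, relationships) for one parsed info box."""
    properties = {}
    relationships = []
    for raw_key, value in info_box.items():
        key = normalize_key(raw_key)
        if key in PROPERTY_KEYS:
            properties[key] = value
        else:
            # Everything else becomes a HAS_* relationship to the linked page.
            relationships.append((node_id, "HAS_" + key.upper(), value))
    return properties, relationships


if __name__ == "__main__":
    box = {"Gender": "Male", "Hair Color": "Black", "Father": "Randy_Marsh"}
    print(split_info_box("Stan_Marsh", box))
```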

Importing into Neo4j

Importing data into Neo4j is easy. I opted for the data importer command as opposed to using LOAD CSV, but both should work fine for small datasets. Given that the nodes and relationships are in the right format, all it takes is a single command:

./bin/neo4j-admin import --nodes nodes.csv --relationships edges.csv
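
The exact CSV layout is not shown here, so the following is only a sketch of what the "right format" could look like. It assumes the standard header conventions of the import tool (:ID, :LABEL, :START_ID, :TYPE, :END_ID, with multiple labels separated by a semicolon). The column selection and the function names are assumptions.

```python
import csv
import io


def nodes_csv(nodes) -> str:
    """nodes: iterable of (node_id, name, labels). Returns the text of nodes.csv."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id:ID", "name", ":LABEL"])
    for node_id, name, labels in nodes:
        # Multiple labels go into one column, separated by ';'.
        writer.writerow([node_id, name, ";".join(sorted(labels))])
    return buffer.getvalue()


def edges_csv(triples) -> str:
    """triples: iterable of (source, rel_type, target). Returns the text of edges.csv."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([":START_ID", ":TYPE", ":END_ID"])
    for source, rel_type, target in triples:
        writer.writerow([source, rel_type, target])
    return buffer.getvalue()


if __name__ == "__main__":
    print(nodes_csv([("Stan_Marsh", "Stan Marsh", {"Classmate", "CloseFriend"})]))
    print(edges_csv([("Stan_Marsh", "HAS_FAMILY", "Randy_Marsh")]))
```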

After import, the graph has the following properties:

4654 nodes 
6960 relationships 
23106 properties
270 unique relationship types

Results

To check if everything works, I took a look at the Marsh family tree:

MATCH (n)-[e:HAS_FATHER|HAS_MOTHER|HAS_SISTER|HAS_BROTHER*1..3]->(m)
WHERE n.name = "Stan Marsh"
RETURN *

south-park-wiki-page-layout

It works! There’s still some cleaning to do though: you’ll notice that Shelly has two HAS_BROTHER relationships going to Stan, an artifact originating from scraping both the info-box and the wiki sections.
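
One possible way to clean up such duplicates is to drop repeated triples before writing the edge file. This is a sketch only, assuming both scraping passes produce identical (source, type, target) tuples; the node ID Shelly_Marsh is used purely for illustration.

```python
def deduplicate(triples):
    """Remove repeated (source, rel_type, target) triples, keeping first-seen order."""
    seen = set()
    unique = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            unique.append(triple)
    return unique


if __name__ == "__main__":
    scraped = [
        ("Shelly_Marsh", "HAS_BROTHER", "Stan_Marsh"),  # from the info box
        ("Shelly_Marsh", "HAS_BROTHER", "Stan_Marsh"),  # from the wiki section
        ("Stan_Marsh", "HAS_SISTER", "Shelly_Marsh"),
    ]
    print(deduplicate(scraped))
```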

Let’s see the four main boys and the relationships to their fellow 4th graders:

MATCH (n:TheFourBoys)-[e]->(m)-[:HAS_GRADE]->(grade{name:"4th Grade"})
RETURN n, e, m

south-park-wiki-page-layout

You’ll quickly see this gets a lot busier, with almost all relationships shared amongst the four boys. If you look closely, you’ll already find small inconsistencies in the wiki: Kenny considers Stan his best friend, but this is not mutual (poor Kenny).

To make more sense of the graph, we’re going to need some better visualization tools and do some better clustering of the data. More about this in a future post.

What’s next?

There are a lot of roads to go down from here:

  • First of all, there’s a lot to improve in the dynamic naming of node labels and relationship types. Better synonyms, filtering rare relationship types and grouping relationship types together can make the graph a lot cleaner.
  • Next, there are many pages on the wiki not yet embedded in the graph. There’s a ton of locations with properties, as well as other interesting pages to extract.
  • Third, we need some better visualization tools. Since we have icons for all of the characters, it would be cool if we could visualize the entire graph with character icons.

Download and Source

The script I used for scraping can be found on GitHub. The resulting CSVs are also in the repo. As the layout of the wiki is subject to change, using the script might require some tweaking.
