Wiki Game Web Crawler

David Miller
5 min readMay 2, 2021

--

Wikipedia hosts a plethora of information on pretty much everything from science to current global trends. So much information that it will almost have everything that you may want to look up. There is a game that asks a deceptively simple question. If I start on one Wikipedia page how can I get to another?

For example how many clicks will it take to get from Alan Turing to Computers? The answer is one since there is a direct link on his Wikipedia page to computers. If we do a harder example such as Alan Turing to the Mona Lisa, then the path is much harder to figure out.

I created a Web Crawler that solves the wiki game. For my final project and have my (messy) code available on GitHub here — https://github.com/Kanghuaroo/WikiGame

Here is a quick rundown of the problems I had to solve.

  • The first problem was to get all of the links out of a Wikipedia page
  • Then we had to remove duplicate links as well as helper/meta links, like links to the homepage for example
  • After that is the big boy search algorithm
Dijkstra’s Algorithm aka Breadth First Search

And that’s all the Python code you need to solve the Wiki game. The only problem is that this solution is slow and dumb.

The first test show below went fine as the link was directly on the page we were searching for, being able to find the answer in about .5 secs

Alan Turing -> Computer Scientist

And it was even fine for when there were two clicks needed to get to it’s destination

Alan Turing -> Data

But links with a depth of three took wayyyyy too long. I needed to optimize the algorithm. Without getting too in depth with Computer Science terms I needed a way to see if we had already been to a link or not so that we do not search the same path of links twice. I was keeping everything in a list that we could traverse from start to end checking everything in it to make sure that we never saw our current link. But this was slow due to there being an immense number of links as shown below.

Number of links after scraping 8 links

After searching 8 links we have 1597 links we have seen. This is our of 408 links on the Alan Turing Wikipedia page. We haven’t even gotten past the first page!

The fix I had was using these thing called hashes. Long story short I use a lot of fancy math under the hood to get a unique id for each link I pass it. This way I can directly go to that link’s spot in line instead of going through the entire list of links we have seen so far.

Now it works for a depth of 3 links between the start and end.

Alan Turing -> Algebraic Number

…But it still takes a long time. 11 minutes and 11 seconds of computation to be exact. Unfortunately this may be the fastest time available to me without rewriting the search algorithm.

Remember how I said that I was using a search algorithm that was slow and dumb? Well the algorithm has a name, and it’s called Dijkstra’s Algorithm, which is also known as a Breadth First Search.

Example of Node Traversal

Above is a picture example of how it searches a graph, each node is a wiki page in our problem. If you notice it checks all the nodes at each level before going to the next. I.E. We check every link on the Alan Turing wiki page before we can search the next level of links. This means this—

Searching the Snow White Links for links to “Data”

Pardon the debug output, but what’s happening here is that I am trying to get from Alan Turing to Data, but we are busy look at Snow White and the Seven Dwarfs Movie from 1937. A normal person has “intelligence” and would assume, and rightfully so, that searching here for a math/computer related topic is pointless. And that is correct, but computers are not ”intelligent”. I’ll spare you the details of what “intelligence” means in this paragraph, but I want to highlight that the computer is not able to “know” that this link is a dead end, but has to search it anyways.

There is a solution of sorts that makes the algorithm “smarter” called A* (A-star) search, but it relies on a way to “know” where to go. A compass of sorts to direct us in the right direction. But I could not figure out a solution as I would have had to delve into Natural Language Processing which is something too complex for the time frame I have to make this program.

But my bot works! And I call that a win. While it could be improved upon I accomplished what I set out to do which was to solve the wiki game with an optimal link path.

--

--

David Miller

A comp sci major that likes to bend technology to my will