MongoDB for WikiToLearn migration


Today I want to talk about my experience with the WikiToLearn migration.

The problem with every migration is getting your hands on the data in a form you can actually work with.

Going straight from the MySQL backend to a versioned object storage (Python Eve is the one we are trying now) in a single step is not an option.

The solution is to use a temporary database to hold the data, process it there, and then upload everything to the destination.

After some tries we managed to build a pipeline that reads all the MediaWiki pages, parses their structure and uploads everything to Eve, using MongoDB as temporary storage.
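The shape of that pipeline is roughly the following sketch. The collection names, the `parse_sections` helper and the Eve endpoint URL are all made up for illustration; real MediaWiki markup needs a proper parser, not a regex:

```python
import re

def parse_sections(wikitext):
    """Split MediaWiki markup into (heading, body) pairs.
    Very rough: only handles top-level '== Heading ==' lines."""
    parts = re.split(r"^==\s*(.+?)\s*==$", wikitext, flags=re.MULTILINE)
    sections = [("__intro__", parts[0].strip())]
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections.append((heading, body.strip()))
    return sections

if __name__ == "__main__":
    # Hypothetical pipeline: read raw pages from MongoDB, parse, push to Eve.
    from pymongo import MongoClient
    import requests

    pages = MongoClient()["wtl_migration"]["raw_pages"]
    for page in pages.find():
        doc = {"title": page["title"],
               "sections": parse_sections(page["wikitext"])}
        # Assumed Eve resource endpoint; the real one depends on the schema.
        requests.post("http://localhost:5000/pages", json=doc)
```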

But why mongo?

There are tons of databases and, after all, why use a DBMS as temporary storage at all?

Well, the first point is that MongoDB is an implicit-schema DBMS (there is no such thing as schema-less): you can add and remove fields at will without restructuring the whole data schema, and that speeds up quick hacks when you are testing.
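Concretely, two documents in the same collection do not need to share the same fields, and you can bolt a new field onto every existing document with a single update. A sketch, assuming pymongo and a local mongod; the collection and field names are made up:

```python
def make_page_doc(title, wikitext, **extra):
    """Build a page document; any extra fields just ride along,
    since MongoDB does not enforce a schema on the collection."""
    return {"title": title, "wikitext": wikitext, **extra}

if __name__ == "__main__":
    from pymongo import MongoClient
    pages = MongoClient()["wtl_migration"]["pages"]
    # Documents with different shapes coexist in one collection.
    pages.insert_one(make_page_doc("Calculus", "...", category="Math"))
    pages.insert_one(make_page_doc("Stub page", "..."))
    # Add a field to everything later, no schema migration needed.
    pages.update_many({}, {"$set": {"parsed": False}})
```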

The second point is speed: a DBMS is fast. You can try to build an in-memory representation of the data instead (I have tried), but it ends up either quite hard and quite slow, or effectively another DBMS, so why not use an existing one?

The next point is persistence: when you work with a non-trivial dataset, it is quite nice to be able to re-run only a part of the migration, as this speeds up development.
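In practice this can be as simple as flagging each document once a stage has processed it, and filtering on that flag the next time around. A sketch with hypothetical stage and field names:

```python
def stage_filter(stage):
    """MongoDB query selecting documents a given stage has not processed yet."""
    return {f"done.{stage}": {"$ne": True}}

def mark_done(stage):
    """Update document marking it as processed by the given stage."""
    return {"$set": {f"done.{stage}": True}}

if __name__ == "__main__":
    from pymongo import MongoClient
    pages = MongoClient()["wtl_migration"]["pages"]
    # Re-running the "parse" stage only touches unprocessed pages.
    for page in pages.find(stage_filter("parse")):
        # ... do the actual parsing work here ...
        pages.update_one({"_id": page["_id"]}, mark_done("parse"))
```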

MongoDB also has mongodump and mongorestore, which let you snapshot everything and restore from a “checkpoint”.
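For instance (the database name and paths here are just placeholders):

```shell
# Snapshot the whole migration database to ./dump/wtl_migration
mongodump --db wtl_migration --out ./dump

# Later, roll back to that checkpoint (--drop discards the current data first)
mongorestore --drop --db wtl_migration ./dump/wtl_migration
```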

I hope I’ve given you some good points to think about the next time you have to migrate from an old datastore to a new one.