Flare to MediaWiki to Flare (part 4, Starting to Map MediaWiki Markup to Flare Topic XML)

Parsing MediaWiki wiki markup is a challenge. There is no markup spec against which to build a parser. But parsers exist. There would be no Wikipedia without one.

The original plan for this blog series was to use an existing library to parse wikitext. That may still be the solution. But to convert one markup to another, a good map comes in handy. Here is the simplest version of the map: MediaWiki wikitext > Flare topic XML. The nuts and bolts, though, are much more complicated.

There is a wiki page on the MediaWiki.org website that discusses the possibility of a spec: Markup spec. The conclusion I draw from reading that page is that the rules for wikitext on a MediaWiki wiki are defined by the parser for that MediaWiki site. In other words, if you want your wiki markup to work, determine the parser behavior.

Writing markup to suit the needs of a parser isn’t especially shocking. Web developers constantly tweak their HTML to accommodate the whims of web browser makers. At least with this markup, there is really only one parser about which to be concerned, the MediaWiki parser. How does that parser behave?

The Markup spec page describes the parser's actions. Notably, the outline describes a preprocessor, a parser, and a save parser with different behaviors. This is a potential model for different parsing behaviors in the round trip. A section of the page describes parts of the markup language. For example, content within single equals signs (=…=) is a first-level heading. That will probably be mapped to <h1>…</h1>, possibly with an attribute to indicate that the heading is derived from wikitext: <h1 class="MediaWikiFirstLevelHeading">…</h1>. Then again, one could just treat all h1 tags the same. We'll see as we get deeper into the mapping.
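As a rough illustration of the heading example, here is a minimal Python sketch that maps a single first-level heading line to an h1 element. The function name heading_to_xml, the tag_origin flag, and the regular expression are my own assumptions for illustration. The class name is the one floated above. This is not how the MediaWiki parser itself works, and it ignores edge cases such as unbalanced equals signs.

```python
import re
from xml.sax.saxutils import escape

# Marker class from the example in this post (not a Flare or MediaWiki standard).
HEADING_CLASS = "MediaWikiFirstLevelHeading"

# A whole line of the form =text=, with exactly one equals sign on each side.
_H1 = re.compile(r"^=(?!=)\s*(.+?)\s*(?<!=)=\s*$")


def heading_to_xml(line, tag_origin=True):
    """Return an <h1> element for a first-level wikitext heading, or None."""
    match = _H1.match(line)
    if match is None:
        return None  # not a first-level heading; leave it to other rules
    text = escape(match.group(1))  # escape &, <, > for XML
    if tag_origin:
        return f'<h1 class="{HEADING_CLASS}">{text}</h1>'
    return f"<h1>{text}</h1>"
```

Calling heading_to_xml("= Overview =") yields the h1 with the marker class, while a second-level heading such as "== Details ==" returns None so that a different rule can handle it. Passing tag_origin=False models the "treat all h1 tags the same" option.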

To avoid rework, decisions about how to carry metadata from the wikitext into the Flare topic XML should be made at this point. If metadata is to be conveyed through markup, then conveying it is part of the mapping process. An alternative is to store that information in a separate repository, like a database. But maintaining the information in the Flare topic itself seems like the more convenient option at this point.
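To make the "metadata lives in the topic" option concrete, here is a small Python sketch of the return trip: it reads an h1 element and reports whether it carries a marker class saying it came from wikitext. The function name h1_to_wikitext and the marker class are hypothetical, and a real Flare topic would be parsed as a whole document rather than one fragment at a time.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker class; any agreed-upon attribute value would do.
HEADING_CLASS = "MediaWikiFirstLevelHeading"


def h1_to_wikitext(xml_fragment):
    """Convert an <h1> fragment to wikitext.

    Returns (wikitext, from_wiki), where from_wiki is True when the
    element carries the marker class.
    """
    element = ET.fromstring(xml_fragment)
    if element.tag != "h1":
        raise ValueError("expected an <h1> element")
    # Collect the text, including text inside any inline child elements.
    text = "".join(element.itertext()).strip()
    classes = element.get("class", "").split()
    return f"= {text} =", HEADING_CLASS in classes
```

With the metadata stored in the class attribute, no database lookup is needed to tell which headings originated in the wiki. The cost is that every mapping rule has to decide what marker, if any, it writes.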

For the next post, I’ll build a table with three columns: one for the wikitext tokens, one for the Flare topic XML tags, and one to explain the mapping behavior. That process will go on for a few posts until the mapping is satisfactory.
