See this MediaWiki link for more advanced formatting options
"Inline text" is the guts of Wikitext formatting. It covers every situation where "normal" text is allowed, such as image captions and table data, and, to an extent, headings and link text (how to enforce that restriction has yet to be worked out).
<source lang="bnf">
<inline-text> ::= <inline-element> [<inline-text>]
<inline-element> ::=
| <category-link>
| <internal-link>
| <external-link>
| <magic-link>
| <image-inline> | <gallery-block> | <media-inline>
| <text-with-formatting>
<text-with-formatting> ::=
| <formatting>
| <inline-html>
| <noparseblock>
| <magic-word>
| <open-guillemet> | <close-guillemet>
| <html-entity>
| <html-unsafe-symbol>
| <text>
| <random-character>
| (more missing?)...
<text> ::= { <harmless-character> }+
</source>
Detail:
The parser should try the options in order, with text matching only if all else fails. In particular, <category-link> should be matched before <internal-link>, because a category is a special type of link and we don't want the normal link parsing to occur.
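The ordered-choice behaviour described above can be sketched as follows. This is only an illustration: the rule table, the regular expressions and the function names are hypothetical and far simpler than the real link grammar.

```python
import re

# Hypothetical, heavily simplified rules, listed in the order they are tried.
# The category rule comes before the generic internal-link rule on purpose.
ORDERED_RULES = [
    ("category-link", re.compile(r"\[\[Category:[^\]|]+(\|[^\]]*)?\]\]")),
    ("internal-link", re.compile(r"\[\[[^\]]+\]\]")),
    ("text", re.compile(r"[A-Za-z0-9 ]+")),
]

def next_element(source, pos):
    """Return (kind, matched_text) for the first rule that matches at pos."""
    for kind, pattern in ORDERED_RULES:
        m = pattern.match(source, pos)
        if m:
            return kind, m.group(0)
    # Nothing else matched: consume one "random character".
    return "random-character", source[pos]

def tokenize(source):
    """Consume the whole string, always preferring earlier rules."""
    pos, out = 0, []
    while pos < len(source):
        kind, text = next_element(source, pos)
        out.append((kind, text))
        pos += len(text)
    return out
```

With this ordering, `[[Category:Foo]]` is never seen by the internal-link rule, which is the property the paragraph above asks for.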
The parser recognises validly constructed HTML entities and leaves them alone. <source lang="bnf">
<html-entity> ::= "&" <html-entity-name> ";"
| "&#" <decimal-number> ";"
| "&#x" <hex-number> ";"
<html-entity-name> ::= Sanitizer::$wgHtmlEntities (case sensitive)
(* "Aacute" | "aacute" | ... *)
</source>
These "unsafe" symbols are turned into HTML entities if they have not matched part of a valid HTML entity above. Single-character matching rules are probably not very efficient, so these should perhaps be combined with "text". <source lang="bnf">
<html-unsafe-symbol> ::= <unescaped-ampersand> | <unescaped-less-than> | <unescaped-greater-than>
<unescaped-ampersand> ::= "&"
<unescaped-less-than> ::= "<"
<unescaped-greater-than> ::= ">"
</source>
& becomes &amp;
< becomes &lt;
> becomes &gt;
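A minimal sketch of the two rules working together: valid entities are left alone, and any other unsafe symbol is escaped. It assumes that Python's `html.entities.name2codepoint` table is an acceptable stand-in for the parser's own entity-name list (the two are not guaranteed to be identical), and the function name is made up for the example.

```python
import re
from html.entities import name2codepoint

# Shape of <html-entity>: named, decimal or hexadecimal.
ENTITY = re.compile(r"&(?:([A-Za-z][A-Za-z0-9]*)|#([0-9]+)|#x([0-9A-Fa-f]+));")

ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

def escape_unsafe(text):
    """Escape &, < and > unless the & starts a valid HTML entity."""
    out, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch == "&":
            m = ENTITY.match(text, pos)
            # Named entities must be known (case sensitive); numeric ones pass.
            if m and (m.group(1) is None or m.group(1) in name2codepoint):
                out.append(m.group(0))
                pos = m.end()
                continue
        out.append(ESCAPES.get(ch, ch))
        pos += 1
    return "".join(out)
```

An unknown name such as `&bogus;` is not treated as an entity here, so its ampersand is escaped like any other.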
Harmless characters are characters that cannot be part of any other construct. I'm not sure how useful this distinction is, but perhaps it will help speed things up?
A "random character" is any character that has not matched anything else.
<source lang="bnf">
<harmless-character> ::= /[A-Za-z0-9]/ etc
<random-character> ::= ? any character ... ?
</source>
Both types are written literally.
This section from the "fundamental elements" section...time to mangle!
<source lang="bnf">
<character> ::= <whitespace-char> | <non-whitespace-char> | <html-entity>
<whitespace> ::= <whitespace-char> [<whitespace>] | EOF
<newlines> ::= <newline> [<newlines>]
<space-tabs> ::= <space-tab> [<space-tabs>]
<whitespace-char> ::= <space-tab> | <newline>
<space-tab> ::= <space> | TAB
<spaces> ::= <space> [<spaces>]
<space> ::= " "
<newline> ::= CR LF | LF CR | CR | LF
<BOL> ::= <newline> | BOF
<EOL> ::= <newline> | EOF
<non-whitespace-char> ::= <letter> | <decimal-digit> | <symbol>
<letter> ::= <ucase-letter> | <lcase-letter>
<ucase-letter> ::= "A" | "B" | ... | "Y" | "Z"
<lcase-letter> ::= "a" | "b" | ... | "y" | "z"
<symbol> ::= <html-unsafe-symbol> | <underscore> | "." | "," | ...
<underscore> ::= "_"
<decimal-number> ::= <decimal-digit> [<decimal-number>]
<decimal-digit> ::= "0" | "1" | ... | "8" | "9"
<hex-number> ::= <hex-digit> [<hex-number>]
<hex-digit> ::= <decimal-digit>
| "A" | "B" | "C" | "D" | "E" | "F"
| "a" | "b" | "c" | "d" | "e" | "f"
</source>
Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.
Some rules for parsing bold/italics as recognised by the current parser. These must be implemented (Brion said so). In increasing order of complexity:
Optimistic view: <source lang="bnf">
<formatting> ::= <bold-italic-toggle> | <bold-toggle> | <italic-toggle>
<bold-italic-toggle> ::= "'''''"
<bold-toggle> ::= "'''"
<italic-toggle> ::= "''"
</source>
Reality: <source lang="bnf">
<formatting> ::= <apostrophe-jungle>
<apostrophe-jungle> ::= "''" { "'" }
</source>
The following describes the behaviour of repeated apostrophes. "Bold" means "toggle bold", rather than "turn bold on". "Bold, italics" means "toggle bold and italics independently", rather than "turn bold and italics on" or "toggle bold and italics the same way".
': Always a single apostrophe.
hello ' blah → hello ' blah
'': Always italics on or off.
hello '' blah → hello blah (" blah" in italics)
''':
hello ''' blah → hello blah (" blah" in bold)
hello l'''amour'' l'''ouest''' blah → hello l'amour louest blah ("amour" in italics, "ouest" in bold)
hello mon'''amour'' blah → hello mon'amour blah ("amour" in italics)
hello '''amour'' '''blah '''blah → hello 'amour blah blah ("amour" in italics, the first "blah " in bold)
'''':
hello ''''amour''' now ''italics unbalanced, but that's ok → hello 'amour now italics unbalanced, but that's ok ("amour" in bold, "italics unbalanced, but that's ok" in italics)
hello ''''amour''' now, '''bold unbalanced, but that's ok → hello 'amour now, bold unbalanced, but that's ok ("amour" in bold, "bold unbalanced, but that's ok" in bold)
hello ''''amour''' now '''''bold and italics unbalanced, so invoke this special case → hello ''amour now bold and italics unbalanced, so invoke this special case ("amour" in italics, " now " in bold italics, the rest plain)
''''':
hello ''''' blah → hello blah (" blah" in bold italics)
hello '''''''''' blah → hello ''''' blah (" blah" in bold italics)
hello '''bold '''''''''' blah → hello bold ''''' blah ("bold '''''" in bold, " blah" in italics)
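The "store tokens representing the apostrophe clumps" idea can be sketched as a first pass like the one below. It only covers the length normalisation visible in the examples (a clump of four becomes a literal apostrophe plus a bold toggle, and a clump longer than five keeps its surplus as literal apostrophes). The special case for unbalanced bold and italics is deliberately left to the second pass, and the function names are invented for the example.

```python
import re

def apostrophe_tokens(line):
    """Split a line into text and apostrophe clumps, then normalise clumps.

    The result alternates text (even indexes) and clumps (odd indexes).
    """
    # The capturing group keeps the clumps of two or more apostrophes.
    parts = re.split(r"('{2,})", line)
    for i in range(1, len(parts), 2):
        n = len(parts[i])
        if n == 4:
            # One literal apostrophe followed by a bold toggle.
            parts[i - 1] += "'"
            parts[i] = "'''"
        elif n > 5:
            # Surplus apostrophes are literal; the last five toggle both.
            parts[i - 1] += "'" * (n - 5)
            parts[i] = "'''''"
    return parts

def toggle_counts(parts):
    """Count italic and bold toggles; a clump of five counts as both."""
    clumps = parts[1::2]
    italics = sum(1 for c in clumps if len(c) in (2, 5))
    bold = sum(1 for c in clumps if len(c) in (3, 5))
    return italics, bold
```

A second pass could then inspect `toggle_counts` and, when both counts are odd, decide which bold clump to reinterpret, as in the "special case" example above.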
The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.
A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.
A loose definition assuming they are treated individually: <source lang="bnf">
<InlineHTML> ::= <InlineHTML-Open> | <InlineHTML-Close> | <InlineHTML-OpenClose> | <HTMLComment>
<InlineHTML-Open> ::= "<" <InlineHTMLtagname> [<extra-characters>] ">"
<InlineHTML-Close> ::= "</" <InlineHTMLtagname> [<extra-characters>] ">"
<InlineHTML-OpenClose> ::= "<" <InlineHTMLtagname> [<extra-characters>] "/>"
<extra-characters> ::= <word-boundary-char> {characters - ">"}
<word-boundary-char> ::= " " | "-" | ":" | " " | "\"" | "/" | "*" | "#" | "!" | "$" | "%" | ...
</source>
if( preg_match( '!^(/?)(\\w+)([^>]*?)(/{0,1}>)([^<]*)$!', $x, $regs ) ) {
The significance of these groupings is shown as follows:
A <blockquote> B <span>C </blockquote> D </span> E
Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the span block, it is escaped.
This doesn't work:
<span>Some text [[Image:foo.jpg|close </span>it.]]
But this does:
<b>Some text [[Image:foo.jpg|close </b>it.]]
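To see what the `preg_match` pattern quoted earlier in this section captures, here is a direct Python port. It assumes the input is one piece of text that followed a "<" in the source (the pattern itself does not include the "<"), and the dictionary keys are illustrative names rather than anything taken from the parser.

```python
import re

# Same pattern as the PHP call; re.ASCII keeps \w limited to ASCII word chars.
TAG_PIECE = re.compile(r"^(/?)(\w+)([^>]*?)(/{0,1}>)([^<]*)$", re.ASCII)

def split_piece(piece):
    """Break one piece of text after a '<' into its five captured parts."""
    m = TAG_PIECE.match(piece)
    if not m:
        return None  # not tag-shaped; the caller would escape the '<'
    slash, name, params, brace, rest = m.groups()
    return {
        "slash": slash,    # "/" for a closing tag, otherwise ""
        "name": name,      # tag name
        "params": params,  # raw attribute text, not yet sanitised
        "brace": brace,    # ">" or "/>"
        "rest": rest,      # text up to the next "<"
    }
```

For example, `split_piece("br/>x")` reports an empty `params` and a `brace` of `"/>"`, which corresponds to the open-close form in the grammar above.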
<InlineHTMLTagname> " " <sanitized-attributes> > etc.
This is pretty trivial and used basically to improve the appearance of punctuation in French, which always places a space before certain punctuation, and places spaces inside guillemets. Other languages use these characters, but without the spaces. Currently performed directly in the parse() method.
<nbsp-before> ::= [any character] <space> ("»" | "?" | ":" | ";" | "!" | "%")
<nbsp-after> ::= "«" <space>
In both cases the space is replaced by a &nbsp; string.
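The two rules can be tried out with a pair of regular-expression substitutions. This is a sketch of the grammar as written above, not a copy of what parse() does, and the function name is hypothetical.

```python
import re

NBSP = "&nbsp;"

def french_spacing(text):
    """Replace the spaces described by <nbsp-before> and <nbsp-after>."""
    # <nbsp-before>: a space directly before one of » ? : ; ! %
    text = re.sub(r" (?=[»?:;!%])", NBSP, text)
    # <nbsp-after>: a space directly after «
    text = re.sub(r"« ", "«" + NBSP, text)
    return text
```

Text in languages that do not put spaces around these characters passes through unchanged, because both patterns require the space to be present.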
Not to be confused with magic links. Magic words appear to be usable virtually anywhere: even a table of contents in an image caption works. See m:Help:Magic words. <source lang="bnf">
<magic-word> ::= <magicword-toc> | <magicword-forcetoc> | <magicword-notoc> | <magicword-noeditsection> | <magicword-nogallery>
<magicword-toc> ::= mw("toc")
<magicword-forcetoc> ::= mw("forcetoc")
<magicword-notoc> ::= mw("notoc")
<magicword-noeditsection> ::= mw("noeditsection")
<magicword-nogallery> ::= mw("nogallery")
<magicword-defaultsort> ::= <openmagicvariable>, mw("defaultsort"), <defaultsort-key>, <closemagicvariable>
(* I don't really get how these work... *)
<openmagicvariable> ::= "{{"
<closemagicvariable> ::= "}}"
(* defaults, i->case insensitive, s->case sensitive *)
mw("toc") ::= "__TOC__"i
mw("forcetoc") ::= "__FORCETOC__"i
mw("notoc") ::= "__NOTOC__"i
mw("noeditsection") ::= "__NOEDITSECTION__"i
mw("nogallery") ::= "__NOGALLERY__"i
mw("defaultsort") ::= "DEFAULTSORT:"s | "DEFAULTSORTKEY:"s | "DEFAULTCATEGORYSORT:"s
</source>
Notes:
Localised magic words are defined in languages/messages/MessagesXx.php, where Xx is the language.
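As a rough illustration of the case-insensitive matching, the sketch below finds and strips the five switches using their default English spellings (`__TOC__`, `__FORCETOC__`, `__NOTOC__`, `__NOEDITSECTION__`, `__NOGALLERY__`). Localised aliases are not modelled, and the table and function names are invented for the example.

```python
import re

# Default English spellings mapped to the mw(...) names used in the grammar.
SWITCHES = {
    "__TOC__": "toc",
    "__FORCETOC__": "forcetoc",
    "__NOTOC__": "notoc",
    "__NOEDITSECTION__": "noeditsection",
    "__NOGALLERY__": "nogallery",
}

# The "i" flag in the grammar corresponds to re.IGNORECASE here.
PATTERN = re.compile("|".join(re.escape(s) for s in SWITCHES), re.IGNORECASE)

def extract_switches(text):
    """Return (names found in order, text with the switches removed)."""
    found = []

    def record(match):
        found.append(SWITCHES[match.group(0).upper()])
        return ""

    stripped = PATTERN.sub(record, text)
    return found, stripped
```

Because the switches are removed wherever they occur, this also reflects the observation that they can appear almost anywhere in inline text.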
Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here.
Originally from MetaWiki.
<source lang="bnf">
ImageInline ::= "[[" , "Image:" , PageName, ".", ImageExtension, { <Pipe>, ImageOption }, "]]" ;
ImageName ::= PageName, ".", ImageExtension
ImageExtension ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
ImageOption ::= ImageModeParameter | ImageSizeParameter | ImageAlignParameter
| ImageValignParameter | ImageOtherParameter | Caption
ImageModeParameter ::= ImageModeManualThumb | ImageModeAutoThumb | ImageModeFrame | ImageModeFrameless
ImageModeManualThumb ::= mw("img_manualthumb");
ImageModeAutoThumb ::= mw("img_thumbnail");
ImageModeFrame ::= mw("img_frame");
ImageModeFrameless ::= mw("img_frameless");
/* Default settings: */
mw("img_manualthumb") ::= "thumbnail=", ImageName | "thumb=", ImageName
mw("img_thumbnail") ::= "thumbnail" | "thumb";
mw("img_frame") ::= "framed" | "enframed" | "frame";
mw("img_frameless") ::= "frameless";
ImageOtherParameter ::= ImageParamPage | ImageParamUpright | ImageParamBorder
ImageParamPage ::= mw("img_page")
ImageParamUpright ::= mw("img_upright")
ImageParamBorder ::= mw("img_border")
/* Default settings: */
mw("img_page") ::= "page=$1" | "page $1" ??? (where is this used?)
mw("img_upright") ::= "upright" [, ["=",] PositiveInteger]
mw("img_border") ::= "border"
ImageSizeParameter ::= mw("img_width");
/* Default setting: */
mw("img_width") ::= PositiveNumber "px" ;
ImageAlignParameter ::= ImageAlignLeft | ImageAlignCenter | ImageAlignRight | ImageAlignNone
ImageAlignLeft ::= mw("img_left")
ImageAlignCenter ::= mw("img_center")
ImageAlignRight ::= mw("img_right")
ImageAlignNone ::= mw("img_none")
/* Default settings: */
mw("img_left") ::= "left"
mw("img_center") ::= "center" | "centre"
mw("img_right") ::= "right"
mw("img_none") ::= "none"
ImageValignParameter ::= ImageValignBaseline | ImageValignSub | ImageValignSuper | ImageValignTop
| ImageValignTextTop | ImageValignMiddle | ImageValignBottom | ImageValignTextBottom
ImageValignBaseline ::= mw("img_baseline")
ImageValignSub ::= mw("img_sub")
ImageValignSuper ::= mw("img_super")
ImageValignTop ::= mw("img_top")
ImageValignTextTop ::= mw("img_text_top")
ImageValignMiddle ::= mw("img_middle")
ImageValignBottom ::= mw("img_bottom")
ImageValignTextBottom ::= mw("img_text_bottom")
/* By default: */
mw("img_baseline") ::= "baseline"
mw("img_sub") ::= "sub"
mw("img_super") ::= "super" | "sup"
mw("img_top") ::= "top"
mw("img_text_top") ::= "text-top"
mw("img_middle") ::= "middle"
mw("img_bottom") ::= "bottom"
mw("img_text_bottom") ::= "text-bottom"
Caption ::= <inline-text>
</source>
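The option grammar above can be exercised with a small classifier. This is a sketch that uses only the default keywords listed in the grammar and skips the parameterised forms such as "thumb=", "upright=" and "page=". Splitting on "|" is naive and would break on captions that contain piped links, and all names in the code are hypothetical.

```python
import re

# Default keywords from the grammar, grouped by the kind of option they set.
KEYWORDS = {
    "thumbnail": "mode", "thumb": "mode", "framed": "mode",
    "enframed": "mode", "frame": "mode", "frameless": "mode",
    "left": "align", "center": "align", "centre": "align",
    "right": "align", "none": "align",
    "baseline": "valign", "sub": "valign", "super": "valign", "sup": "valign",
    "top": "valign", "text-top": "valign", "middle": "valign",
    "bottom": "valign", "text-bottom": "valign",
    "border": "other", "upright": "other",
}

WIDTH = re.compile(r"^[0-9]+px$")

def classify_option(option):
    """Classify one pipe-separated option; unknown text is the caption."""
    option = option.strip()
    if option in KEYWORDS:
        return KEYWORDS[option]
    if WIDTH.match(option):
        return "size"
    return "caption"

def parse_image_link(link):
    """Split '[[Image:...|opt|opt]]' into the target and classified options."""
    inner = link[2:-2]  # drop the surrounding [[ and ]]
    target, *options = inner.split("|")
    return target, [(classify_option(o), o.strip()) for o in options]
```

Anything that is not a recognised keyword or width falls through to "caption", which matches the grammar's treatment of Caption as plain inline text.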
<img> tag.
<source lang="bnf">
MediaInline ::= "[[" , "Media:" , PageName, ".", MediaExtension, "]]" ;
MediaExtension ::= "ogg" | "wav" ;
</source>
<source lang="bnf">
GalleryBlock ::= "<gallery>" Newline { GalleryImage } "</gallery>"
GalleryImage ::= (to be defined: essentially foo.jpg[|caption] )
</source>
Remarks: