MediaWiki Markup

This is a copy-and-paste port of MediaWiki's syntax parser to Python. You'll probably have to edit it to fit your needs.

MediaWiki-style markup: parse(text) returns safe HTML from wiki markup. The code is based on MediaWiki.

Answers (4):

OK, I officially give up. That parser has too many bugs to be usable. Suggestions to people reading this far: use mwlib, or parse the print version of the article on Wikipedia.
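
For the second suggestion, a minimal sketch of pulling paragraph text out of a print-version page might look like this. It assumes the wiki honours the printable=yes query parameter and that plain paragraph text is all that is needed. printable_url, ParagraphText and extract_paragraphs are hypothetical names, and actually fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def printable_url(title, base="https://en.wikipedia.org/w/index.php"):
    # Assumption: the wiki accepts printable=yes; adjust base for other wikis.
    return base + "?" + urlencode({"title": title, "printable": "yes"})


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element, dropping nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._chunks).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = '<div><p>Hello <b>wiki</b> world.</p><p> </p><p>Second <a href="/x">link</a>.</p></div>'
print(extract_paragraphs(sample))
```

This only keeps paragraph text. Tables, lists and infoboxes would need their own handling.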

There are several methods that are not defined here and need to be commented out or implemented. The only way to find them is to parse a large number of articles and see where the code crashes. Here's an implementation of findColonNoLinks, which is missing, translated to Python from a PHP snippet I found while googling:

# Split a string on ':', ignoring any occurrences inside
# <a>...</a> or <span>...</span>.
# @param text the string to split
# Returns a tuple (colon_pos, before, after):
#   colon_pos  the position of the ':', or -1 if none found
#   before     everything before the ':'
#   after      everything after the ':'
def findColonNoLinks(text):
    # I wonder if we should make this count all tags, not just <a>
    # and <span>. That would prevent us from matching a ':' that
    # comes in the middle of italics or other such formatting....
    # -- Wil
    pos = 0
    while True:
        colon_pos = text.find(':', pos)
        if colon_pos < 0:
            # No usable ':' left in the string
            return -1, '', ''
        before = text[:colon_pos]
        after = text[colon_pos + 1:]
        # Skip any ':' within <a> or <span> pairs
        a = before.count('<a')
        s = before.count('<span')
        ca = before.count('</a>')
        cs = before.count('</span>')
        if a <= ca and s <= cs:
            # Tags are balanced before ':'; ok
            return colon_pos, before, after
        # This ':' sits inside an open tag pair; look for the next one
        pos = colon_pos + 1

You also need to change the two calls to it to this:

colon, term, t2 = findColonNoLinks(t)
if colon != -1:
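
Since a missing colon is reported as -1, the result is safest compared against -1 rather than tested for truthiness: -1 is truthy, while a colon at position 0 is falsy. Here is a condensed, self-contained sketch of that behaviour. find_colon_no_links restates the function above in compact form, and split_definition is a hypothetical caller, not part of the parser.

```python
def find_colon_no_links(text):
    """Return (colon_pos, before, after); colon_pos is -1 if no usable ':' exists."""
    pos = 0
    while True:
        colon_pos = text.find(':', pos)
        if colon_pos < 0:
            return -1, '', ''
        before, after = text[:colon_pos], text[colon_pos + 1:]
        # A ':' counts only if every <a> and <span> opened before it is closed
        if (before.count('<a') <= before.count('</a>')
                and before.count('<span') <= before.count('</span>')):
            return colon_pos, before, after
        pos = colon_pos + 1


def split_definition(line):
    """Hypothetical caller: split 'term: definition', or return None."""
    colon, term, definition = find_colon_no_links(line)
    if colon != -1:  # a plain `if colon:` would reject a colon at index 0
        return term.strip(), definition.strip()
    return None


print(split_definition('term: definition'))
print(split_definition('<a href="http://example.org">x</a>: y'))
print(split_definition(': indented'))
print(split_definition('no colon here'))
```

The second call skips the ':' inside the href because the <a> tag is still open at that point. The third call is the case that a truthiness test would get wrong.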

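Here's a modified _linkPat regexp that allows for links containing non-ASCII characters and commas:

import re

# Python 3 syntax; under Python 2, use the ur'' prefix instead of r''.
# The '-' is escaped so that it is read as a literal hyphen, not a range.
_linkPat = re.compile(r'^([A-Za-z0-9\s]+:)?([A-Za-z0-9.,\-\s/\x80-\xFF]+)(?:\|([^\n]+?))?]](.*)$', re.UNICODE | re.DOTALL)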
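
A quick way to see what the pattern captures is to run it on a few link bodies. The sketch below assumes the pattern is applied to the text that follows the opening [[ (inferred from the trailing ]] in the pattern) and escapes the hyphen in the character class so that it compiles under Python 3. link_pat and split_link are hypothetical names.

```python
import re

# Hyphen escaped so the class compiles under Python 3 (an assumption about intent).
link_pat = re.compile(
    r'^([A-Za-z0-9\s]+:)?([A-Za-z0-9.,\-\s/\x80-\xFF]+)(?:\|([^\n]+?))?]](.*)$',
    re.UNICODE | re.DOTALL)


def split_link(text):
    """Return (namespace, target, label, trailing) for text after '[[', or None."""
    m = link_pat.match(text)
    return m.groups() if m else None


print(split_link('Main Page]] rest'))
# 'Caf\u00e9' contains U+00E9, which falls inside the \x80-\xFF range
print(split_link('Category:Caf\u00e9, Paris|label]] trailing'))
# U+0141 and U+017A lie above U+00FF, so this target is not matched
print(split_link('\u0141\u00f3d\u017a]]'))
```

On Unicode strings the \x80-\xFF range covers only the Latin-1 block, so the last call returns None. Titles in other scripts would need a wider character class.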

Thanks -- just in time for a project of mine ☺