Appeal for help, because I’m terrible at regex

One of the most annoying things about using Google Docs is that none of the styles are inline. It used to be that bold text was wrapped in a <strong> tag and italic text was wrapped in an <em> tag. No longer. Now each style of text is wrapped in a span with a number of different classes applied to it. Those styles don’t carry through when we bring the text into WordPress, and the names of the classes vary from article to article. This can be very annoying for columnists who bold names of subjects, for example.

So, what I’m looking for is a regular expression to turn <span class="c0 c3">My text</span> into <span class="c0 c3"><strong>My text</strong></span>, where class c3 is the bold class, for example.


11 Responses to Appeal for help, because I’m terrible at regex

  1. Rob Flaherty says:

    I assume you’re doing this in PHP… Did you consider using something like Simple HTML DOM instead of a regex? Might come in handy later down the line if you have to transpose more cases.

    http://simplehtmldom.sourceforge.net/

  2. Bill says:

    I’m bad at regex, too. But, could you use something like phpQuery for this? http://code.google.com/p/phpquery/

    That is, using jQuery-like syntax, look for span.c3 and then insert the tags appropriately?

  3. Andrew Nacin says:

    Parsing HTML with regular expressions is really difficult (and Wrong), especially in this case where you then need to do things like matching tags that could potentially be nested.

    If you ignore this situation: <span class="c0 c3">My <span>text</span>is bold </span> then the regular expression is simple:

    $content = preg_replace( '#<span class="c0 c3">(.*?)</span>#s', '<span class="c0 c3"><strong>$1</strong></span>', $content );

    Note the “s” pattern modifier, which ensures that “.” also matches line breaks between the opening and closing span tags: http://php.net/manual/en/reference.pcre.pattern.modifiers.php. The ? after .* makes the match non-greedy, so that we stop at the very next closing span tag.
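To illustrate the two points above, here is a small sketch that runs the same pattern three ways: as written, with a greedy quantifier, and without the “s” modifier. The sample string and the variable names are made up for the demonstration; only the pattern comes from the comment.

```php
<?php
// Hypothetical sample input: two bold spans, the first one containing a line break.
$content = "<span class=\"c0 c3\">First\nname</span> and <span class=\"c0 c3\">Second</span>";

$pattern_lazy   = '#<span class="c0 c3">(.*?)</span>#s'; // pattern from the comment
$pattern_greedy = '#<span class="c0 c3">(.*)</span>#s';  // same, but greedy
$pattern_no_s   = '#<span class="c0 c3">(.*?)</span>#';  // same, but without "s"

preg_match_all( $pattern_lazy, $content, $lazy );
preg_match_all( $pattern_greedy, $content, $greedy );
preg_match_all( $pattern_no_s, $content, $no_s );

// Lazy + "s": each span is matched on its own.
echo count( $lazy[1] ), "\n";
// Greedy + "s": one match that runs from the first opening tag to the last closing tag.
echo count( $greedy[1] ), "\n";
// Lazy without "s": "." cannot cross the line break, so the first span is skipped.
echo count( $no_s[1] ), "\n";
```

With this input the lazy pattern should find both spans, the greedy one should swallow everything between the first opening tag and the last closing tag, and the version without “s” should miss the span that contains a line break.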

    • William P. Davis says:

      The problem is that the classes change name and there’s not always the same number of classes. So where I’m having trouble is finding a variable class name in a list of a variable number of classes.

      For example, it could be <span class="c0 c3">, or it could be <span class="c0">, or it could be <span class="c0 c3 c1">.

      Luckily spans are never nested.

      So maybe something like(?):
      $content = preg_replace( '#<span class="([^"]*)' . preg_quote( $boldclass, '#' ) . '([^"]*)">(.*?)</span>#s', '<span class="${1}' . $boldclass . '${2}"><strong>$3</strong></span>', $content );
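One way to handle a variable class list is to match any span, then check the class attribute inside a callback. The following is only a sketch under the assumptions stated in this thread: spans are never nested, and each span carries nothing but a double-quoted class attribute. The function name bold_spans and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: wrap the contents of every span whose class list
// contains $boldclass as a whole token. Assumes spans are never nested and
// that the opening tag has the exact form <span class="...">.
function bold_spans( $content, $boldclass ) {
    return preg_replace_callback(
        '#<span class="([^"]*)">(.*?)</span>#s',
        function ( $m ) use ( $boldclass ) {
            // Split the class attribute on whitespace and look for an exact
            // match, so that "c3" does not also match "c30".
            $classes = preg_split( '/\s+/', trim( $m[1] ) );
            if ( ! in_array( $boldclass, $classes, true ) ) {
                return $m[0]; // not a bold span: leave it untouched
            }
            return '<span class="' . $m[1] . '"><strong>' . $m[2] . '</strong></span>';
        },
        $content
    );
}

// Made-up sample: one bold span, one plain span, one with a look-alike class.
$content = '<span class="c0 c3">Jane Doe</span> said <span class="c0">hello</span> to <span class="c30 c1">nobody</span>';
echo bold_spans( $content, 'c3' ), "\n";
```

Comparing whole class tokens rather than substrings avoids treating c30 as if it were c3, which a plain substring match inside the attribute would do.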

  4. Rob Flaherty says:

    To clarify, something like:

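    // find() without an index returns an array of matching elements
    foreach ( $html->find( '.c3' ) as $elem ) {
        $elem->outertext = '<strong>' . $elem->outertext . '</strong>';
    }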
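For comparison, the same parser-based idea can be sketched with PHP’s bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. This is an illustration and not the approach from the comment above: the function name bold_spans_dom, the wrapper div and the sample string are all hypothetical, and the class name is assumed to be a simple token such as c3.

```php
<?php
// Hypothetical helper built on PHP's DOM extension. It moves the children of
// every span carrying $boldclass into a new <strong> element inside that span.
function bold_spans_dom( $html, $boldclass ) {
    $doc = new DOMDocument();
    libxml_use_internal_errors( true ); // keep parser warnings out of the output
    // Wrap the fragment in a known container; the XML prolog hints at UTF-8.
    $doc->loadHTML( '<?xml encoding="utf-8"?><div id="wrap">' . $html . '</div>' );
    libxml_clear_errors();

    $xpath = new DOMXPath( $doc );
    // Match the class as a whole token within the space-separated class list.
    $query = sprintf(
        '//span[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $boldclass
    );
    foreach ( $xpath->query( $query ) as $span ) {
        $strong = $doc->createElement( 'strong' );
        while ( $span->firstChild ) {
            $strong->appendChild( $span->firstChild ); // moves the node
        }
        $span->appendChild( $strong );
    }

    // Serialize only the children of the wrapper, not the whole document.
    $wrap = $xpath->query( '//div[@id="wrap"]' )->item( 0 );
    $out  = '';
    foreach ( $wrap->childNodes as $child ) {
        $out .= $doc->saveHTML( $child );
    }
    return $out;
}

// Made-up sample input.
$result = bold_spans_dom( '<span class="c0 c3">Jane Doe</span> said <span class="c0">hello</span>', 'c3' );
echo $result, "\n";
```

Because the tags are inserted inside the span here, the result keeps the span’s classes on the outside, which is the shape asked for in the post. A parser also copes with nested spans, which the regex versions do not.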

  5. Pingback: Updated Docs to WordPress plugin: Now with better formatting | BDN Development
