One of the most annoying things about using Google Docs is that none of the styles are inline. It used to be that bold text was wrapped in a <strong> tag and italic text was wrapped in an <em> tag. No longer. Now each style of text is wrapped in a span with a number of different classes applied to it. Those styles don’t carry through when we bring the text into WordPress, and the names of the classes vary from article to article. This can be very annoying for columnists who bold names of subjects, for example.
So, what I’m looking for is a regular expression to turn <span class="c0 c3">My text</span> into <span class="c0 c3"><strong>My text</strong></span> where class c3 is the bold class, for example.
I assume you’re doing this in PHP… Did you consider using something like Simple HTML DOM instead of a regex? Might come in handy later down the line if you have to transpose more cases.
http://simplehtmldom.sourceforge.net/
I’m bad at regex, too. But, could you use something like phpQuery for this? http://code.google.com/p/phpquery/
That is, using jQuery-like syntax, look for the span.c3 and then insert the <strong> tags appropriately?
As Rob mentions, Simple HTML DOM is great for this too. Similar to phpQuery.
Parsing HTML with regular expressions is really difficult (and Wrong), especially in this case where you then need to do things like matching tags that could potentially be nested.
If you ignore this situation:
<span class="c0 c3">My <span>text</span>is bold </span>
then the regular expression is simple:
$content = preg_replace( '#<span class="c0 c3">(.*?)</span>#s', '<span class="c0 c3"><strong>$1</strong></span>', $content );
Note the “s” pattern modifier, which makes “.” match line breaks between the opening and closing span tags: http://php.net/manual/en/reference.pcre.pattern.modifiers.php. The ? makes the match non-greedy, so it stops at the very next closing span tag.
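To see what the "s" modifier and the non-greedy ? each do, here is a minimal sketch. The sample string and variable names are made up for illustration, and it assumes the class attribute is exactly "c0 c3" with no nested spans.

```php
<?php
// Made-up sample input: the first bold span breaks across two lines.
$content = "<span class=\"c0 c3\">My\ntext</span> and <span class=\"c0 c3\">more</span>";

$pattern     = '#<span class="c0 c3">(.*?)</span>#';
$replacement = '<span class="c0 c3"><strong>$1</strong></span>';

// Without "s", "." cannot cross the newline, so the first span is skipped
// and only the second one is wrapped.
$withoutS = preg_replace( $pattern, $replacement, $content );

// With "s", both spans are wrapped. The lazy (.*?) keeps each match
// inside its own span.
$withS = preg_replace( $pattern . 's', $replacement, $content );

// A greedy (.*) runs from the first opening tag to the last closing tag,
// so everything ends up inside a single <strong>.
$greedy = preg_replace( '#<span class="c0 c3">(.*)</span>#s', $replacement, $content );
```

Comparing $withoutS, $withS and $greedy side by side shows why the original one-liner needs both pieces.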
The problem is that the classes change name and there’s not always the same number of classes. So where I’m having trouble is finding a variable class name in a list of a variable number of classes.
For example, it could be <span class="c0 c3">, or it could be <span class="c0">, or it could be <span class="c0 c3 c1">.
Luckily spans are never nested.
So maybe something like(?):
$content = preg_replace( '#<span class="([^"]*)\b' . preg_quote( $boldclass, '#' ) . '\b([^"]*)">(.*?)</span>#s', '<span class="${1}' . $boldclass . '${2}"><strong>$3</strong></span>', $content );
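If the single pattern gets hard to read, one alternative is to match every span and check the class list in a callback. This is only a sketch: wrap_bold_spans() is a hypothetical helper name, and it assumes double-quoted class attributes, spans with no other attributes, no nested spans, and that you already know which class is the bold one.

```php
<?php
/**
 * Wrap the contents of every span whose class list contains $boldclass
 * in <strong>. Hypothetical helper, not part of WordPress or any library.
 */
function wrap_bold_spans( $content, $boldclass ) {
    return preg_replace_callback(
        '#<span class="([^"]*)">(.*?)</span>#s',
        function ( $m ) use ( $boldclass ) {
            // Split the class list on whitespace and compare whole names,
            // so "c3" does not accidentally match "c31".
            $classes = preg_split( '/\s+/', trim( $m[1] ) );
            if ( ! in_array( $boldclass, $classes, true ) ) {
                return $m[0]; // Not a bold span: leave it untouched.
            }
            return '<span class="' . $m[1] . '"><strong>' . $m[2] . '</strong></span>';
        },
        $content
    );
}

// Made-up sample covering one, two and three classes.
$content = '<span class="c0">plain</span> <span class="c0 c3 c1">Jane Doe</span> <span class="c31">also plain</span>';
echo wrap_bold_spans( $content, 'c3' );
```

The callback keeps the class check in plain PHP, so it works the same way wherever the bold class sits in the list.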
To clarify, something like:
foreach ( $html->find('.c3') as $elem ) {
    $elem->outertext = '<strong>' . $elem->outertext . '</strong>';
}
The comment form kept stripping the tags, so the snippet is also posted as a gist: https://gist.github.com/1183666
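If you would rather not add a library, the same DOM idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM or phpQuery. Treat this as a rough sketch under assumptions: wrap_bold_spans_dom() and the "wrap" container id are made-up names, the input is taken to be a UTF-8 HTML fragment, and $boldclass is assumed to be a simple class name like c3.

```php
<?php
// Hypothetical helper built on PHP's bundled DOM extension.
function wrap_bold_spans_dom( $html, $boldclass ) {
    $doc = new DOMDocument();
    // Keep libxml from emitting warnings about imperfect markup.
    libxml_use_internal_errors( true );
    // Wrap the fragment in a known container so it can be pulled back out;
    // the XML prolog is a common way to hint that the input is UTF-8.
    $doc->loadHTML( '<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>' );
    libxml_clear_errors();

    $xpath = new DOMXPath( $doc );
    // XPath idiom for "the class attribute contains this whole name".
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " ' . $boldclass . ' ")]';

    foreach ( $xpath->query( $query ) as $span ) {
        $strong = $doc->createElement( 'strong' );
        // Move every child of the span into the new <strong> element.
        while ( $span->firstChild ) {
            $strong->appendChild( $span->firstChild );
        }
        $span->appendChild( $strong );
    }

    // Serialize only the children of the container, not the whole document.
    $wrap = $xpath->query( '//div[@id="wrap"]' )->item( 0 );
    $out  = '';
    foreach ( $wrap->childNodes as $child ) {
        $out .= $doc->saveHTML( $child );
    }
    return $out;
}

echo wrap_bold_spans_dom( '<span class="c0 c3">My <em>text</em></span> <span class="c0">plain</span>', 'c3' );
```

Unlike the regex versions, this puts <strong> inside the span even when the span contains other tags, at the cost of re-serializing the markup.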
Thanks!