Regular Expressions and Flare Markup (Danger)

Danger ahead…

This post is not a recommendation to use regular expressions for automation that involves parsing XML and XHTML. Unless you are watching the action or verifying the results later, using regular expressions to parse XML and XHTML is dangerous. The subject is discussed extensively on developer forums. Computer science curricula that cover formal language theory may even include proofs that parsing XML and XHTML with traditional regular expressions is impossible.

There is a particularly colorful post about the subject by Jeff Atwood called “Parsing Html The Cthulhu Way.”

Flare has a find and replace feature which supports wildcards and regular expressions. The feature includes an option to search source code. In other words, you can search the text of a topic or the underlying markup. Flare’s find and replace works well for project-specific needs. Analyzer includes more advanced features to find items. This is a better option if your goal is to change markup. But the functionality is also project-specific. Since Flare artifacts are text files, non-Flare text editors can be used to edit Flare artifacts. And many editors support regular expressions.

If you maintain content in a small number of projects, find and replace in Flare may be sufficient. But if you use many projects to maintain your content, an external find and replace may come in handy. When you step outside of Flare, find and replace is limited to the XML and XHTML of the Flare artifacts. When working within Flare, find and replace works within the context of the XML (semi-WYSIWYG) editor as well as in the context of the text editor (true XML).

Many programs which support text find and replace also support scripting. Visual Studio includes find and replace functionality which can be scripted through Visual Studio macros. The .NET Framework has its own regular expression flavor, distinct from Visual Studio's. Many scripting languages and IDEs support find and replace. Most have their own variation of syntax and functionality.

XML and XHTML are text. Regular expressions parse text. It is tempting to use regular expressions to parse XML and XHTML. But when it comes to XHTML and XML, regular expressions have limitations. In computer science terms, traditional regular expressions describe a regular language. XHTML and XML are not regular languages. A regular expression cannot fully parse a language that is not regular. In particular, regular expressions have difficulties with nested tags.
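Here is a minimal sketch of the nesting problem. It uses Python's re module as a stand-in, which is an assumption on my part; the examples later in this post use Visual Studio, whose syntax differs. The sample markup is hypothetical.

```python
import re

# Hypothetical markup: one div nested inside another.
nested = "<div>outer <div>inner</div> tail</div>"

# Hypothetical markup: two sibling divs.
siblings = "<div>one</div><div>two</div>"

# Non-greedy: the match ends at the first closing tag it meets,
# which belongs to the inner div, not the outer one.
non_greedy = re.search(r"<div>.*?</div>", nested).group(0)

# Greedy: the match ends at the last closing tag on the line,
# which swallows both siblings as if they were one element.
greedy = re.search(r"<div>.*</div>", siblings).group(0)

print(non_greedy)  # <div>outer <div>inner</div>
print(greedy)      # <div>one</div><div>two</div>
```

Neither form counts how many elements are open, so neither can reliably pair an opening tag with its own closing tag.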

Many languages called regular expressions exceed the scope of traditional regular expressions. For example, some languages called regular expressions support recursion and backtracking. This can be exploited to overcome issues with nested tags.

Since Flare projects are openly maintained in XML, XHTML, and CSS text files, there are patterns in Flare artifacts which can be exploited through regular expressions. For XML and XHTML artifacts, the formatting for markup is similar. There is an opening and closing tag which wraps content.

The obstacle to changing elements in markup such as XML is that both the opening tag and the corresponding closing tag must change. The natural desire is to match the content of the element in such a way that the end of the content is also identified. Another obstacle is that implementations of regular expressions may treat line ends differently from other characters. Any given element may contain more than one line end, so that must also be considered.

The issue with line ends can be handled. Some issues with nested tags can be handled. But as an overall solution for XML and XHTML parsing, regular expressions are not appropriate. Here is a demonstration as to why.

Let’s begin with a simple regular expression to find the opening tag for a p element. Assume we have a document which contains this element in the source: <p>Example</p>. In the Find field of a find and replace feature, enter <p>. With Visual Studio, if you attempt to find with this value, with no options selected, the opening p tag will be found and highlighted. If Use: Regular expressions is selected, the opening p tag will not be found. The less than and greater than symbols must be escaped when regular expressions are used. To escape a character in Visual Studio regular expressions, precede it with \. So adjust the find field to:

 \<p\>

With Use: Regular expressions selected, <p> will be found and highlighted. But if Use: Regular expressions is not selected, a message will appear which says:

The following specified text was not found: \<p\>

So far, we have no problems. We sought to find the opening tag for a p element with no attributes and no whitespace in the tag. In order to find the rest of the content in the opening tag’s first line, we can adjust the regular expression to:

\<p\>.*

This expression uses . to represent any character other than line breaks and * to represent unlimited repetitions of the preceding item. The effect is to return everything until there is a line break. With Use: Regular expressions selected, the entire line is found:

<p>Example</p>
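The same single-line find can be sketched in Python. This assumes Python's re module rather than Visual Studio, so the angle brackets do not need to be escaped.

```python
import re

line = "<p>Example</p>"

# "." matches any character except a line break; "*" repeats it.
match = re.search(r"<p>.*", line)

print(match.group(0))  # <p>Example</p>
```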

If the goal is just to find content starting with a particular tag which spans one line, that technique is sufficient. But there are two more considerations. Firstly, line breaks are still an issue. Let’s look at an element with a line break:

<p>Example
</p>

The same find with Use: Regular expressions would find:

<p>Example
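A short Python sketch of the same result, again assuming the re module as a stand-in for Visual Studio:

```python
import re

# The element now spans two lines.
source = "<p>Example\n</p>"

# "." does not match the line break, so the match ends with the first line.
match = re.search(r"<p>.*", source)

print(repr(match.group(0)))  # '<p>Example'
```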

The line break excluded everything after the first line from the find. In Visual Studio regular expressions, a line break is represented by: \n

and “or” is represented with: |

Changing the find to:

\<p\>(.|\n)*

will return everything from the opening p tag onward. The expression does not stop until the end of the file. To end the match at an earlier point, we can exclude a character. Visual Studio regular expressions indicate any one character not in a set with:

[^…], where the ellipsis is the set of characters

We can use this to specify a stopping point for the find this way:

\<p\>([^<]|\n)*

Now the find will return everything from the opening p tag up to the next < character, which is the start of the next tag of any kind, opening or closing. And herein lies the problem. Elements can be nested in elements. For example, the p element may contain a span element or an a element, and the find stops at that nested tag rather than at the closing p tag. You could continue down this road for a while. But the more you adapt (complicate) a regular expression to handle situations, the less elegant the solution becomes.
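The last two expressions can be compared side by side in Python. This is only a sketch: the topic source below is hypothetical, and Python's re module stands in for Visual Studio.

```python
import re

# Hypothetical topic source: the first p element contains a nested span.
source = (
    "<p>Some <span>nested</span> text\n"
    "</p>\n"
    "<p>Second paragraph</p>\n"
)

# (.|\n)* accepts every character, so the match runs to the end of the text.
to_end = re.search(r"<p>(.|\n)*", source)

# [^<] rejects "<", so the match stops just before the next tag of any kind.
to_next_tag = re.search(r"<p>([^<]|\n)*", source)

print(to_end.group(0) == source)   # True
print(repr(to_next_tag.group(0)))  # '<p>Some '
```

The first expression finds too much and the second finds too little. Neither one lands on the closing p tag.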

In short, regular expressions work well for some plain text problems. With markup, regular expressions work well for innermost elements, which are elements which contain no other elements. Regular expressions also work well for changing attributes for every kind of element. For example, name=” for p, pre, b, and every other element. But for true XML and XHTML parsing, regular expressions are not reliable.
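As a rough sketch of the two cases where regular expressions hold up, here is a Python example. The markup and the class values are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical markup; the class values are illustrative only.
source = (
    '<p class="note">One</p>'
    '<pre class="note">Two</pre>'
    '<b class="note">Three</b>'
)

# Attribute change: the same attribute value is rewritten on every element.
# Caution: this would also rewrite the same string inside body text.
updated = re.sub(r'class="note"', 'class="tip"', source)

# Innermost elements: b elements whose content contains no other tags.
# \b keeps the pattern from matching longer names such as "body".
innermost = re.findall(r"<b\b[^>]*>[^<]*</b>", source)

print(updated)
print(innermost)  # ['<b class="note">Three</b>']
```

Both patterns avoid the pairing problem: the first never looks at tags as pairs, and the second refuses any content that contains another tag.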

3 comments

  1. Since we’re talking about scripting/programming here, a much better option is using one of the many HTML parsing libraries out there. One for Ruby that I find very useful is nokogiri. Add this gem to your script, specify what you need, and you have the content free of HTML markup. I find this a much better (and foolproof) way to get and modify HTML content (also works with XML content).

    1. Good point and thanks for the tip on the Ruby resource. I hope to explore several parsing libraries with this blog. Nokogiri looks very promising. But I felt a post about regular expressions and markup is important early in the blog because a form of regular expressions is exposed in the Flare UI and it is very tempting to go down that road when one doesn’t know the pitfalls.

  2. I’d like to know more about pulling out individual topics from Flare output and automatically including them on another web page that I’m building. Is that possible?
