Extracting data (Parsing Rules)


Advanced mode: Repeat structures


In Advanced Mode, each rule is applied only once, so we need repeaters to get more than one result.

The Repeat structure works fairly simply. All rules between Repeat Begin and Repeat End are repeated until the text position no longer advances. In other words, the loop repeats until all rules fail. Some examples:

Repeat Begin
   Get Text between <a href=*> and </a> ...
Repeat End

This script harvests all URLs from the source document until no more URLs are found or, in other words, until the Get Text rule fails to find "<a href=*>". Now slightly more difficult:

Repeat Begin
   Get Text between <a href=*> and </a> ...
   Get Text between <img src="> and "> ...
Repeat End

This piece of script first finds an href; from there it searches for an image source (img src); then, from that position, it searches for an href again, and so on. Now suppose that at a certain point in the page there are more URLs to come, but no more images. The repeat loop continues nevertheless, because it stops only when all rules fail. Note that this piece of script will not necessarily harvest all hrefs and image sources: it jumps from href to img to href, and anything it passes over along the way is skipped.
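The jumping behaviour described above can be imitated outside the tool. What follows is a minimal Python sketch, not the tool's actual implementation: the helper names get_text_between and repeat are hypothetical, the wildcard in "<a href=*>" is approximated with a regular expression, the closing quote is assumed to end the image source, and the sample page is made up for illustration.

```python
import re

def get_text_between(text, pos, start_pattern, end_pattern):
    """Hypothetical stand-in for a Get Text rule.

    Returns (extracted_text, new_position), or (None, pos) when the rule fails.
    """
    start = re.compile(start_pattern).search(text, pos)
    if not start:
        return None, pos
    end = re.compile(end_pattern).search(text, start.end())
    if not end:
        return None, pos
    return text[start.end():end.start()], end.end()

def repeat(text, rules):
    """Apply all rules in order, again and again, until the position stops advancing."""
    pos, found = 0, []
    while True:
        before = pos
        for name, start_pattern, end_pattern in rules:
            value, pos = get_text_between(text, pos, start_pattern, end_pattern)
            if value is not None:
                found.append((name, value))
        if pos == before:  # every rule failed in this pass, so the loop ends
            break
    return found

# Made-up sample page: two images sit between the first and second link.
page = ('<a href="1.html">one</a><img src="a.png">'
        '<img src="b.png"><a href="2.html">two</a>'
        '<a href="3.html">three</a>')

rules = [
    ("link", r'<a href=[^>]*>', r'</a>'),   # assumed reading of <a href=*> ... </a>
    ("image", r'<img src="', r'"'),         # assumed: the closing quote ends the source
]

results = repeat(page, rules)
for name, value in results:
    print(name, value)
```

In this sketch the second image (b.png) is never collected, because the search for the next link jumps past it. The loop also keeps running after the images run out, since the link rule still advances the position.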

You will have noticed that Get Text rules place extracted text nicely in the right column. By setting the column name, you can influence where to put the data you are after. Controlling rows works differently. Each Repeat End rule makes the data grid go to the next row. The consequence is that you should place all the info you want in one row (but in different columns) within a single repeat statement. If you instead do a Get Text twice to the same column before the Repeat End is reached, the second result overwrites the first.
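The row and column behaviour can be pictured with a small model. This is only an illustrative Python sketch under assumptions: the DataGrid class and its method names are hypothetical and not part of the tool, and the column names "url" and "title" are made up.

```python
class DataGrid:
    """Hypothetical model of the data grid: named columns, one row per Repeat End."""

    def __init__(self):
        self.rows = []
        self.current = {}

    def get_text(self, column, value):
        # Writing to a column that already holds a value in this row overwrites it.
        self.current[column] = value

    def repeat_end(self):
        # Repeat End closes the current row and moves on to the next one.
        self.rows.append(self.current)
        self.current = {}

# Two different columns inside one repeat: each pass yields one complete row.
grid = DataGrid()
for url, title in [("1.html", "one"), ("2.html", "two")]:
    grid.get_text("url", url)
    grid.get_text("title", title)
    grid.repeat_end()

# The same column written twice before Repeat End: only the last value survives.
overwritten = DataGrid()
overwritten.get_text("url", "1.html")
overwritten.get_text("url", "2.html")
overwritten.repeat_end()

print(grid.rows)
print(overwritten.rows)
```

The first grid ends up with two rows of two columns each. The second ends up with a single row in which "1.html" has been lost.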
 
Repeat statements can be nested. You can place a Repeat within a Repeat like this:

Repeat Begin
   Repeat Begin

   Repeat End
Repeat End 
