Extracting data (Parsing Rules)
Advanced mode: Repeat structures
In Advanced Mode, rules are applied only once, so we need repeaters that enable us to get more than one result.
The Repeat structure is fairly simple. All rules between Repeat Begin and Repeat End are repeated until the text position no longer advances. In other words, the repeat loop repeats until all rules fail. Some examples:
Repeat Begin
Get Text between <a href=*> and </a> ...
Repeat End
This script will harvest all URLs from the source document until no more URLs are found or, in other words, until the Get Text rule fails to find "<a href=*>". Now slightly more difficult:
Repeat Begin
Get Text between <a href=*> and </a> ...
Get Text between <img src="> and "> ...
Repeat End
This piece of script first finds an href; from there it searches for an image source (img src); then, from that position, it searches for an href again, and so on. Now suppose that at a certain point in the page there are more URLs to come, but no more images. The repeat loop will continue nevertheless, because the href rule still advances the text position. Note that this piece of script will not necessarily harvest all hrefs and image sources: it jumps from href to img to href, etc., so anything it passes over along the way is skipped.
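The loop behaviour described above can be modelled in a few lines of Python. This is only a sketch of the idea, not the program's actual implementation: the functions `get_text_between` and `repeat`, the sample page, and the use of a regular expression in place of the `*` wildcard are all assumptions made for illustration.

```python
import re


def get_text_between(text, pos, start, end):
    """Hypothetical model of a Get Text rule.

    Searches forward from pos for the regex `start`, then for the literal
    `end`. Returns (extracted_text, new_pos), or (None, pos) if the rule fails.
    """
    match = re.compile(start).search(text, pos)
    if match is None:
        return None, pos
    stop = text.find(end, match.end())
    if stop == -1:
        return None, pos
    return text[match.end():stop], stop + len(end)


def repeat(text, rules):
    """Apply all rules in order, again and again, until the position stops advancing."""
    pos = 0
    results = []
    while True:
        before = pos
        for start, end in rules:
            value, pos = get_text_between(text, pos, start, end)
            if value is not None:
                results.append(value)
        if pos == before:  # no rule advanced the position: all rules failed
            break
    return results


# Made-up sample page: two images sit between the first and second link.
page = (
    '<a href="a.html">A</a> <img src="1.png"> <img src="2.png"> '
    '<a href="b.html">B</a> <a href="c.html">C</a>'
)
href_rule = (r'<a href=[^>]*>', '</a>')  # stands in for <a href=*>
img_rule = (r'<img src="', '">')

print(repeat(page, [href_rule]))            # ['A', 'B', 'C']
print(repeat(page, [href_rule, img_rule]))  # ['A', '1.png', 'B', 'C']
```

In the second call, the loop keeps running after the images run out because the href rule still advances the position, and `2.png` is never collected because the search jumps from the first image straight to the next href.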
You will have noticed that Get Text rules place extracted text neatly in the right column. By setting the column name, you can control where the data you are after is put. Controlling rows works differently. Each Repeat End rule moves the data grid to the next row. Consequently, you should place all the info you want in one row (but in different columns) within a single repeat statement. If you do not, and a Get Text rule writes to the same column twice before the next Repeat End, the first result is overwritten.
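The row and column behaviour can be pictured with a small Python model. This is a sketch under stated assumptions, not the program's real data grid: the `DataGrid` class, its method names, the column names, and the sample data are hypothetical, and it assumes that an empty row is not stored.

```python
class DataGrid:
    """Hypothetical model of the data grid: named columns, one row at a time."""

    def __init__(self):
        self.rows = []      # finished rows
        self.current = {}   # row being filled: column name -> value

    def get_text(self, column, value):
        # Writing to a column that already holds a value replaces that value.
        self.current[column] = value

    def repeat_end(self):
        # Repeat End moves on to the next row.
        # Assumption: an empty row is not stored.
        if self.current:
            self.rows.append(self.current)
        self.current = {}


links = [("Home", "home.png"), ("News", "news.png")]

# All data for one row is collected inside the repeat, in different columns.
good = DataGrid()
for title, image in links:
    good.get_text("Title", title)
    good.get_text("Image", image)
    good.repeat_end()

# Two writes to the same column before Repeat End: the first value is lost.
bad = DataGrid()
for title, image in links:
    bad.get_text("Title", title)
    bad.get_text("Title", image)
    bad.repeat_end()

print(good.rows)
print(bad.rows)
```

The first grid ends up with one row per link holding both a title and an image, while the second keeps only the last value written to the "Title" column in each row.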
Repeat statements can be nested. You can
place a Repeat within a Repeat like
this:
Repeat Begin
  Repeat Begin
  Repeat End
Repeat End