Extracting data (Parsing Rules)
Advanced mode: Text Block structures
The text block structure lets you single out a piece of source text for
further processing. Any rules you apply run only within the text that the
text block delimits. Here is a text block definition:
Text Block Begin
    Text to start block: <tr. Text to end block: </tr>
    Get Text between <b> and </b> ...
Text Block End
It singles out the text of a table row (tr) within the page (between
<tr and </tr>). The Get Text rule is applied within this text block,
so it harvests the first bold text within the first tr found. After
Text Block End, the page position is placed after the text to end block,
in this case after "</tr>".
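To make the behaviour concrete, here is a minimal Python sketch of what a single text block does. It is not HH's implementation: the helper names text_block and get_text_between and the sample page are made up for illustration, and plain substring search stands in for HH's matching.

```python
def text_block(page, start, end, pos=0):
    """Return (block, new_pos): the text from `start` through `end`,
    and the page position just after `end`. (None, pos) if not found."""
    begin = page.find(start, pos)
    if begin == -1:
        return None, pos
    stop = page.find(end, begin + len(start))
    if stop == -1:
        return None, pos
    new_pos = stop + len(end)
    return page[begin:new_pos], new_pos


def get_text_between(text, left, right):
    """Return the first piece of text between `left` and `right`, or None."""
    i = text.find(left)
    if i == -1:
        return None
    i += len(left)
    j = text.find(right, i)
    if j == -1:
        return None
    return text[i:j]


# Hypothetical sample page, not taken from the HH examples.
page = (
    "<table>"
    "<tr><td><b>France</b></td><td>Paris</td></tr>"
    "<tr><td><b>Spain</b></td><td>Madrid</td></tr>"
    "</table>"
)

# The rule inside the block only sees the first table row.
block, pos = text_block(page, "<tr", "</tr>")
first_bold = get_text_between(block, "<b>", "</b>")
print(first_bold)        # France
print(page[pos:pos + 4]) # <tr>  (position is now just after the first </tr>)
```

The second print shows the point made above: once the block ends, processing continues after the end text, so the next search starts at the second row.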
We can make things more interesting and extract the first table cell (td)
from all table rows on the page. We would use:
Repeat Begin
    Text Block Begin
        Text to start block: <tr. Text to end block: </tr>
        Get Text between <td*> and </td> ...
    Text Block End
Repeat End
Now HH singles out each occurrence of a table row in turn (text between
<tr and </tr>), and within that text it finds the first table cell (td).
Please open "example16.hhp" and give it a run. You will see that it only
harvests the country name.
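The repeat-plus-text-block combination can be sketched in Python as a loop over rows that takes only the first cell of each. This is an illustration under assumptions, not the contents of example16.hhp: the function name first_cell_per_row and the sample page are invented, and the regular expression <td.*?> stands in for the HH pattern "<td*>".

```python
import re

# Python regex standing in for "Get Text between <td*> and </td>".
FIRST_CELL = re.compile(r"<td.*?>(.*?)</td>", re.S)


def first_cell_per_row(page):
    """Collect the first td of every tr, one text block per loop pass."""
    results = []
    pos = 0
    while True:
        begin = page.find("<tr", pos)       # text to start block
        if begin == -1:
            break
        stop = page.find("</tr>", begin)    # text to end block
        if stop == -1:
            break
        end_of_block = stop + len("</tr>")
        block = page[begin:end_of_block]
        match = FIRST_CELL.search(block)    # rule runs inside the block only
        if match:
            results.append(match.group(1))
        pos = end_of_block                  # continue after "</tr>"
    return results


# Hypothetical sample page.
page = (
    "<table>"
    "<tr><td valign=top>France</td><td>Paris</td></tr>"
    "<tr><td>Spain</td><td>Madrid</td></tr>"
    "</table>"
)

countries = first_cell_per_row(page)
print(countries)  # ['France', 'Spain']
```

Because the search for a cell is restarted inside each new block, the second cell of a row is never reached.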
There is an important difference between a script with or without such a
text block. If you were to remove the Text Block definition in the above
script:
Repeat Begin
    Get Text between <td*> and </td> ...
Repeat End
and run this in HH (go ahead, remove the text block rules from the
script!), it would give the contents of every table cell (td)! You simply
get all data in the page between <td> and </td>, not just the country
(= first TD within a TR).
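For comparison, a Python sketch of the script without the text block: the same cell pattern is applied repeatedly to the whole page, so nothing limits it to the first cell of a row. The sample page is invented, and the regular expression again stands in for "<td*>".

```python
import re

# Hypothetical sample page.
page = (
    "<table>"
    "<tr><td valign=top>France</td><td>Paris</td></tr>"
    "<tr><td>Spain</td><td>Madrid</td></tr>"
    "</table>"
)

# Repeating the rule over the whole page returns every cell.
all_cells = re.findall(r"<td.*?>(.*?)</td>", page, re.S)
print(all_cells)  # ['France', 'Paris', 'Spain', 'Madrid']
```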
Note the use of the "*" in "<td*>", which lets you catch not only "<td>"
but "<td valign=top>" as well.
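One way to picture the "*" is as "any run of characters". The Python sketch below translates such a pattern into a regular expression on that assumption; HH's exact wildcard rules are not described here, and wildcard_to_regex is a made-up helper name.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern with "*" wildcards into a regex string.
    Assumption: "*" matches any run of characters, as short as possible;
    every other character is matched literally."""
    return ".*?".join(re.escape(part) for part in pattern.split("*"))


td_open = re.compile(wildcard_to_regex("<td*>"))

for tag in ["<td>", "<td valign=top>", "<tr>"]:
    print(tag, bool(td_open.fullmatch(tag)))
```

The first two tags match and "<tr>" does not, which is the behaviour the "<td*>" rule relies on.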
Text blocks cannot be nested, so use them effectively.