Extracting data (Parsing Rules)
General Tips and Tricks
The use of the asterisk (*)
You can use the * wildcard in Normal Mode (Get Text, Get Pictures rules) as well as in Advanced Mode (Get Text, Skip To Text, Get Text Until, Text Block rules). An asterisk matches zero or more characters. For example, <a href=*> matches <a href="link.html">, with the * matching "link.html".
You can also use more than one asterisk. For example, <a href=*>*</a> matches <a href="link.html">My Link</a>, with the first * matching "link.html" and the second one matching My Link.
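The wildcard behaviour described above can be approximated outside the program with a short Python sketch. This is not the tool's own implementation: the function name wildcard_match is hypothetical, and it assumes that each * matches as few characters as possible, which the text above does not state.

```python
import re

def wildcard_match(pattern, text):
    """Return the text matched by each * in pattern, or None if there is no match.

    Assumption: every * matches zero or more characters, as few as possible.
    """
    # Escape the literal pieces so characters such as "." are taken literally,
    # then join them with a lazy capture group standing in for each *.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    regex = "(.*?)".join(literal_parts)
    match = re.fullmatch(regex, text, flags=re.DOTALL)
    return list(match.groups()) if match else None

print(wildcard_match('<a href=*>', '<a href="link.html">'))
print(wildcard_match('<a href=*>*</a>', '<a href="link.html">My Link</a>'))
```

The first call reports what the single * matched, and the second reports one entry per asterisk, in order.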
The asterisk is used frequently in the Get Text rule. Suppose you have the following bit of HTML:
<td valign=top><font color=blue>first text to extract</font></td>
<td valign=left><b>second text to extract</b></td>
and you want to extract "first text to extract" and "second text to extract". You can achieve this with the following rule:
Get Text between <td *><*> and </*></td>
This extracts the text regardless of what is put in the td definition and regardless of the font or b definitions (or any other definition, for that matter).
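To see why this rule picks up both cells, the same idea can be sketched in Python with regular expressions. The helper names below are hypothetical and the sketch only imitates the rule under the assumption that each * matches as few characters as possible; it is not how the program itself is implemented.

```python
import re

def _wildcard_to_regex(pattern):
    # Literal pieces are escaped; each * becomes a lazy "anything" match.
    return ".*?".join(re.escape(part) for part in pattern.split("*"))

def get_text_between(start, end, html):
    """Imitate 'Get Text between <start> and <end>' for every occurrence in html."""
    regex = _wildcard_to_regex(start) + "(.*?)" + _wildcard_to_regex(end)
    return re.findall(regex, html, flags=re.DOTALL)

html = (
    "<td valign=top><font color=blue>first text to extract</font></td>\n"
    "<td valign=left><b>second text to extract</b></td>"
)

for text in get_text_between("<td *><*>", "</*></td>", html):
    print(text)
```

Because the wildcards absorb the td attributes and the inner tag name, one pattern covers both the font cell and the b cell.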
Debugging
If, while building a script, you want to see a slow-motion replay of how it harvests, go to General Settings and mark the option Indicate current row while harvesting. Rules will now be highlighted as they execute, making it easier to see what happens. Make sure you unmark the option before a real harvest, as it tends to slow a harvest considerably.