Extracting data (Parsing Rules)
Advanced mode
Introduction
The first parsing rule is applied from the beginning of the page and, depending on the rule, the position in the page after the rule is applied is remembered. Each subsequent rule is then applied from that remembered position. This enables you to do complex data extraction. We will start with a simple example.

In HH, open the profile "example13.hhp" from the examples directory. Do a preview to get an idea of the page source. You will see one Get Text rule, similar to the one we saw in Normal Mode. Then press Start Harvesting. You will see one country name appear. Why only one? In Normal Mode, HH produces all strings found between [text before target] and [text after target]. In Advanced Mode, it extracts only one string and remembers the position after [text after target]. If you tell HH to repeat this action, it will harvest all countries. Open profile "example14.hhp" to do this. Press Start Harvesting again and you will see all countries.
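
The difference between applying a Get Text rule once and repeating it can be sketched in a few lines of Python. This is only an illustration of the idea of a remembered page position, not HH's actual implementation: the function name `get_text`, its parameters, and the sample page are all assumptions made up for this example.

```python
def get_text(page, pos, before, after):
    """Hypothetical Get Text rule: extract one string located between
    `before` and `after`, searching from `pos`.
    Returns (text, new_pos), or (None, pos) when nothing is found."""
    start = page.find(before, pos)
    if start == -1:
        return None, pos
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None, pos
    # The remembered position is just past [text after target].
    return page[start:end], end + len(after)


# Invented sample page; the real example pages will look different.
page = "<li>Albania</li><li>Austria</li><li>Belgium</li>"

# Rule applied once (as in example13.hhp): a single result.
first, first_pos = get_text(page, 0, "<li>", "</li>")

# Rule repeated (as in example14.hhp): every result, because each
# pass continues from the position left by the previous one.
countries = []
pos = 0
while True:
    text, pos = get_text(page, pos, "<li>", "</li>")
    if text is None:
        break
    countries.append(text)

print(first)      # Albania
print(countries)  # ['Albania', 'Austria', 'Belgium']
```

The repeat simply stops once the rule no longer finds a match after the current position.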

Now suppose we want to harvest all countries except those starting with an A. We can use the Skip To Text rule to accomplish this. Open profile "example15.hhp". The first rule skips to "Austria". The text position is now placed after this country. The following rule in the repeat statement will now continue from this position, and you will have skipped all countries starting with an A.
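
As a rough sketch of how a skip followed by a repeated extraction might behave, the Python below moves the position past "Austria" before the loop starts. The helpers `skip_to_text` and `get_text` are hypothetical stand-ins for the HH rules, and the sample page is invented; the real tool may behave differently in edge cases such as text that is not found.

```python
def skip_to_text(page, pos, text):
    """Hypothetical Skip To Text rule: return the position just after
    the next occurrence of `text`, or `pos` unchanged if not found."""
    found = page.find(text, pos)
    return pos if found == -1 else found + len(text)


def get_text(page, pos, before, after):
    """Hypothetical Get Text rule: return (text, new_pos), or
    (None, pos) when nothing is found after `pos`."""
    start = page.find(before, pos)
    if start == -1:
        return None, pos
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None, pos
    return page[start:end], end + len(after)


# Invented sample page with three countries starting with an A.
page = ("<li>Albania</li><li>Andorra</li><li>Austria</li>"
        "<li>Belgium</li><li>Bulgaria</li>")

# First rule: skip to "Austria"; the position is now after that name.
pos = skip_to_text(page, 0, "Austria")

# Repeated rule: continues from the skipped-to position.
countries = []
while True:
    text, pos = get_text(page, pos, "<li>", "</li>")
    if text is None:
        break
    countries.append(text)

print(countries)  # ['Belgium', 'Bulgaria']
```

Note that in this sketch the skip works only because "Austria" is the last entry starting with an A in the sample list.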

Skip To Text is one way to control what information you want to harvest and what not. For example, use Skip To Text "<body" to start harvesting from the BODY part of the HTML, skipping the HEAD. Or use Skip To Text "<table" twice to skip the first two tables and go to the third. This way you can harvest the third table only. These kinds of mechanisms are what make the Advanced Mode extremely powerful, and there is more to come.
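
To illustrate why skipping to "<body" can matter, the following Python sketch compares an extraction with and without the skip on a page whose HEAD happens to contain matching text. The helper names and the sample page are assumptions for illustration only and do not reflect how HH is implemented.

```python
def skip_to_text(page, pos, text):
    """Hypothetical Skip To Text rule: position just after `text`,
    or `pos` unchanged when `text` does not occur after `pos`."""
    found = page.find(text, pos)
    return pos if found == -1 else found + len(text)


def get_text(page, pos, before, after):
    """Hypothetical Get Text rule: return (text, new_pos), or
    (None, pos) when nothing is found after `pos`."""
    start = page.find(before, pos)
    if start == -1:
        return None, pos
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None, pos
    return page[start:end], end + len(after)


# Invented page: a script in the HEAD contains the same markers
# that surround the data in the BODY.
page = ('<html><head><script>var tpl = "<p>template</p>";</script></head>'
        '<body><p>Belgium</p></body></html>')

# Without a skip, the first match comes from the HEAD.
without_skip, _ = get_text(page, 0, "<p>", "</p>")

# With a skip to "<body", the search starts inside the BODY.
pos = skip_to_text(page, 0, "<body")
with_skip, _ = get_text(page, pos, "<p>", "</p>")

print(without_skip)  # template
print(with_skip)     # Belgium
```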