Extracting data (Parsing Rules)


Introduction


After you have set up the Data Sources, the next step is to extract data. HH provides two ways of doing that, and both use "rules" to extract data. Each rule is covered separately later; first, here is how the two ways differ from each other.

HH generates pages of text from the settings in Data Sources. These pages are presented to the parsing rules one at a time: the rules are applied first to page1.html, then to page2.html, and so on, until there are no more pages or files to harvest. The parsing rules extract data from each page in turn.
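The page-by-page loop described above can be pictured with a small Python sketch. This is an illustration only, not HH's own code: the page contents, the rule names, the use of regular expressions as rules, and the harvest function are all assumptions made for the example.

```python
import re

# Hypothetical pages, standing in for the text HH builds from the Data Sources.
pages = {
    "page1.html": "<h1>Apples</h1><p>Price: 3</p>",
    "page2.html": "<h1>Pears</h1><p>Price: 5</p>",
}

# Hypothetical rules: a field name plus a regular expression with one capture group.
rules = [
    ("title", r"<h1>(.*?)</h1>"),
    ("price", r"Price: (\d+)"),
]


def harvest(pages, rules):
    """Apply every rule to each page in turn and collect one row per page."""
    rows = []
    for name, text in pages.items():
        row = {"page": name}
        for field, pattern in rules:
            match = re.search(pattern, text)
            row[field] = match.group(1) if match else None
        rows.append(row)
    return rows


for row in harvest(pages, rules):
    print(row)
```

Each page yields one row of extracted values, and the loop ends when the pages run out.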

Normal mode
Parsing rules in normal mode are the most straightforward way to extract data. Each rule is applied from the beginning of the page. So, if there are two rules in Parsing rules (normal mode), HH executes both rules from the start of the page. This may seem insignificant now, but bear it in mind, as it becomes important later on.

Advanced mode
Harvesting in advanced mode is also done by applying rules. The main difference is that when a rule is applied, the position it reached in the page is remembered, rather than every rule starting again from the beginning of the page. This means that you can set up far more advanced harvests than would be possible in normal mode.
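The difference between the two modes can be sketched in Python. This is a simplified model and not HH's implementation: regular expressions stand in for rules, the function names are hypothetical, and leaving the position unchanged when a rule finds nothing is an assumption made for the example.

```python
import re

# Hypothetical page with two records that look alike.
page = "<b>Name</b> Alice <b>Name</b> Bob"

# The same (hypothetical) rule listed twice.
rules = [r"<b>Name</b> (\w+)", r"<b>Name</b> (\w+)"]


def normal_mode(page, rules):
    """Every rule searches from the start of the page."""
    results = []
    for pattern in rules:
        match = re.search(pattern, page)
        results.append(match.group(1) if match else None)
    return results


def advanced_mode(page, rules):
    """Each rule searches from where the previous rule stopped."""
    results = []
    position = 0
    for pattern in rules:
        match = re.compile(pattern).search(page, position)
        if match:
            results.append(match.group(1))
            position = match.end()  # remember the position for the next rule
        else:
            results.append(None)  # assumption: position stays where it was
    return results


print(normal_mode(page, rules))    # ['Alice', 'Alice']
print(advanced_mode(page, rules))  # ['Alice', 'Bob']
```

In this model the normal-mode rules both find the first name because each starts from the top of the page, while the second advanced-mode rule picks up after the first match and finds the second name.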

To get acquainted with HH, we suggest you learn normal mode first before you try advanced mode.

