Extracting data (Parsing Rules)
Parsing rules in Normal mode
Rules
Get Text between [text before target] and [text after target]. Find [Number of instances] instance(s). Column name for results: [column name].
Get Text harvests all strings in the web
page source that lie between [text
before target] and [text after target].
Each result is placed on a new row in
the same column.
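The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the program's own code: the function name get_text_between and the sample page are made up for the example, and wildcards are not handled here.

```python
def get_text_between(source, before, after, max_instances):
    """Collect up to max_instances strings found between `before` and `after`."""
    results = []
    pos = 0
    while len(results) < max_instances:
        start = source.find(before, pos)
        if start == -1:
            break                      # no further [text before target]
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break                      # no closing [text after target]
        results.append(source[start:end])
        pos = end + len(after)         # continue searching after this hit
    return results


page = "<li><b>Apple</b></li><li><b>Pear</b></li><li><b>Plum</b></li>"
rows = get_text_between(page, "<b>", "</b>", 2)
print(rows)  # ['Apple', 'Pear']
```

Each element of the returned list corresponds to one new row in the results column.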
Get Table data from HTML table. Use table number [Table Number].
Gets all data from a particular HTML
table. For instance, if [Table Number]
is 2, it harvests the second HTML table
in the source. It extracts all data from
that table and places it in the
appropriate columns and rows. This is a
fast and easy function. Often you do not
even need to inspect the source to find
the table number: start at number 1, hit
Harvest and see if the data rolls in. If
it doesn't, try table number 2, and so
on until you get the data you are after.
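As a rough illustration of what picking the n-th table means, here is a minimal Python sketch using the standard html.parser module. The class TableGrabber and the function get_table are hypothetical names, and the sketch assumes tables are not nested inside each other; how the program itself counts nested tables is not stated here.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collects the cell text of the n-th <table> (counting from 1) as rows."""

    def __init__(self, table_number):
        super().__init__()
        self.table_number = table_number
        self.seen = 0          # number of <table> tags opened so far
        self.active = False    # True while inside the wanted table
        self.in_cell = False   # True while inside a <td> or <th> of that table
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.seen += 1
            self.active = (self.seen == self.table_number)
        elif self.active and tag == "tr":
            self.rows.append([])
        elif self.active and tag in ("td", "th"):
            if not self.rows:          # cell without an enclosing <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.active = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.active and self.in_cell:
            self.rows[-1][-1] += data.strip()


def get_table(source, table_number):
    parser = TableGrabber(table_number)
    parser.feed(source)
    return parser.rows


html = ("<table><tr><td>menu</td></tr></table>"
        "<table><tr><th>Name</th><th>Price</th></tr>"
        "<tr><td>Tea</td><td>3</td></tr></table>")
print(get_table(html, 2))  # [['Name', 'Price'], ['Tea', '3']]
```

Trying table number 1 on this sample returns only the menu cell, which mirrors the try-the-next-number approach described above.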
Get Pictures between [text before target] and [text after target] of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].
Get Pictures downloads all pictures in
<IMG> tags and saves them in [folder
name] on your hard disk. It only scans
the text placed between [text before
target] and [text after target]. You can
save all images or just those of the
type specified in [picture type]. The
resulting file name is stored in column
[column name].
Get All Pictures of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].
Get All Pictures downloads all pictures
in <IMG> tags and saves them in [folder
name] on your hard disk. You can save
all images or just those of the type
specified in [picture type]. The
resulting file name is stored in column
[column name].
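The selection step of the picture rules (find the <IMG> tags, then keep ALL or one type) might look roughly like the Python sketch below. The names ImgCollector and picture_urls and the example.com address are assumptions made for the illustration, and the actual download is left out. Matching the type by file extension is also an assumption about how [picture type] is interpreted.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":               # tag names arrive lower-cased
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def picture_urls(source, page_url, picture_type="ALL"):
    """Return absolute picture urls, optionally filtered by extension."""
    collector = ImgCollector()
    collector.feed(source)
    urls = [urljoin(page_url, src) for src in collector.sources]
    if picture_type.upper() == "ALL":
        return urls
    wanted = "." + picture_type.lower()   # "JPG" matches .jpg, not .jpeg
    return [u for u in urls
            if posixpath.splitext(urlparse(u).path)[1].lower() == wanted]


page = '<IMG SRC="a.jpg"><img src="/img/b.png"><img src="c.JPG">'
base = "http://example.com/shop/index.html"
print(picture_urls(page, base, "JPG"))
print(picture_urls(page, base))
```

For Get Pictures rather than Get All Pictures, the same function could be applied to just the part of the source between [text before target] and [text after target].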
Save url. Column name for results: [column name].
Records the url of the web page that is
currently being harvested. The url is
stored in column [column name]. This
gives you a reference showing which
results were harvested from which web
page (url).
Save source file in folder [folder name]. Column name for results: [column name].
Saves the complete web page source in a
file on your hard disk. The file name is stored in column [column name].
Pages are named page0.html to
page[n].html.
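The naming scheme can be pictured with the short Python sketch below. It is an assumption-laden illustration: the function save_source is hypothetical, the folder is a temporary directory created for the example, and whether the column holds the bare file name or a full path is not specified above (the sketch stores the bare name).

```python
import os
import tempfile


def save_source(source, folder, page_index):
    """Write the page source to folder/page<index>.html; return the file name."""
    os.makedirs(folder, exist_ok=True)
    file_name = f"page{page_index}.html"
    with open(os.path.join(folder, file_name), "w", encoding="utf-8") as handle:
        handle.write(source)
    return file_name


folder = tempfile.mkdtemp()            # stands in for [folder name]
pages = ["<html>first</html>", "<html>second</html>"]
column = [save_source(src, folder, i) for i, src in enumerate(pages)]
print(column)  # ['page0.html', 'page1.html']
```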
Variables
text before target:
the text directly preceding the thing
you want to find. Use text that is
unique to this result (e.g. <td> on its
own is not a good string; make it more
specific). You can use * to represent
any string in between. For example, "<a
href=*>*</a>" matches all hyperlinks in
a document. To harvest all links in a
web page, use Get Text between
<a href=*> and </a>.
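One way to think about the * wildcard is as a translation into a regular expression. The Python sketch below shows that idea; it assumes that * matches as little text as possible, which is a guess about the program's behaviour, and the names wildcard_to_regex and get_text_wild are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Escape the literal parts of the pattern and turn each * into 'any string'."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return ".*?".join(literal_parts)   # assumption: * matches as little as possible


def get_text_wild(source, before, after):
    """Return every string found between `before` and `after` (both may use *)."""
    regex = re.compile(
        wildcard_to_regex(before) + "(.*?)" + wildcard_to_regex(after),
        re.DOTALL,                     # let matches span line breaks
    )
    return regex.findall(source)


page = '<a href="/one">First</a> text <a href="/two">Second</a>'
print(get_text_wild(page, "<a href=*>", "</a>"))  # ['First', 'Second']
```

With these inputs the result is the text of each hyperlink, because the target is whatever lies between the opening tag and </a>.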
text after target:
the text directly after the target you
are trying to find.
column name:
the name for the results column; it
appears above the results. Use the same
name to put results in the same column.
Number of instances: the maximum number
of results you want to find. Just put a
large number if you want to find as many
as possible (e.g. 10000).
picture type: use ALL if you want
all picture types, or a single type such
as JPG to get just that type.
folder name: a directory on your
hard drive.
Table Number:
tells the function to harvest data from
the n-th table in the html source, where
n is the Table Number.