Extracting data (Parsing Rules)
Parsing rules in Advanced mode
Rules
Skip To Text
between [skip target].
Skip To Text advances the text position forward until it encounters [skip target]. The text position is then placed directly after [skip target].
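To make the cursor behaviour concrete, here is a minimal Python sketch of what Skip To Text does. It is not the tool's own implementation. The function name `skip_to_text`, and the choice to leave the position unchanged when the target is missing, are assumptions made for illustration.

```python
def skip_to_text(source: str, position: int, skip_target: str) -> int:
    """Return the text position directly after the next skip_target."""
    found = source.find(skip_target, position)
    if found == -1:
        # Assumption: if the target is not found, the position does not move.
        return position
    return found + len(skip_target)


page = "<html><h1>Prices</h1><td>42</td></html>"
pos = skip_to_text(page, 0, "<h1>")
# page[pos:] now starts with the text that follows "<h1>"
```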
Get Text
between [text before target] and [text after target]. Column name for result: [column name].
Get Text harvests all strings in the web page source that lie between [text before target] and [text after target]. Each result is placed on a new row in the same column.
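The harvesting loop can be sketched in Python as follows. This is an illustration only: the function name `get_text` is hypothetical, the wildcard (*) is not handled here, and a plain list stands in for the rows of the result column.

```python
def get_text(source: str, before: str, after: str) -> list[str]:
    """Collect every string found between `before` and `after`."""
    rows = []
    pos = 0
    while True:
        start = source.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break
        rows.append(source[start:end])   # one result per row
        pos = end + len(after)           # continue after this match
    return rows


page = '<td class="price">10</td><td class="name">x</td><td class="price">25</td>'
prices = get_text(page, '<td class="price">', "</td>")
```

Note how a specific [text before target] such as `<td class="price">` skips the unrelated `<td class="name">` cell.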
Get Text Until
read (store) text until [text after target] is found. Column name for result: [column name].
Get Text Until reads text from the current position forward until it encounters [text after target]. The text read is stored in the column [column name].
This is useful when, after a Skip To Text rule, you want to start extracting data directly from the current text position. Get Text cannot do this, because it requires a [text before target].
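A rough Python sketch of Get Text Until, starting from a position that a previous skip has already reached. The name `get_text_until` is hypothetical, and moving the position past [text after target] afterwards is an assumption, since the page does not say where the cursor ends up.

```python
def get_text_until(source: str, position: int, after: str):
    """Read from `position` up to `after`; return (text, new_position)."""
    end = source.find(after, position)
    if end == -1:
        # Assumption: nothing is read and the position stays put.
        return None, position
    # Assumption: the position moves past the after-target.
    return source[position:end], end + len(after)


page = "Name: Alice<br>Age: 30<br>"
# Equivalent of a Skip To Text rule with skip target "Name: "
pos = page.find("Name: ") + len("Name: ")
name, pos = get_text_until(page, pos, "<br>")
```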
Get Next URL.
Column name for URL: [column name]. Column name for Description: [column name].
Get Next URL finds the next URL (href definition) after the current page position. It stores the URL itself (the href part) in the URL column and the description of the link (the part between <a href=...> and </a>) in the Description column. URLs are automatically completed to full-path URLs (e.g. http://www.server.com/img/img001.jpg instead of ../img001.jpg).
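The two parts of this rule, finding the next link and completing a relative URL, can be sketched in Python with the standard `re` and `urllib.parse` modules. The function name `get_next_url`, the regular expression and the example page address are assumptions for illustration; the tool's own matching may differ.

```python
import re
from urllib.parse import urljoin

# Simplified link pattern: captures the href value and the link text.
LINK_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def get_next_url(source: str, position: int, page_url: str):
    """Return (full_url, description, new_position) or None if no link follows."""
    match = LINK_RE.search(source, position)
    if match is None:
        return None
    full_url = urljoin(page_url, match.group(1))  # complete relative URLs
    return full_url, match.group(2), match.end()


page = '<p>See <a href="../img001.jpg">first image</a></p>'
# Hypothetical address of the page being parsed
result = get_next_url(page, 0, "http://www.server.com/img/list/index.html")
```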
Get Next Picture
of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].
Get All Pictures downloads all pictures in <IMG> tags and saves them in [folder name] on your hard disk. You can save all images or just those of the type specified in [picture type]. The resulting file name is stored in column [column name].
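The selection step of the picture rule (which images qualify for a given [picture type]) might look like the Python sketch below. It deliberately leaves out the download and the saving to [folder name]. The function name `select_pictures` and the matching by file extension are assumptions, not the tool's documented behaviour.

```python
import re

# Simplified pattern: captures the src value of each <IMG> tag.
IMG_RE = re.compile(r'<img\s[^>]*?src\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def select_pictures(source: str, picture_type: str = "ALL") -> list[str]:
    """Return the src of every <IMG> tag matching picture_type."""
    sources = IMG_RE.findall(source)
    if picture_type.upper() == "ALL":
        return sources
    # Assumption: the picture type is compared with the file extension.
    suffix = "." + picture_type.lower()
    return [src for src in sources if src.lower().endswith(suffix)]


page = '<IMG SRC="a.jpg"><img src="b.png"><img alt="x" src="c.JPG">'
jpgs = select_pictures(page, "JPG")
everything = select_pictures(page, "ALL")
```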
Text Block: refer to the page explaining the Text Block here.
Repeat: refer to the page explaining the Repeat structures here.
Download Next Web Link: refer to the page explaining this rule here.
Save URL.
Column name for result: [column name].
Simply saves the current page's URL. Useful when you want to keep track of which page the data was extracted from.
Show Message
with content [message text].
Pops up a message box with either the Text Position, the Last Harvested Text or the Last Text Block content. Primarily used for debugging a script: you can check whether the Text Position advances in a repeat structure, or inspect the contents of a text block.
Variables
skip target:
the text directly preceding the point
you want to skip to.
text before target:
the text directly preceding the thing you want to find. Use text that is unique to this result (e.g. <td> is not a good string; make it more specific). You can use * to represent any string in between. In this fashion "<a href=*>*</a>" matches all hyperlinks in a document. For example, to harvest all links in a web page, use Get Text between <a href=*> and </a>.
text after target:
the text directly after the target you
are trying to find.
column name:
the name for the results column; it appears above the results. Use the same name to put results in the same column.
picture type:
use ALL if you want all picture types, or JPG or any other type for just that type.
folder name:
a directory on your hard drive.
message text:
what the message box shows: either Text Position, Last Harvested Text or Last Text Block content.
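The * wildcard described under text before target can be approximated in Python by translating the pattern into a regular expression. This is a sketch under assumptions: the names `wildcard_to_regex` and `get_text_wild` are hypothetical, and the shortest-possible match for * is an assumed behaviour that the page does not spell out.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Escape the literal parts and let each * match any string."""
    # Assumption: * matches as little text as possible (non-greedy).
    return ".*?".join(re.escape(part) for part in pattern.split("*"))


def get_text_wild(source: str, before: str, after: str) -> list[str]:
    """Get Text with * allowed in both targets."""
    regex = re.compile(
        wildcard_to_regex(before) + "(.*?)" + wildcard_to_regex(after),
        re.DOTALL,
    )
    return regex.findall(source)


page = '<a href="a.html">One</a> and <a href="b.html">Two</a>'
links = get_text_wild(page, "<a href=*>", "</a>")
```

Here `links` holds the link descriptions, which corresponds to the "Get Text between <a href=*> and </a>" example.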