/**************************
* GOLD SEEKER
* Data Extraction Tool
**************************/
== README FILE ================================================================
A generic data extraction script.
Licensed under the GNU LGPL (http://www.gnu.org/copyleft/lesser.html).
This script is provided without any warranty, express or implied.
Use it at your own risk.
It is a development version: uncommented, undebugged, and unfinished.
This documentation is also under construction.
GoldSeeker is a data extraction application. It was built to extract formatted data
from HTML files, but can be used with any kind of file.
Its behaviour is defined by a rule-based configuration file. It can process files on
the local server, or fetch web pages directly over HTTP.
GS is still in development and is neither complete nor stable; nevertheless, it can
already be used for simple extractions.
Use
===
Edit GSParser.php and set your MySQL connection parameters (if you don't want to use
the database export, just comment out the "dbConnect();" line). Check that the include
paths are correct, then launch sample1.php.
GS parameters:
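<?php
include('GSParser.php');
// $configFile: path to the rule-based configuration file
// $source:     data source path (see below)
// $sourceType: data source type (see below)
$gs = new GSParser($configFile, $source, $sourceType);
$gs->parse();
?>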
Data source type: 'singleFile' for... well, a single file; 'listOfFiles' for an
array of filenames; 'directory' for a whole directory.
Source path: a file path or an array of file paths.
Contact
=======
For questions, comments, contributions or anything else, please contact me via
my SourceForge page:
https://sourceforge.net/projects/goldseeker
Or, alternatively, at quentin.bouvart@gwd.fr.
All feedback is welcome!
GS Public Methods
=================
$gs->setVerbose(boolean)
------------------------
Tells GoldSeeker whether to display its results in the browser.
Config file format
==================
For more information, see the samples: sample1.gs, sample2.gs and sample1.php for the
first set; sample.php, sample.data and sample.gs for the second.
The first set gets the titles of the first 10 sites listed by Google for the query
"test". To export them to your database, you'll need the following table:
CREATE TABLE `googleTitles` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(255) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 ;
The second set doesn't require any table.
Single variable extraction
--------------------------
<<locate name=>>
<<begin>><</begin>>
<<end>><</end>>
<</locate>>
name is the name given to the extracted variable; it is used to refer to the variable
when exporting it.
Place between the <<begin>> tags the string that occurs just before the value to
extract, and between the <<end>> tags the string that follows it. It is recommended
to replace line breaks with \n or \r.
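As a minimal filled-in sketch, assume the goal is to pull the page title out of an
HTML file. The variable name pageTitle is purely illustrative, and the double quotes
around it are an assumption based on the quoting used for section names further down:
<<locate name="pageTitle">>
<<begin>><title><</begin>>
<<end>></title><</end>>
<</locate>>
With a source containing <title>My page</title>, this rule should leave "My page" in
pageTitle.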
OR Operator
-----------
<<locate name=>>
<<begin>><p class="p_grey"><<or>><p class="red"><<or>><p class="green"><</begin>>
<<end>></p><</end>>
<</locate>>
The <<or>> operator allows several possibilities to be specified between the
<<begin>> and <<end>> tags.
Extraction of a variable following another
------------------------------------------
<<locate name= startsafter=>>
<<begin>><</begin>>
<<end>><</end>>
<</locate>>
Put in startsafter the name of the previous variable, preceded by a $ (NOT within
double quotes).
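A hypothetical example, assuming a variable named pageTitle was extracted by an
earlier locate rule and that the markup being searched is invented for illustration.
Presumably the search for author only starts after the place where pageTitle was found:
<<locate name="author" startsafter=$pageTitle>>
<<begin>><span class="author"><</begin>>
<<end>></span><</end>>
<</locate>>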
Loop extraction
---------------
<<locate name=>>
<<begin>><</begin>>
<<section name=>>
<<begin>><</begin>>
<<end>><</end>>
<<endofline>><</endofline>>
<</section>>
<<end>><</end>>
<</locate>>
Here the name parameter of the locate tag isn't mandatory; the value assigned to this
variable is the whole loop. The endofline parameter is also optional; it is only
useful when the begin and end tags aren't enough to detect the end of a line.
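A sketch of a filled-in loop, assuming the source contains an HTML list and that each
<li> entry should be extracted in turn. The names menu and item are illustrative, and
endofline is left out because it is optional:
<<locate name="menu">>
<<begin>><ul><</begin>>
<<section name="item">>
<<begin>><li><</begin>>
<<end>></li><</end>>
<</section>>
<<end>></ul><</end>>
<</locate>>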
Loop extraction of several variables
------------------------------------
<<locate name=>>
<<begin>><</begin>>
<<section name="field1,field2">>
<<begin>><</begin>>
<<end>><</end>>
<<begin>><</begin>>
<<end>><</end>>
<<endofline>><</endofline>>
<</section>>
<<end>><</end>>
<</locate>>
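A sketch for a two-column HTML table, assuming that each begin/end pair maps, in
order, to the names listed in the section's name parameter. The names priceList,
product and price, and the markup itself, are invented for illustration:
<<locate name="priceList">>
<<begin>><table><</begin>>
<<section name="product,price">>
<<begin>><td class="product"><</begin>>
<<end>></td><</end>>
<<begin>><td class="price"><</begin>>
<<end>></td><</end>>
<<endofline>></tr><</endofline>>
<</section>>
<<end>></table><</end>>
<</locate>>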
Database export of values
-------------------------
<<action insert table="" fields="" values=""/>>
The table has to exist.
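A hypothetical example using the googleTitles table described earlier. The way an
extracted variable is referenced in values (here $title, borrowed from the startsafter
syntax) is an assumption; check sample1.gs for the exact form:
<<action insert table="googleTitles" fields="title" values="$title"/>>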
Extraction with a nested loop
-----------------------------
<<locate>>
<<begin>><</begin>>
<<section>>
<<begin>><</begin>>
<<section name=>>
<<begin>><</begin>>
<<end>><</end>>
<<endofline>><</endofline>>
<</section>>
<<end>><</end>>
<<endofline>><</endofline>>
<</section>>
<<end>><</end>>
<</locate>>
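A sketch of a nested loop over an HTML table, assuming the outer section walks the
rows and the inner section extracts each cell. The name cell is illustrative, and both
endofline tags are left out on the assumption that they are optional here as well:
<<locate>>
<<begin>><table><</begin>>
<<section>>
<<begin>><tr><</begin>>
<<section name="cell">>
<<begin>><td><</begin>>
<<end>></td><</end>>
<</section>>
<<end>></tr><</end>>
<</section>>
<<end>></table><</end>>
<</locate>>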
Crawling (link following)
-------------------------
<<action crawl url="" configfile="">>
Displaying an extracted value [NOT IMPLEMENTED YET]
---------------------------------------------------
<<action display="">>
Replacement of a value in the file [NOT IMPLEMENTED YET]
--------------------------------------------------------
<<action replace from= to=>>
Please report spelling/grammatical mistakes at
https://sourceforge.net/projects/goldseeker or quentin.bouvart@gwd.fr. Thanks!