***********************
SpiderFish v2.0 README
***********************

SpiderFish is an all-purpose HTTP search-and-retrieval program that can be
configured for depth of search limits, lower limits of file sizes to download
and optional saving of text/html content encountered.

******************
License Agreement
******************

Copyright (C) 2003 Jason Kosowan

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


*********
Features
*********

SpiderFish v2.0 represents a significant step forward in our design and
incorporates many new features, including:

- the ability to run SpiderFish in command-line mode

- control over the size limits of downloads

- user can optionally save text/html pages to disk as well as binary files

- user can turn off the automatic directory-creation based on hostname

- starting referer when reading initial webpage can be set

- new HTTP socket class ensures that SpiderFish doesn't hang when reading
  from a slow or stalled server


Features carried over from v1.0 include:

- control over logging levels as well as optional no-logging setting

- downloaded files are saved under subdirectories based on the hostname
  they were retrieved from

- User-selectable download directory

- User-selectable search level


****************
Getting Started
****************

Before running SpiderFish for the first time, you must install it.
Installing SpiderFish is no different from installing and running any other
Java-based program. This short guide should answer the most basic questions.
If you are still stumped, please visit Sun's Java pages
(http://java.sun.com/) for further details.


Step 1: Installing Java

In order for SpiderFish to run correctly, you must have a Java Runtime
Environment (JRE) installed on your machine. The Runtime Environment installer
and all installation instructions can be found on Sun's Java site.

Alternatively, if your browser is set up for automatic installation, try
going to http://java.sun.com/getjava/. If the automated installation works, you
should see an installation progress and then an animated graphic of the Java
Coffee Cup logo.

You *must* have version 1.4 or greater to run SpiderFish.  If an earlier version
of Java is used, the SpiderFish application will display a message and exit.
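
For the curious, a version check like the one described above could look
something like the sketch below. This is not SpiderFish's actual startup code;
the class and method names are made up for illustration, and it assumes the
standard "java.specification.version" system property is what gets inspected.

```java
// Hypothetical sketch only: not taken from the SpiderFish source.
class JavaVersionCheck {

    /** Turns "1.4", "1.8" or "17" into a single comparable number (4, 8, 17). */
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Releases before Java 9 report "1.x"; later releases report just "x".
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    /** True when the reported version is at least 1.4, the stated minimum. */
    static boolean isSupported(String specVersion) {
        return majorVersion(specVersion) >= 4;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.specification.version");
        if (!isSupported(version)) {
            System.err.println("SpiderFish requires Java 1.4 or greater; found " + version);
            System.exit(1);
        }
        System.out.println("Java " + version + " is new enough.");
    }
}
```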


Step 2: Installing Spiderfish

After installing the JRE, download the latest SpiderFish zip file.  Unzip it
into the directory where you want your SpiderFish installation to reside.
After unzipping, you should see three files: "spiderfish.jar",
"spiderfish.bat" and "spiderfish.sh".

If your java installation is correct and up-to-date (minimum version is 1.4),
double-clicking the "spiderfish.jar" file should be enough to start SpiderFish.
If activating the jar-file does not run the application, you must use one of 
the included scripts, as detailed in the next sections.

If you are running Windows, the "spiderfish.bat" file can be used to start the
application. If you are running under *NIX or Mac OS X, the "spiderfish.sh"
script can be used to start the application.  Pick the correct one for your
operating system and load it into an editor of your choice (e.g. Notepad, vi,
Context).

In these scripts, you will notice two environment variables are being set
before the SpiderFish application is started:  PATH and CLASSPATH.  Both
variables need to be set correctly before SpiderFish can be run.


Step 3: Setting your PATH variable

Your PATH variable must include the location of the "java" executable on your
system (for Windows users, this is the "java.exe" file). For the default
installation of the Sun JRE in Windows, this is usually located under
C:\Program Files\Java\j2re1.x.x\bin, where x.x is the version number of your
JRE (the current version is j2re1.4.0).

As an example, if your JRE installation is located under

   C:\Program Files\Java\j2re1.4.0\bin

your new path in "spiderfish.bat" should be set like this:

   set PATH=C:\Program Files\Java\j2re1.4.0\bin;%PATH%


Step 4: Setting your CLASSPATH

Your CLASSPATH variable tells Java where to look for program classes. To make
Java aware of SpiderFish, the CLASSPATH variable should point to the location
of your spiderfish.jar file.

Example: If the location of your "spiderfish.jar" file is:

   C:\Spiderfish\spiderfish.jar

your CLASSPATH should be set like this:

   set CLASSPATH=C:\Spiderfish\spiderfish.jar;%CLASSPATH%

Once these changes are made, save the file and close your editor.  Since the
file is now saved, you do not need to go through these changes again. At this
point, running Spiderfish is done by simply

   double-clicking the "spiderfish.bat" file (for Windows users)

or

   executing the "spiderfish.sh" (for *NIX users)

and the SpiderFish window should come up.  Initially, the window's size and
position may be strange, but once you reposition and resize the window to your
liking, the application will remember those settings in the future.


*****************************
Basic Operating Instructions
*****************************

To search a particular webpage, put the webpage's URL into the Webpage textbox,
set the Download Directory, and push the "Go" button. All files found by
SpiderFish will be put into the download directory. The "Search to Level"
value tells SpiderFish how many levels of hyperlinks to follow.

A value of 1 means that no hyperlinks should be followed if they lead to another
webpage. Links that point from the main page to downloadable files (*.mp3,
*.avi, *.jpg, etc.) will be followed and the files will be downloaded.

A value of 2 means that all hyperlinks found on the starting page should be
followed, including ones that point to other pages. However, links from these
"second-level" pages should not be followed unless they lead to downloadable
files.

Increasing the value past 2 will behave similarly, telling Spiderfish how deep
through hyperlinks to go before stopping.

For faster searching, keep this number small. Setting the value above 2 may
take a long time to complete and may not find much more than a level-2 search
would!
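
To make the level rules above concrete, here is a small sketch that walks an
in-memory map of links instead of real webpages. It is only an illustration of
the rules as described, not the SpiderFish crawler itself: the class name, the
"site" map and the rule that anything ending in ".html" is a page are all
assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: not taken from the SpiderFish source.
class SearchLevelSketch {

    /** Assumption for this sketch: anything ending in ".html" is a webpage. */
    static boolean isPage(String url) {
        return url.endsWith(".html");
    }

    /**
     * Returns the files that would be downloaded, in discovery order.
     * "site" maps each page to the links found on that page.
     */
    static List<String> crawl(Map<String, List<String>> site, String start, int searchLevel) {
        List<String> downloads = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> pages = new ArrayDeque<>();
        Deque<Integer> levels = new ArrayDeque<>();
        pages.add(start);
        levels.add(1);          // the starting page is level 1
        seen.add(start);

        while (!pages.isEmpty()) {
            String page = pages.remove();
            int level = levels.remove();
            for (String link : site.getOrDefault(page, List.of())) {
                if (!seen.add(link)) {
                    continue;   // already handled this link
                }
                if (!isPage(link)) {
                    downloads.add(link);        // file links are always fetched
                } else if (level < searchLevel) {
                    pages.add(link);            // page links only while below the limit
                    levels.add(level + 1);
                }
            }
        }
        return downloads;
    }
}
```

With a start page that links to one file and one sub-page, a level of 1 yields
only the file on the start page, while a level of 2 also yields the files
linked from the sub-page.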


********************************
Advanced Operating Instructions
********************************

All of the advanced options are found on the Preferences screen (click the
File menu and select "Preferences").  Each control has tooltip help
associated with it, shown by hovering the mouse over the control.

The value under "Size Limit" represents the minimum size of a file that
SpiderFish will download.  Setting this to a large number (100 KB or more)
ensures that only large files will be downloaded.
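
As a rough sketch, the size check amounts to a comparison like the one below.
The names are invented for this example, and two details are assumptions
rather than documented behaviour: that 1 KB means 1024 bytes, and that a file
whose size is unknown gets skipped.

```java
// Hypothetical sketch only: not taken from the SpiderFish source.
class SizeLimitSketch {

    /** Assumption: 1 KB = 1024 bytes. The README does not say which convention is used. */
    static final long BYTES_PER_KB = 1024L;

    /**
     * @param contentLengthBytes size reported for the file, or -1 if unknown
     * @param sizeLimitKb        the "Size Limit" value, in KB
     */
    static boolean meetsSizeLimit(long contentLengthBytes, int sizeLimitKb) {
        if (contentLengthBytes < 0) {
            return false; // unknown size: skipping is an arbitrary choice for this sketch
        }
        return contentLengthBytes >= sizeLimitKb * BYTES_PER_KB;
    }
}
```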

The value under "Starting Referer" is the referer given to the server on the
initial connection to the starting webpage.  Normally, it is recommended that
this be left blank since, if no Starting Referer is given, SpiderFish will
use the Starting Page as the Starting Referer as well.
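
The fallback described above can be sketched in a few lines using the standard
java.net classes. The class and method names are hypothetical, and SpiderFish
uses its own HTTP socket class, so treat this as an illustration of the idea
rather than the real code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical sketch only: not taken from the SpiderFish source.
class RefererSketch {

    /** Falls back to the start page when no starting referer is configured. */
    static String effectiveReferer(String startPage, String startReferer) {
        if (startReferer == null || startReferer.trim().isEmpty()) {
            return startPage;
        }
        return startReferer.trim();
    }

    /** Prepares (but does not yet open) a connection carrying the Referer header. */
    static URLConnection prepare(String startPage, String startReferer) throws IOException {
        URLConnection conn = URI.create(startPage).toURL().openConnection();
        conn.setRequestProperty("Referer", effectiveReferer(startPage, startReferer));
        return conn;
    }
}
```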

The "Create subdirectories based on hostname" checkbox controls where the
downloaded files will be stored.  If this checkbox is selected and a file
is downloaded, the file will be placed in a subdirectory under the download
directory based on the hostname that the file is coming from.

For example:  We are downloading "waltz.mp3" from host "www.mymusic.com" and
the download directory is "C:\mystuff".  If the checkbox is selected, a
directory "C:\mystuff\mymusic.com" will be created and "waltz.mp3" will be
downloaded into it.

If the checkbox is unselected, the file will be put directly into the download
directory.  (This checkbox is primarily used if a large file volume is predicted
and some initial organization is needed to sort through the resulting files.)
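
Here is a sketch of how the target location could be worked out for the
"waltz.mp3" example above. The names are invented for illustration. Dropping a
leading "www." from the hostname is an assumption drawn only from that
example, where www.mymusic.com ends up in a "mymusic.com" directory.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch only: not taken from the SpiderFish source.
class HostDirSketch {

    /** Works out where a downloaded file would be stored. */
    static Path targetFor(String fileUrl, Path downloadDir, boolean hostDirs) {
        URI uri = URI.create(fileUrl);
        String path = uri.getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        if (!hostDirs) {
            return downloadDir.resolve(fileName);   // straight into the download directory
        }
        String host = uri.getHost();
        // Assumption based on the README example: strip a leading "www.".
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        return downloadDir.resolve(host).resolve(fileName);
    }

    public static void main(String[] args) {
        Path dir = Paths.get("mystuff");
        System.out.println(targetFor("http://www.mymusic.com/songs/waltz.mp3", dir, true));
        System.out.println(targetFor("http://www.mymusic.com/songs/waltz.mp3", dir, false));
    }
}
```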

The "Get Text Pages as well as binary files" checkbox will control whether or
not SpiderFish saves the text/html pages it encounters to disk.  If checked and
a text/html file is encountered, the file will be saved on the disk.


*****************************
Running in Command-Line Mode
*****************************

To run SpiderFish in command-line mode, two files must first be created: a
startup script and a configuration file.  The startup script is similar to
spiderfish.bat or spiderfish.sh and, like those files, sets the PATH and the
CLASSPATH and calls java to invoke the SpiderFish application.  To tell
SpiderFish to run in command-line mode, however, the name of a valid
SpiderFish configuration file must be supplied as a command-line argument.
The command will look like this:

   java spiderfish.SpiderFish [config_file_name]

The config file is a normal text file that gives the application all the
information it needs to start.  Each entry in the config file is a key and a
value of the form

   key_name = value_name

Comments can also be placed into the config file by using the "#" symbol.  The
following are valid keys for SpiderFish along with a brief explanation.  Note
that the only required entries are "start.page" and "download.dir".  All other
keys have defaults if they are absent.

   start.page - starting URL (no default, required)

   download.dir - directory to put downloaded files (no default, required)

   start.referer - starting referer (defaults to start.page)

   log.file - if not present, no logging will occur (default no logfile)

   log.level - values 0-3 where 0=no logging, 3=detailed logging (default 0)

   search.level - must be greater than or equal to 1 (default 1)

   host.dirs - must be YES or NO (default YES)

   get.text - must be YES or NO (default NO)

   size.limit - the minimum size (in KB) a file must have in order to be
   downloaded (default 20)
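
The keys and defaults above can be sketched with the standard
java.util.Properties class, whose file format looks the same as the one
described here ("key = value" lines and "#" comments). Whether SpiderFish
really loads its config this way is an assumption, and the class and method
names are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch only: not taken from the SpiderFish source.
class ConfigSketch {

    /** Parses config text and fills in the defaults listed in this README. */
    static Properties load(String configText) {
        Properties defaults = new Properties();
        defaults.setProperty("log.level", "0");
        defaults.setProperty("search.level", "1");
        defaults.setProperty("host.dirs", "YES");
        defaults.setProperty("get.text", "NO");
        defaults.setProperty("size.limit", "20");

        Properties config = new Properties(defaults);
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // The two required keys have no defaults.
        for (String required : new String[] {"start.page", "download.dir"}) {
            if (config.getProperty(required) == null) {
                throw new IllegalArgumentException("Missing required key: " + required);
            }
        }
        // start.referer falls back to start.page when absent.
        if (config.getProperty("start.referer") == null) {
            config.setProperty("start.referer", config.getProperty("start.page"));
        }
        return config;
    }
}
```

Calling getProperty("size.limit") on the result returns "20" unless the config
text sets its own value.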

Here is an example config file that downloads from the fictional website
mp3city.com and puts all files larger than 1 MB into our C:/Downloads/Music
directory.  Note the helpful comments:

<----BEGIN---->

# My config file to get some music
# Created March 3, 2003

start.page = http://www.mp3city.com/music/
download.dir = C:/Downloads/Music/
search.level = 2

# Size limit is 1 MB, since MP3s are usually bigger than this!
size.limit = 1000

# We don't need to download the HTML, just binary files...
get.text = NO

# All files will come from mp3city.com, so we don't need extra
# classification.  Turn off host directories.
host.dirs = NO

# Name our logfile after the current year, month, and day
log.file = C:/spiderfish_logs/log_%YYYY%%MM%%DD%.log

# Set log level to 1 just to see download failures
log.level = 1

<---- END ---->

I've slipped a little extra functionality in here.  Note the "log.file" entry.
We've used 3 constants in the logfile name: %YYYY%, %MM%, and %DD%.  This entry
will give us a logfile name that contains the year, month and day of its
creation.  For example, if the current date is 02/24/2003, the filename will be

   C:/spiderfish_logs/log_20030224.log

You can use constants like these in any SpiderFish configuration entry.  The
complete list of constants is given below:

%YYYY%    - The current 4-digit year
%YY%      - The current 2-digit year
%MM%      - The current 2-digit month (01-12)
%MON%     - The 3-letter month abbreviation (JAN, FEB, MAR)
%YDAY%    - The current day of the year
%DD%      - The current 2-digit day of the month (01-31)
%WW%      - The current weekday number (1-7)
%WEEKDAY% - The full name of the current weekday (Sunday - Saturday)
%HH%      - The current 2-digit hour (00-23)
%MI%      - The current 2-digit minute (00-59)
%SS%      - The current 2-digit second (00-59)
%HOME%    - The current user's home directory
%USER%    - The current username
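
For illustration, here is a sketch of how the date and time constants could be
expanded with the standard java.time classes. It covers only the numeric date
and time constants from the list above and leaves the rest out. The names are
hypothetical and this is not the substitution code SpiderFish actually uses.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch only: not taken from the SpiderFish source.
class ConstantsSketch {

    /** Replaces a subset of the %...% constants using the given date and time. */
    static String expand(String value, LocalDateTime now) {
        return value
            .replace("%YYYY%", fmt(now, "yyyy"))
            .replace("%YY%", fmt(now, "yy"))
            .replace("%MM%", fmt(now, "MM"))
            .replace("%DD%", fmt(now, "dd"))
            .replace("%HH%", fmt(now, "HH"))
            .replace("%MI%", fmt(now, "mm"))
            .replace("%SS%", fmt(now, "ss"));
    }

    private static String fmt(LocalDateTime now, String pattern) {
        return now.format(DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
    }

    public static void main(String[] args) {
        LocalDateTime feb24 = LocalDateTime.of(2003, 2, 24, 13, 5, 9);
        // Prints C:/spiderfish_logs/log_20030224.log
        System.out.println(expand("C:/spiderfish_logs/log_%YYYY%%MM%%DD%.log", feb24));
    }
}
```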

User Note: Some characters need to be escaped when being put into the config
file, otherwise they will be improperly interpreted.  As a special case, when
including the backslash character in a config file, you must always precede
it by another backslash character!  So the desired entry

   log.file = C:\mylogs\mylog.txt

must actually be put into the config file as

   log.file = C:\\mylogs\\mylog.txt

Fortunately, most modern versions of Windows also recognize the forward-slash
(/) as a file separator, so escaping characters should not be necessary in
most regular cases.
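
The escaping rule above matches how the standard java.util.Properties class
reads its files, so the sketch below uses that class to show the difference.
That SpiderFish reads its config through Properties is an assumption on my
part, and the class name is made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch only: assumes the config is read with java.util.Properties.
class BackslashSketch {

    /** Parses one config line and returns the value stored under "log.file". */
    static String readLogFile(String configLine) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(configLine));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config.getProperty("log.file");
    }

    public static void main(String[] args) {
        // Java source needs its own escaping, so every "\\" in these string
        // literals stands for ONE backslash in the config text.
        String single = "log.file = C:\\mylogs\\mylog.txt";       // single backslashes in the file
        String doubled = "log.file = C:\\\\mylogs\\\\mylog.txt";  // doubled backslashes in the file

        System.out.println(readLogFile(single));   // the backslashes are silently dropped
        System.out.println(readLogFile(doubled));  // the backslashes survive
    }
}
```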


**************************************
Developers, developers, developers...
**************************************

If you're a smart cookie, you've noticed that under your SpiderFish
installation directory there is a directory called "source".  Yes, you guessed
it, this is the source code for the SpiderFish application as well as the
source for the entire Kosowan.com Java base class library.

Under "source", the "spiderfish" package is all code specific to this
application and the "com.kosowan" package contains libraries intended to be
reused in other projects.

If you're interested in investigating the inner workings of SpiderFish, I
strongly recommend you generate the JavaDoc for the "spiderfish" and
"com.kosowan" packages as a first step. I'm a big believer in JavaDoc and
I've put in the time to explain the basics of my classes and methods.

If you'd like to contribute to the SpiderFish project, just drop me an email
and let me know.  There's a whole lot that still needs to be done!


********************
Contact Information
********************

For the latest updates, visit the SpiderFish Homepage at:
http://www.kosowan.com/spiderfish/

To contact Jason Kosowan, the author of SpiderFish, email to:
jasonkosowan@yahoo.com


Copyright © 2001-2005 Kosowan.com
