| Kosowan.com | Spiderfish FAQ |
***********************
SpiderFish v2.0 README
***********************

SpiderFish is an all-purpose HTTP search-and-retrieval program that can be configured with a search depth limit, a minimum size for downloaded files, and optional saving of any text/html content encountered.

******************
License Agreement
******************

Copyright (C) 2003 Jason Kosowan

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

*********
Features
*********

SpiderFish v2.0 represents a significant step forward in our design and incorporates many new features, including:

- the ability to run SpiderFish in command-line mode
- control over the size limits of downloads
- optional saving of text/html pages to disk as well as binary files
- the option to turn off automatic directory creation based on hostname
- a configurable starting referer for reading the initial webpage
- a new HTTP socket class that ensures SpiderFish doesn't hang when reading from a slow or stalled server

Features from v1.0 include:

- control over logging levels as well as an optional no-logging setting
- downloaded files are saved under subdirectories based on the hostname they were retrieved from
- user-selectable download directory
- user-selectable search level

****************
Getting Started
****************

Before running SpiderFish for the first time, you must install it.
Installation of the SpiderFish application is no different from installing and running any other Java-based program. Hopefully this small instruction page will help answer some basic questions. However, if you are still stumped, please visit the Sun Java pages (http://java.sun.com/) for further details.

Step 1: Installing Java

In order for SpiderFish to run correctly, you must have a Java Runtime Environment (JRE) installed on your machine. The Runtime Environment installer and all installation instructions can be found on Sun's Java site. Alternatively, if your browser is set up for automatic installation, try going to http://java.sun.com/getjava/. If the automated installation works, you should see an installation progress display and then an animated graphic of the Java Coffee Cup logo.

You *must* have version 1.4 or greater to run SpiderFish. If an earlier version of Java is used, the SpiderFish application will display a message and exit.

Step 2: Installing SpiderFish

After installing the JRE, download the latest SpiderFish zip file. Unzip it into the directory where you want your SpiderFish installation to reside. After unzipping, notice three files: "spiderfish.jar", "spiderfish.bat" and "spiderfish.sh".

If your Java installation is correct and up-to-date (minimum version is 1.4), double-clicking the "spiderfish.jar" file should be enough to start SpiderFish. If activating the jar file does not run the application, you must use one of the included scripts, as detailed in the next sections.

If you are running Windows, the "spiderfish.bat" file can be used to start the application. If you are running under *NIX or Mac OS X, the "spiderfish.sh" file can be used to start the application. Pick the correct one for your operating system and load it into an editor of your choice (e.g. Notepad, vi, Context, etc.).

In these scripts, you will notice that two environment variables are set before the SpiderFish application is started: PATH and CLASSPATH.
Both variables need to be set correctly before SpiderFish can be run.

Step 3: Setting your PATH variable

Your PATH variable must include the location of the "java" executable on your system (for Windows users, this is the "java.exe" file). For the default installation of the Sun JRE in Windows, this is usually located under:

    C:\Program Files\Java\j2re1.x.x\bin

where x.x is the version number of your JRE (the current version is j2re1.4.0).

As an example, if your JRE installation is located under

    C:\Program Files\Java\j2re1.4.0\bin

your new path in "spiderfish.bat" should be set like this:

    set PATH=C:\Program Files\Java\j2re1.4.0\bin;%PATH%

Step 4: Setting your CLASSPATH

Your CLASSPATH variable tells Java where to look for program classes. To make Java aware of SpiderFish, the CLASSPATH variable should point to the location of your spiderfish.jar file.

Example: If the location of your "spiderfish.jar" file is:

    C:\Spiderfish\spiderfish.jar

your CLASSPATH should be set like this:

    set CLASSPATH=C:\Spiderfish\spiderfish.jar;%CLASSPATH%

Once these changes are made, save the file and close your editor. Since the file is now saved, you do not need to go through these changes again.

At this point, you can run SpiderFish by simply double-clicking the "spiderfish.bat" file (for Windows users) or executing "spiderfish.sh" (for *NIX users), and the SpiderFish window should come up. Initially, the window's size and position may be strange, but if you reposition and resize the window to your liking, the application will remember and reuse those settings in the future.

*****************************
Basic Operating Instructions
*****************************

To search a particular webpage, put the webpage's URL into the Webpage textbox, set the Download Directory, and push the "Go" button. All files found by SpiderFish will be put into the download directory.

Changing the "Search to Level" value tells SpiderFish how deep to follow hyperlinks.
A value of 1 means that hyperlinks leading to another webpage are not followed. Links that point from the main page to downloadable files (*.mp3, *.avi, *.jpg, etc.) will be followed and the files will be downloaded.

A value of 2 means that all hyperlinks found on the starting page are followed, including ones that point to other pages. However, links from these "second-level" pages are not followed unless they lead to downloadable files.

Increasing the value past 2 behaves similarly, telling SpiderFish how deep through hyperlinks to go before stopping. For faster searching, keep this number small. A value above 2 may take a long time to complete and may not get you much more than a level-2 search!

********************************
Advanced Operating Instructions
********************************

All of the advanced options are found on the Preferences screen (click the File menu and select "Preferences"). Each control has tooltip help associated with it, activated by hovering the mouse over the control.

The value under "Size Limit" is the minimum size of a file that SpiderFish will download. Setting this to a large number (100 KB or more) ensures that only large files will be downloaded.

The value under "Starting Referer" is the referer given to the server upon the initial connection to the starting webpage. Normally, it is recommended that this be left blank: if no Starting Referer is given, SpiderFish will use the Starting Page as the Starting Referer as well.

The "Create subdirectories based on hostname" checkbox controls where the downloaded files will be stored. If this checkbox is selected and a file is downloaded, the file will be placed in a subdirectory under the download directory, named after the hostname that the file is coming from.

For example, suppose we are downloading "waltz.mp3" from host "www.mymusic.com" and the download directory is "C:\mystuff".
If the checkbox is selected, a directory "C:\mystuff\mymusic.com" will be created and "waltz.mp3" will be downloaded into it. If the checkbox is unselected, the file will be put directly into the download directory. (This checkbox is primarily useful when a large volume of files is expected and some initial organization is needed to sort through the results.)

The "Get Text Pages as well as binary files" checkbox controls whether or not SpiderFish saves the text/html pages it encounters to disk. If it is checked and a text/html file is encountered, the file will be saved to disk.

*****************************
Running in Command-Line Mode
*****************************

To run SpiderFish in command-line mode, two files must first be constructed.

The first file will be similar to the startup script (either spiderfish.bat or spiderfish.sh) and, as with these files, will set the PATH, set the CLASSPATH and call java to invoke the SpiderFish application. However, to tell SpiderFish to run in command-line mode, a command-line argument naming a valid SpiderFish configuration file must be supplied. The command will look like this:

    java spiderfish.SpiderFish [config_file_name]

The second file, the config file, is a normal text file that gives the application all the information it needs to start. The config file consists of keys and values of the form

    key_name = value_name

Comments can also be placed into the config file by using the "#" symbol.

The following are valid keys for SpiderFish, along with a brief explanation of each. Note that the only required entries are "start.page" and "download.dir". All other keys have defaults if they are absent.

    start.page    - starting URL (no default, required)
    download.dir  - directory to put downloaded files (no default, required)
    start.referer - starting referer (defaults to start.page)
    log.file      - if not present, no logging will occur (default no logfile)
    log.level     - values 0-3 where 0=no logging, 3=detailed logging (default 0)
    search.level  - must be greater than or equal to 1 (default 1)
    host.dirs     - must be YES or NO (default YES)
    get.text      - must be YES or NO (default NO)
    size.limit    - the min. size (in kb) of a file in order for it to be downloaded (default 20)

Here is an example config file that downloads from the fictional website mp3city.com and puts all files larger than 1MB into our C:/Downloads/Music directory. Note the helpful comments.

<----BEGIN---->
# My config file to get some music
# Created March 3, 2003
start.page = http://www.mp3city.com/music/
download.dir = C:/Downloads/Music/
search.level = 2

# Size limit is 1MB, since mp3's usually are bigger than this!
size.limit = 1000

# We don't need to download the HTML, just binary files...
get.text = NO

# All files will come from mp3city.com, so we don't need extra
# classification. Turn off host directories.
host.dirs = NO

# Name our logfile after the current year, month, and day
log.file = C:/spiderfish_logs/log_%YYYY%%MM%%DD%.log

# Set log level to 1 just to see download failures
log.level = 1
<---- END ---->

I've slipped a little extra functionality in here. Note the "log.file" entry. We've used 3 constants in the logfile name: %YYYY%, %MM%, and %DD%. This entry will give us a logfile name that contains the year, month and day of its creation. For example, if the current date is 02/24/2003, the filename will be

    C:/spiderfish_logs/log_20030224.log

You can use constants like these in any SpiderFish configuration entry.
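
As a rough sketch of how constants like the ones in the "log.file" entry above might be expanded: the code below is not taken from the SpiderFish source. The class name ConstantExpansionSketch, the expand method and the mapping of constants to java.time patterns are all assumptions made for illustration, only a handful of the date and time constants are covered, and java.time requires Java 8 or later (well past the 1.4 minimum mentioned earlier).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not SpiderFish's own code.
class ConstantExpansionSketch {

    // Assumed mapping from a few README constants to java.time patterns.
    private static final Map<String, String> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("%YYYY%", "yyyy"); // 4-digit year
        PATTERNS.put("%YY%", "yy");     // 2-digit year
        PATTERNS.put("%MM%", "MM");     // 2-digit month (01-12)
        PATTERNS.put("%DD%", "dd");     // 2-digit day of the month (01-31)
        PATTERNS.put("%HH%", "HH");     // 2-digit hour (00-23)
        PATTERNS.put("%MI%", "mm");     // 2-digit minute (00-59)
        PATTERNS.put("%SS%", "ss");     // 2-digit second (00-59)
    }

    // Replaces every known constant in a config value with the matching
    // field of the supplied date and time.
    static String expand(String value, LocalDateTime now) {
        String result = value;
        for (Map.Entry<String, String> entry : PATTERNS.entrySet()) {
            String text = now.format(DateTimeFormatter.ofPattern(entry.getValue()));
            result = result.replace(entry.getKey(), text);
        }
        return result;
    }

    public static void main(String[] args) {
        // The README's example date, 02/24/2003.
        LocalDateTime date = LocalDateTime.of(2003, 2, 24, 0, 0);
        System.out.println(expand("C:/spiderfish_logs/log_%YYYY%%MM%%DD%.log", date));
    }
}
```

Passing the date in as a parameter, rather than reading the clock inside expand, keeps the sketch easy to check against the 02/24/2003 example.
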
The complete list of constants is given below:

    %YYYY%    - The current 4-digit year
    %YY%      - The current 2-digit year
    %MM%      - The current 2-digit month (01-12)
    %MON%     - The 3-letter month abbreviation (JAN, FEB, MAR)
    %YDAY%    - The current day of the year
    %DD%      - The current 2-digit day of the month (01-31)
    %WW%      - The current weekday number (1-7)
    %WEEKDAY% - The full name of the current weekday (Sunday - Saturday)
    %HH%      - The current 2-digit hour (00-23)
    %MI%      - The current 2-digit minute (00-59)
    %SS%      - The current 2-digit second (00-59)
    %HOME%    - The current user's home directory
    %USER%    - The current username

User Note: Some characters need to be escaped when put into the config file; otherwise they will be interpreted incorrectly. As a special case, when including the backslash character in a config file, you must always precede it with another backslash character! So the desired entry

    log.file = C:\mylogs\mylog.txt

must actually be put into the config file as

    log.file = C:\\mylogs\\mylog.txt

Fortunately, most modern versions of Windows also recognize the forward slash (/) as a file separator, so escaping should not be necessary in most regular cases.

**************************************
Developers, developers, developers...
**************************************

If you're a smart cookie, you've noticed that under your SpiderFish installation directory there is a directory called "source". Yes, you guessed it, this is the source code for the SpiderFish application as well as the source for the entire Kosowan.com Java base class library. Under "source", the "spiderfish" package holds all code specific to this application and the "com.kosowan" package contains libraries intended to be reused in other projects.

If you're interested in investigating the inner workings of SpiderFish, I strongly recommend you generate the JavaDoc for the "spiderfish" and "com.kosowan" packages as a first step.
I'm a big believer in JavaDoc and I've put in the time to explain the basics of my classes and methods.

If you'd like to contribute to the SpiderFish project, just drop me an email and let me know. There's a whole lot that still needs to be done!

********************
Contact Information
********************

For the latest updates, visit the SpiderFish Homepage at:

    http://www.kosowan.com/spiderfish/

To contact Jason Kosowan, the author of SpiderFish, send email to:

    jasonkosowan@yahoo.com

Copyright © 2001-2005 Kosowan.com