README
Note: This is an old version of the README file.
Database Advisor Last Updated: October 7, 1998
Copyright (c) 1998 Regents of the University of California
--------------------------------------------------------------------------------
Table of Contents:
------------------
I. Overview
II. License Information
III. Required Files
IV. Other Vital Files
V. How the pieces fit together
A. The Engine
B. Server Pushing: Netscape vs Microsoft
C. The Database Interfaces
D. The Profiles
E. The Subject Files
VI. How to Add a Database Interface
VII. Error Messages in Database Interfaces
VIII. The Message Queue
IX. Signal Handling
X. Running Multiple Libraries off the Same Engine
XI. Passwords
Appendix - Authors and Contact Points
--------------------------------------------------------------------------------
Section I:
Overview:
-----------------
This software was developed by the web programmers of the Science Libraries
at the University of California, San Diego Campus.
Database Advisor (DBA) was created to help database users select the
best database for their query. DBA spawns a search process for each
database vendor and returns the number of hits on the query to the user.
It sorts these results so the user can see where each database stands
relative to the others. Each database has a link which can be followed to
access the database (though the terms of usage that the vendor sets still
apply). Each database also has a Profile which stores information about it.
--------------------------------------------------------------------------------
Section II:
License Information:
--------------------
Copyright (c) 1998 Regents of the University of California
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License (available at
http://www.gnu.ai.mit.edu/copyleft/gpl.html) for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
This software was developed by the Science Libraries at the University of
California, San Diego. For more information, contact Christy Hightower
at the Science & Engineering Library, 0175E,
University of California, San Diego, 9500 Gilman Drive, La Jolla,
California, 92093-0175.
--------------------------------------------------------------------------------
Section III:
Required Files:
---------------
CGI packages
- These packages, found in Perl 5, include useful modules such as
URL.pm, CGI.pm, Headers.pm, Request.pm, etc.
dbaLocal.pl
- This file holds a group of directory identifiers.
dbaPasswd.pl
- This contains the variables which hold your user names and passwords
for various databases.
Display.pl
- Display.pl has all the different display strings which are used
in the program (e.g., YouSearchedFor and dbaFooter). Because they are
collected in this file, you can support several different displays.
loaddb.pl
- This file holds the functions which will load the database into
a hash which the various database interfaces will access.
nph-dba
- This is the engine of Database Advisor. It has several auxiliary
files which it needs in order to run.
nph-profile
- This file generates the profiles from the .db files found in the
/dba/dbfiles directory.
push.pl
- This handles the underlying code for the server pushes.
stopwords.db
- This is the database file which stopwords.pl will parse.
stopwords.pl
- This parses a list of various "stopwords": words, symbols, or phrases
which will cause problems in the search engines (e.g., AND, OR, &).
subjects.db
- This has a list of all the subjects and the databases which are
classified under each of them.
subjects.pl
- subjects.pl reads subjects.db and parses out the various subjects,
and therefore determines which databases will be searched.
DBA dependencies:
Here is information on obtaining Perl 5 and the LWP modules which
are required to run DBA. A Z39.50 client is required to run the Melvyl
module as well as other Z39.50 databases. Once you build
zclient, it should be installed in this directory under z39
as zclient.
Z39.50 Client
----------------------------------------------------------------------
"the Z39.50 API zclient" software for Z39.50 connections.
You can obtain the code from
http://lindy.stanford.edu/~hrf/z3950/www_gateway.html
This code does not come from the UCSD DBA project. Its source is:
name: Finkbeiner, Harold R
e-mail: harold_finkbeiner@Stanford.Edu
department: Info Tech Systems & Services
position: Sys Sw Developer,Prin
address: Polya Hall, Rm. 208
phone: (650) 725-3353
fax: (650) 723-3253
mail-code: 4136
date-updated: Dec 12 1997 12:15AM
----------------------------------------------------------------------
GNU C compiler
----------------------------------------------------------------------
Needed to compile the zclient code for your machine.
Go to http://www.delorie.com/gnu/ for more information on the
GNU project and a list of FTP sites for GNU software.
----------------------------------------------------------------------
perl
----------------------------------------------------------------------
Perl, version 5.004_01 or later.
You can get the latest version of Perl and the modules listed
below from:
http://www.perl.com
modules:
CGI.pm
the LWP (formerly known as "libwww") Module
(this includes: HTTP::Request; LWP::UserAgent; URI::URL)
Note that LWP has its own dependencies, which are documented
when you download and install it. These Perl modules are also
available from the www.perl.com site above.
The LWP prerequisites include:
MIME-Base64
HTML-Parser
libnet
MD5
wwwurl.pl (should be included with your perl distribution)
cgi-lib.pl (available from
http://www.seas.upenn.edu/~mengwong/forms/cgi-lib.pl.txt)
--------------------------------------------------------------------------------
Section IV:
Other Vital Files:
------------------
Various profiles
- The profiles are what users see when they would like to know
more about the database in question. They also contain useful
information about which URLs to use, as well as short descriptions
and the subjects that each database belongs to.
Various Database Interfaces
- Without the database interfaces, DBA does nothing. These scripts
actually contact the vendor and request the information. There
are three ways they can do this: Z39.50, telnet, and web. The Z39.50
interfaces are fairly simple, using a single connection, query,
and response. The telnet and web interfaces are based on pattern
matching and take longer as they must first traverse several layers
of information before being able to input the desired query.
z3950.pl
- This contains the commands to use when connected to a Z39.50 server
--------------------------------------------------------------------------------
Section V:
How the pieces fit together:
----------------------------
-------------
A. The Engine
-------------
To understand Database Advisor (DBA), you must first understand the engine.
The engine keeps track of various things:
Timing - How long will DBA run before timing out?
Library id - Which Library is running DBA, and therefore which
display strings and databases should we be using?
Subjects - Which subjects did the user select, and therefore which
databases did they select?
Logging - Should we log this session in the DBA logfile?
Server Pushing - What information should we send to the "push" functions?
Child Processes - Which child processes (searches) have checked back
in during the time allotment?
IP Checking - Should we warn users that they are not using an on-campus
machine (and therefore might not have access to the databases)?
The main job of the engine, however, is to run the database interfaces and
wait for their replies. Each database interface contains a fork, which allows
the main engine to continue running while the new search process gathers the
data. The engine runs in a loop until the time allotment has expired. While
in the loop, it watches a message queue for inter-process communications.
When a message arrives, the engine unpacks the data and sorts it, together
with the previously received messages, by number of hits. Then it performs
a server push, which allows the user to see the data as it comes in.
After the time limit is up, the engine considers any remaining processes to
be timed out and performs a final server push with this new information.
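The loop described above can be sketched in Perl roughly as follows. This is
a minimal illustration, not the code in nph-dba. The names rank_results and
run_engine, the $next_reply and $push callbacks (stand-ins for reading the
message queue and for the "push" functions), and the early exit once every
database has replied are all assumptions made for the example.

```perl
use strict;
use warnings;

# Order database names by hit count, highest first. Ties fall back to
# alphabetical order so repeated pushes show a stable list.
sub rank_results {
    my ($hits) = @_;                        # hash ref: name => hit count
    return sort { $hits->{$b} <=> $hits->{$a} or $a cmp $b } keys %$hits;
}

# Collect replies until the deadline passes.
#   $expected   - array ref of database names that were started
#   $deadline   - epoch time at which the engine gives up
#   $next_reply - hypothetical callback: returns (name, hits), or an
#                 empty list when no message is waiting
#   $push       - hypothetical callback: receives the ranked names and
#                 the list of timed-out names
sub run_engine {
    my ($expected, $deadline, $next_reply, $push) = @_;
    my %hits;
    my %pending = map { $_ => 1 } @$expected;

    # Assumption: stop early once nothing is pending.
    while (%pending and time() < $deadline) {
        my ($name, $count) = $next_reply->();
        if (!defined $name) {
            select(undef, undef, undef, 0.05);    # nothing yet, wait briefly
            next;
        }
        delete $pending{$name};
        $hits{$name} = $count;
        $push->([ rank_results(\%hits) ], []);    # intermediate push
    }

    # Anything still pending is treated as timed out.
    my @timed_out = sort keys %pending;
    $push->([ rank_results(\%hits) ], \@timed_out);   # final push
    return (\%hits, \@timed_out);
}
```

Re-sorting the whole set on every reply keeps each push self-contained,
which suits a browser that replaces the page with each push.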
----------------------------------------
B. Server Pushing: Netscape vs Microsoft
----------------------------------------
One of the nice features of Database Advisor is the "real time" hits display.
This is accomplished via Server Pushing. Unfortunately Netscape and Microsoft
deal with Server Pushing differently.
In Netscape Navigator, when we push information to the browser, the browser
clears the screen and displays the new information. In this way, we can send
the current results to the user, allowing them to halt the program and choose
a database (if desired).
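For readers unfamiliar with Server Pushing, the Netscape mechanism is a
multipart/x-mixed-replace response in which each part replaces the previous
one on screen. The sketch below shows that framing only. It is not taken
from push.pl: the function names and the boundary string are hypothetical.

```perl
use strict;
use warnings;

my $BOUNDARY = 'dbaPushBoundary';    # hypothetical boundary string

# Opening of the response. An nph- script talks to the browser directly,
# so it has to write the HTTP status line itself.
sub push_begin {
    return "HTTP/1.0 200 OK\r\n"
         . "Content-Type: multipart/x-mixed-replace;boundary=$BOUNDARY\r\n"
         . "\r\n"
         . "--$BOUNDARY\r\n";
}

# One page of results. Each part replaces the one before it in the
# browser. The last part is followed by the closing boundary.
sub push_page {
    my ($html, $is_last) = @_;
    my $closing = $is_last ? "--$BOUNDARY--\r\n" : "--$BOUNDARY\r\n";
    return "Content-Type: text/html\r\n\r\n$html\r\n$closing";
}

# Typical use (output must be unbuffered so each part leaves at once):
#   $| = 1;
#   print push_begin();
#   print push_page($current_results_html);
#   print push_page($final_results_html, 1);
```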
In Microsoft Internet Explorer (MSIE), the Server Push results in appending the
information onto the current document, creating a long and repetitive
results screen. We modified the "push" functions to handle MSIE by appending
only a message stating that results have been *received* from the database,
without specifying the number of hits. On the final server push, the engine
writes a temporary file and then refreshes the MSIE browser to that location.
***Note***: This creates files that need to be deleted using cron or some
other scheduled script.
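A cleanup script for those temporary files could look something like the
sketch below. It is not part of the DBA distribution. The function name
purge_old_files is hypothetical, and the directory and age limit are left
as parameters because this README does not say where the files are written.

```perl
use strict;
use warnings;

# Delete regular files in $dir that were last modified more than
# $max_age seconds ago. Returns the names of the files removed.
sub purge_old_files {
    my ($dir, $max_age) = @_;
    my $cutoff = time() - $max_age;

    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @removed;
    for my $name (sort readdir $dh) {
        my $path = "$dir/$name";
        next unless -f $path;                    # skip ., .. and subdirs
        my $mtime = (stat($path))[9];
        next unless defined $mtime and $mtime < $cutoff;
        push @removed, $name if unlink $path;
    }
    closedir $dh;
    return @removed;
}
```

Run from cron with a limit comfortably longer than one search session, so
that a page a user is still viewing is not removed.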
--------------------------
C. The Database Interfaces
--------------------------
The Database Interfaces do the actual work of Database Advisor. There are
three different types of database interfaces: Z39.50, telnet, and web.
The Z39.50 protocol is used by Melvyl(r) and other vendors to provide a quick
way to access the data in the databases. Usually, so little time is spent
running these searches that we run them serially instead of in parallel.
This cuts down the overhead of opening a Z39.50 connection for each
database to retrieve the information. For more technical info
on Z39.50, please see the Melvyl.pl file (which is an actual Z39.50 database
interface), or see the documentation on Z39.50.
The Telnet protocol is used for databases such as BIOSIS, where there is no
web version available as of yet. Most of these are currently being transferred
to the web. For these, we open a socket to the destination and wait until the
pattern we are looking for (often a prompt) appears. This is rather unreliable,
as we don't know when we have received all the information. For more technical
info on Telnet, see the Telnet.pl file (which is an actual Telnet database
interface).
The Web protocol is used by the rest of the databases. Each search is limited
by the speed of the webserver. Depending on the database implementation, it
may be possible to access the search engine directly from the web.
Sometimes it is necessary to travel through several pages to reach the
point where the search query can be input. The nice part of the web interfaces
is the HTTP packages, which let the interfaces contain very little
connection-handling code. Also, the information is all sent at once
(unlike telnet, where we don't know when the end of transmission is), and
can be parsed out after receiving the reply.
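Parsing the reply usually comes down to one pattern match. The sketch below
shows the idea. The function name parse_hits and the default pattern are
made up for illustration, since every vendor page words its result line
differently and needs its own pattern.

```perl
use strict;
use warnings;

# Pull a hit count out of a reply page. Returns the count as a number,
# or undef when the pattern is not found. The default pattern is a
# made-up example, not one used by any DBA interface.
sub parse_hits {
    my ($page, $pattern) = @_;
    $pattern ||= qr/([\d,]+)\s+(?:records?|hits?|documents?)\s+found/i;
    return unless defined $page and $page =~ $pattern;
    (my $count = $1) =~ tr/,//d;       # "1,234" becomes "1234"
    return $count + 0;
}
```

An undefined result means the page did not look the way the interface
expected, which is worth reporting as an error rather than as zero hits.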
Each type starts off by forking off a number of database searches (depending
on how many databases are offered by that vendor) so the engine can
proceed with running other database interfaces. Each search then retrieves
the information (via Z39.50, telnet, or web) and returns it to the
engine via an interprocess communication. After it sends the message, the
process dies. If the database limits concurrent users, then the interface
should also properly log out of the database.
Each process has its own timeout method as well. The time for timeout
is taken from $main'timeout (which can be specified by the user). This way,
no process runs forever.
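One common way to give a Perl process its own timeout is alarm together
with eval. The sketch below illustrates that technique. It is an assumption
about how such a timeout could be written, not a copy of what the DBA
interfaces do, and with_timeout is a hypothetical name. The variable
$timeout plays the role of $main'timeout (spelled $main::timeout in
current Perl syntax).

```perl
use strict;
use warnings;

our $timeout = 30;    # plays the role of $main'timeout, in seconds

# Run $search and give up after $seconds. Returns whatever $search
# returns, or undef if the alarm fired first.
sub with_timeout {
    my ($seconds, $search) = @_;
    my $value;
    my $finished = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $value = $search->();
        alarm 0;
        1;
    };
    alarm 0;                              # never leave an alarm pending
    return $value if $finished;
    die $@ unless $@ eq "timeout\n";      # other errors still propagate
    return;                               # timed out
}

# Typical use inside a search process ($run_search is hypothetical):
#   my $hits = with_timeout($timeout, $run_search);
```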
---------------
D. The Profiles
---------------
The profiles are an important part of Database Advisor. In them is the
information that the user needs to see, as well as information DBA needs
in order to complete the queries. The important field in the profiles
(to DBA) is the "DBA URL" field. It contains the URL which the database
interface will use to start the search query. This is useful if the interface
runs several databases from the same vendor, some of which require different
URLs.
--------------------
E. The Subject Files
--------------------
The subjects.db file lists the various subjects as well as the databases
which are considered in that subject. To facilitate multi-library use, the
subjects.db file has a library id which is declared before every database
name. For formatting details, consult the subjects.db file. This file is
then parsed and put into an associative array by library and by subject.
This array is then used in the engine and the database interfaces to determine
which databases the user wants searched.
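The parsing step can be pictured with the sketch below. The line format it
reads ("Subject = libid:Database, libid:Database") is INVENTED for this
example, as are the database names in the usage comment. The real layout is
whatever the distributed subjects.db uses. Only the resulting structure, an
associative array keyed by library and then by subject, follows the
description above.

```perl
use strict;
use warnings;

# Parse lines in a HYPOTHETICAL format:
#     Subject = libid:Database, libid:Database
# Returns a hash ref: { libid => { subject => [ database, ... ] } }
sub parse_subjects {
    my (@lines) = @_;
    my %by_library;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#.*)?$/;          # blank or comment line
        my ($subject, $list) = $line =~ /^\s*(.+?)\s*=\s*(.*?)\s*$/
            or next;
        for my $entry (split /\s*,\s*/, $list) {
            my ($libid, $database) = $entry =~ /^(\w+):(.+)$/ or next;
            push @{ $by_library{$libid}{$subject} }, $database;
        }
    }
    return \%by_library;
}

# Example:
#   my $subjects = parse_subjects("Physics = sci:INSPEC\n");
#   $subjects->{sci}{Physics} is then [ 'INSPEC' ]
```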
--------------------------------------------------------------------------------
Section VI:
How to Add a Database Interface:
--------------------------------
If you cannot find the database interface you are looking for on the web,
then you can attempt to create one on your own. The easiest way to accomplish
this is to take a look at the existing interfaces and model the new one on
them.
Once you have either created a new interface or added an extra fork
to an existing interface, you need to create a profile for it. To avoid
parsing errors, please copy an existing profile and edit it to suit your
database.
Then you will need to edit the subjects.db file to include your profile
under one or more of the subject headings found there. Note: If you want
it to show up in the All Subjects category, you must include it there
as well. DBA will not search through the subjects for non-declared databases.
Once this is done, add the interface (yourdbname.pl) to the list of Database
Interface Includes. This will allow DBA to "see" the interface file. If the
file doesn't compile, then DBA will crash as well (you will most likely get
a "Document Contains no Data" error).
After you have required the file, you will need to go to where the engine
starts the child processes and add an execute function line. If your interface
has multiple database accesses within it, you will need to pass the associative
variable %databases to it. %databases is a list of all the profile names under
the subject(s) the user defined. If your interface just searches one database,
you might want to run an "if" statement to see if your profile was defined.
After you do this, you are done: the database has been added.
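The last step might look something like the sketch below. Every name in it
(yourdbname_execute, vendor_execute, start_searches, and the profile names)
is hypothetical, and the stubs only return a string where a real interface
would fork a search. It also assumes %databases maps profile names to a
true value.

```perl
use strict;
use warnings;

# Stub standing in for a single-database interface file. In nph-dba the
# file itself would be pulled in with a line such as:
#     require 'yourdbname.pl';
sub yourdbname_execute {
    my ($query) = @_;
    return "yourdbname: searching for '$query'";
}

# Stub for a multi-database interface. It receives %databases and decides
# for itself which of its databases were selected.
sub vendor_execute {
    my ($query, %databases) = @_;
    my @started;
    for my $profile ('VendorDbOne', 'VendorDbTwo') {
        push @started, "$profile: searching for '$query'"
            if $databases{$profile};
    }
    return @started;
}

# Where the engine starts its child processes.
sub start_searches {
    my ($query, %databases) = @_;
    my @started;
    # Single-database interface: run only if its profile was selected.
    push @started, yourdbname_execute($query) if $databases{'YourDbName'};
    # Multi-database interface: hand it the whole selection.
    push @started, vendor_execute($query, %databases);
    return @started;
}
```

Because a file that fails to compile takes DBA down with it, checking the
new interface with "perl -c yourdbname.pl" before adding the require line
is a cheap safeguard.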
--------------------------------------------------------------------------------
Section VII:
Error Messages in Database Interfaces:
--------------------------------------
Often, a database will have a remote server error, or the interface will parse
the data incorrectly. Both result in an error. If a connection fails between
DBA and the host computer, then that is considered a Remote Server Error. If
the data is parsed incorrectly, that is considered a Local Server Error, which
means that the DBA host can (and should!) fix it.
When the database interface finds the number of hits, it adds 10
(or whatever is in $main'baseReturnValue) to this value. Adding 10
leaves the values below 10 free for special return types, which are used for
errors. These values are used as indexes into an array of error strings. For
now, 1 is reserved for Local Errors, 2 is for Remote Errors, 4 is for
Timeouts, and 6 is for Too many Users.
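The arithmetic is easy to get wrong by one, so here is a small sketch of the
encoding. The function names and the wording of the error strings are
illustrative. Only the base value of 10 and the reserved indexes 1, 2, 4,
and 6 come from the description above.

```perl
use strict;
use warnings;

our $baseReturnValue = 10;    # plays the role of $main'baseReturnValue

# Error strings indexed by return value. The wording is illustrative.
my @error_strings;
$error_strings[1] = 'Local Error';
$error_strings[2] = 'Remote Error';
$error_strings[4] = 'Timeout';
$error_strings[6] = 'Too many Users';

# What an interface sends for a successful search.
sub encode_hits {
    my ($hits) = @_;
    return $hits + $baseReturnValue;
}

# What the engine makes of a received value: ('hits', n) or
# ('error', message).
sub decode_value {
    my ($value) = @_;
    if ($value >= $baseReturnValue) {
        return ('hits', $value - $baseReturnValue);
    }
    my $text = $error_strings[$value];
    return ('error', defined $text ? $text : "Unknown error $value");
}
```

With these definitions a search that finds nothing is sent as 10, which
keeps "zero hits" distinct from every error value.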
If the DBA support team finds other errors (such as Service Unavailable),
then they can add errors themselves and use alternate return values in the
interface code.
--------------------------------------------------------------------------------
Section VIII:
The Message Queue:
------------------
The message queue is what ties the child processes (database interfaces) to the
parent process (DBA engine). After the database interface has attempted to
retrieve the number of hits on a user's query, it returns the name of the
database and the results (which are padded by $main'baseReturnValue to account
for errors) using "pack". Then, it uses "msgsnd" to send the message back to
the parent process where it is unpacked and displayed.
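The packing half of that exchange can be sketched as follows. The pack
template, the message type, and the function names are assumptions, since
this README does not give the layout DBA uses. What is not an assumption is
that Perl's msgsnd expects the message to begin with a native long message
type, which is why the outer template is "l! a*".

```perl
use strict;
use warnings;

our $baseReturnValue = 10;    # plays the role of $main'baseReturnValue
my $MESSAGE_TYPE = 1;         # hypothetical message type

# Child side: build the message for one database.
sub pack_result {
    my ($name, $hits) = @_;
    # Body layout (an assumption): 32-bit padded result, then the name.
    my $body = pack('N a*', $hits + $baseReturnValue, $name);
    # msgsnd needs the native long message type in front of the body.
    return pack('l! a*', $MESSAGE_TYPE, $body);
}

# Parent side: take a received message apart again. Returns the name
# and the still-padded result value.
sub unpack_result {
    my ($message) = @_;
    my ($type, $body)  = unpack('l! a*', $message);
    my ($value, $name) = unpack('N a*', $body);
    return ($name, $value);
}

# In the child the message would then be sent with something like
#     msgsnd($queue_id, pack_result($name, $hits), 0);
# and read in the parent with msgrcv before calling unpack_result.
```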
--------------------------------------------------------------------------------
Section IX:
Signal Handling:
----------------
There is an important piece of code in DBA called "handler". This handles
the signals that arrive when the user interrupts the request. If the user
halts the browser before the message queue is closed, then an orphaned
message queue will be left behind. If left untended, this can result in a
complete breakdown of DBA (the user will hit submit, and they will get only
the first push).
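A handler of this kind can be sketched as below. This is not the "handler"
code from nph-dba. The function names, the choice of signals, and the use of
a cleanup callback are assumptions. In DBA the cleanup would presumably
remove the message queue, for example with msgctl($queue_id, IPC_RMID, 0),
where IPC_RMID comes from the IPC::SysV module.

```perl
use strict;
use warnings;

# Wrap a cleanup callback so that it runs at most once, even if
# several signals arrive.
sub make_handler {
    my ($cleanup) = @_;
    my $already_ran = 0;
    return sub {
        my ($signal) = @_;
        return if $already_ran++;
        $cleanup->($signal);
    };
}

# Install the handler for the given signals (default list is an
# assumption) and exit once the cleanup has run.
sub install_handler {
    my ($cleanup, @signals) = @_;
    @signals = qw(INT TERM HUP PIPE) unless @signals;
    my $handler = make_handler($cleanup);
    for my $name (@signals) {
        $SIG{$name} = sub { $handler->(@_); exit 0 };
    }
    return $handler;
}

# Typical use, early in the engine ($remove_queue is hypothetical):
#   install_handler($remove_queue);
```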
--------------------------------------------------------------------------------
Section X:
Running Multiple Libraries off the Same Engine:
-----------------------------------------------
For expansion purposes, Database Advisor was created so it can run multiple
request types if need be. If, for instance, there was one Library that wanted
Database Advisor for science databases and another which wanted humanities
subjects covered, both could be accommodated using only one copy of Database
Advisor.
The change is made in the subjects.db file. This is the file which stores all
the various subjects and what databases fall under them. By adding a new
Library ID, a library can set up their own specifications for Database Advisor.
Then, a new DBA HTML page must be created with the new subjects and with the
hidden element "libid" set to the new Library ID.
When the search is started using this new HTML page, the engine will select the
databases which correspond to the "libid". Thus, you can have various subjects,
and various *sets* of subjects.
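In code, the selection amounts to a two-level lookup: first by "libid", then
by each subject the user picked. The sketch below assumes the parsed
subjects are held as a hash keyed by library id and then by subject. The
function name select_databases and the database names in the example are
hypothetical.

```perl
use strict;
use warnings;

# Return the set of databases (as a hash of name => 1) that one library
# lists under the chosen subjects.
#   $by_library - hash ref: { libid => { subject => [ database, ... ] } }
sub select_databases {
    my ($by_library, $libid, @subjects) = @_;
    my %databases;
    my $library = $by_library->{$libid} or return %databases;
    for my $subject (@subjects) {
        $databases{$_} = 1 for @{ $library->{$subject} || [] };
    }
    return %databases;
}

# Example with made-up data:
#   my $subjects = {
#       sci => { Physics => ['INSPEC'] },
#       hum => { History => ['HistAbs'] },
#   };
#   my %databases = select_databases($subjects, 'sci', 'Physics');
#   %databases now holds only INSPEC.
```

Using a hash for the result means a database listed under two selected
subjects is still searched only once.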
-------------------------------------------------------------------------------
Section XI:
Passwords
---------
The file dbaPasswd.pl is available to house usernames and passwords for
those databases which require them. This version of DBA includes
database interfaces which require passwords for ASFA, METADEX and SocAbs.
Calls to these interfaces have been disabled by commenting out calls to
them from nph-dba. To include them in your system, edit dbaPasswd.pl to
include your usernames and passwords and uncomment the require and
execute calls in nph-dba.
--------------------------------------------------------------------------------
Appendix:
Authors and Contact Points:
---------------------------
Authorship of Database Advisor (in chronological order):
Neil Spring
- DBA Engine Author, Z39.50 Connection Interfaces
Greg Kogut
- Telnet Database Interfaces
Scott Petersen
- Web Database Interfaces, Implementation, Documentation
Maintenance and Contact Point
UCSD Science Library Web Programmers (techies@scilib.ucsd.edu)
Project Management and Supervision
Christy Hightower, UCSD Science Librarian chightow@ucsd.edu
Database Advisor Interface Design Team
Christy Hightower, UCSD Science Librarian
Jennifer Reiswig, UCSD Biomedical Librarian
Susan Berteaux, Scripps Inst. of Oceanography Librarian
--------------------------------------------------------------------------------