Nolce (Netscape's Off Line Cache Explorer) is a Linux program that lets you browse Netscape Navigator cache files off-line by restoring their original names and fixing their links.
Introduction |
Every Netscape Navigator user probably knows that the browser saves almost
every file it downloads from the Internet on the local hard disk, unless the
user has disabled this option. HTML files, images, and other downloaded
documents are normally stored under the directory $HOME/.netscape/cache.
You might like to view those downloaded documents off-line as well, read them
at leisure, and possibly save them together with their images.
This isn't immediately possible, because files stored in the cache have their
names changed: an original main.html may become cache33BAD64001B0829.html.
Besides, they are stored under the cache directory in subdirectories such as
00, 01, ... with no regard for the relative positions of the original files.
So even if you could guess which cached file corresponds to the document you
want, you would see it without any images and with none of its links working.
Saving a document from Netscape after receiving it doesn't save the related
images and links, so only the textual part of the document is visible when
you are off-line.
You might expect Netscape to retrieve the missing images from the cache in
this situation, but it doesn't: before using a cached file, it tries to
connect to the original site to check whether the remote file is more recent
than the local one. Since this check isn't possible when you're off-line, the
local file isn't used.
Usage and what the program does |
The file index.db under the Netscape cache directory contains the information
needed to associate cached files with their original names, sizes, creation
dates, file types and so on. Netscape creates it when the first documents
are cached.
Nolce must not run while Netscape is running, because the file index.db
may be damaged if two programs open it at the same time. To avoid problems,
nolce uses and recognizes the same lock file as Netscape, so when one of
the two programs is running, the other knows that it can't use the cache.
The lock file is a symbolic link called lock, created in the directory
$HOME/.netscape.
(However, from version 1.8-1 the option -l may be used to ignore the lock
file; see below.)
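The lock check described above can be illustrated with a minimal Python sketch. This is not nolce's own code (nolce is written in C); the function name cache_is_locked and the demo directory are hypothetical. The sketch tests for the link itself rather than for its target, on the assumption that the link's target need not be an existing file.

```python
import os
import tempfile


def cache_is_locked(netscape_dir):
    """Return True if a symbolic link named 'lock' exists in netscape_dir."""
    lock_path = os.path.join(netscape_dir, "lock")
    # islink() looks at the link itself, so it is also true for a link
    # whose target does not exist (an assumption about how the lock looks).
    return os.path.islink(lock_path)


# Demo in a throwaway directory standing in for $HOME/.netscape.
with tempfile.TemporaryDirectory() as demo_dir:
    print(cache_is_locked(demo_dir))  # no lock has been created yet
    os.symlink("placeholder-target", os.path.join(demo_dir, "lock"))
    print(cache_is_locked(demo_dir))  # the lock link is now present
```

A program that honours the lock would refuse to open index.db while this function returns True, unless told to ignore the lock.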
With this information, nolce can copy the cached files into a new directory
structure under dest_dir (default is $HOME/cached) that mirrors the directory
structure of each file's original site, restoring their real names.
For example, if 00/cache33BAD64001B0829.html corresponds to a URL like
http://www.rai.it/raiuno/aree.html, the program creates the directory
www.rai.it, then the directory raiuno under it, and finally copies
cache33BAD64001B0829.html to aree.html inside that directory.
A summary file is created in HTML format, so once the program finishes
you can easily see which HTML documents it retrieved and browse them.
When viewing retrieved documents, links shown in italics point to other
cached files, so you can follow them off-line too.
Note that some fixed fonts may render italics as bold.
Copied HTML files are slightly modified when necessary; this is discussed in the section How it works.
Nolce doesn't change the original Netscape cache in any way, and the cache continues to work normally.
From version 1.5, nolce can also process caches generated by Netscape
for Windows, with the option -p.
Let's now look at how to use nolce.
First of all, you can get a short help message by launching it with --help,
and this is what you get:
The n_hours parameter is very useful when you want to process only
the files downloaded during the last connection.
dest_dir is the directory under which the directory structures will be created.
The program distinguishes between http:// and ftp:// documents, putting the
former under a subdirectory http of dest_dir and the latter under ftp.
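Putting the www.rai.it example together with the http/ftp split, the destination path of a cached file could be computed roughly as in the Python sketch below. It only illustrates the layout described in the text and is not taken from nolce's sources; the function name local_path and the /home/user/cached directory are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url, dest_dir):
    """Map a URL to dest_dir/<scheme>/<host>/<path>, as the text describes."""
    parts = urlsplit(url)
    # The URL path starts with "/", which would make join() discard the
    # preceding components, so strip it first.
    relative = parts.path.lstrip("/")
    return posixpath.join(dest_dir, parts.scheme, parts.netloc, relative)


print(local_path("http://www.rai.it/raiuno/aree.html", "/home/user/cached"))
# prints /home/user/cached/http/www.rai.it/raiuno/aree.html
```

URLs without a file name and host or path components containing unusual characters need extra handling, which this sketch leaves out.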
summary_file is always created in dest_dir, even if you supply an
absolute path. If the summary file exists, it is not overwritten; new entries
are appended to it.
summary_file contains an entry for every HTML file processed.
summary_file
.
-m
option, missing
images are kept.
sub_string (options -g or -G) is case sensitive.
-p must be used if the cache to be processed was generated by Netscape for
Windows. In this case the index file is assumed to be named fat.db and file
names are all converted to lower case, as DOS files appear when viewed from
Linux.
With -l, the cache is processed even if there is a lock file in
$HOME/.netscape. This is useful when the cache specified with -c isn't the
one used by the running Netscape, or when Netscape isn't installed. Use with
care, and don't launch several copies of nolce on the same directory.
nolce -smc /cache is the same as nolce -s -m -c /cache or nolce smc/cache.
i.
It seems that Netscape doesn't keep in the cache HTML files whose
modification time it couldn't determine, even if the related images are
saved. Sometimes the percentage of such files is low, but sometimes it's
about 50% of the total, so this may be a serious problem; it can, however,
be worked around with a small trick.
Netscape initially saves those files and registers them in the cache index,
but when it exits, it checks for HTML files whose modification time it
doesn't know and deletes them. So the only way to keep these files is to
kill Netscape abruptly when you finish browsing.
When we close Netscape with Ctrl-C from the shell or, worse, by
choosing `Exit' from its menu, the browser has all the time it needs to do
the cache cleaning we want to avoid. If we kill it with the SIGKILL
signal, its execution ends immediately, because there is no way to
catch or handle that signal.
The command to give is:
kill -s 9 `pidof netscape`
where `pidof netscape` is a way to obtain the process identifier of Netscape
(see also the command ps).
Alternatively:
kill -s 9 PID
where PID is the process ID of your Netscape.
A browser killed with SIGKILL can't delete its lock file, so you must then remove it yourself:
rm $HOME/.netscape/lock
A simple shell script can automate this procedure. For example, in a single-user environment, create, somewhere in your path, a file called (for example) nk with this content:
#!/bin/sh
kill -s 9 `pidof netscape`
rm $HOME/.netscape/lock
Then execute chmod +x on it and you're done.
Note that if you kill Netscape to retrieve at-risk documents, nolce
must be launched before the next Netscape session, at the end of which the
browser will do the cache cleaning it couldn't do in the previous one.
For this reason, it's not advisable to use the -k switch, that is, to use
symbolic links for non-HTML documents.
ii.
You may not find everything you expect in the cache, even using the previous tip. Sometimes certain documents or
images aren't saved, for no apparent reason.
In any case, it's better to press the STOP button before leaving
a page that hasn't completely loaded.
Some images, typically counters generated at run time by cgi-bin programs,
aren't saved at all.
index.db
rather than modification time of
the file. This way is faster and better.
dest_dir
and adjusts their links) this file also, even if it won't
appear in the summary file. If you don't want this, the option -f may be used.
This option has the same meaning in conjunction with -g and -G.
If neither the -w nor the -W option is given, pages are displayed in another
browser window. Normally the other window is created once and then, as long
as the user doesn't close it, reused every time a document is selected.
With -W the document is viewed in the list frame, allowing easy selection of
other domains and other documents.
Finally, with -w, the entire index window is used for viewing pages.
Selecting Lists & domains or Simple List from the status frame, one can return immediately to the index of processed pages, but in the first case the default layout (domains + list) is used, while in the second the list area takes all the space below the status frame.
-p
option.
msdos
rather than vfat
because in the first case access is faster and
file names aren't case sensitive.
Installation |
This software is available in a package containing both source and binary versions.
It can be obtained at
ftp://sunsite.unc.edu/pub/Linux/apps/www/plugins and at
https://members.tripod.com/~giustrov/download.html
To use this program, you must have the DB library installed.
It is needed to read the records stored in the index.db file.
In practice you need libdb.so to run the compiled version, and also the db
include files to compile the program.
On Linux, with the Slackware and Redhat distributions, the library should be
present by default.
For the include files, on Redhat you must install a package called db-devel
or similar. On Slackware they are in libc.tgz, so they aren't a problem.
To compile, cd to the src subdirectory and run make.
Run make install to compile and copy the executable to /usr/bin, the man
page to /usr/man/man1 and the documentation to /usr/doc/nolce-VERSION.
If the standard destinations don't suit you, modify them in the Makefile.
Compatibility |
I have tested the program under Linux only, with Netscape Navigator 3.01,
4.0b5, and 4.03.
It probably works with version 2.0 as well, since the present cache format
was introduced with that release.
It should also work on other Unix systems, if their Netscape indexes its
cache in the same way as the Linux version, that is, with a DB hash file
named index.db under $HOME/.netscape/cache.
If the name is different, it's easy to
change the value of CACHE_FILE in the defines section of the source file.
As for the language, I use only code conforming to the ANSI C or
POSIX standards, so if your system supports them, there should be no
problems.
As far as I know, the following circumstances may cause problems or errors
when compiling nolce:
make correctly defines the variable CC as your site's compiler name
(e.g. cc or gcc).
Every make should ensure this; if yours doesn't, define it by hand.
flex -l
. This is what happens on some Slackware systems, where flex calls the real
program flex.slk with the -l option. The result is a segmentation fault error
when nolce is executed.
-Darray
to DEFINES
in the Makefile (see below),
solves the problem.
LEX=flex
is present in the Makefile. On non Linux systems, this probably should be changed.
-lfl
library, and it's provided in the variable
LDFLAGS
of the Makefile.
LFLAGS
variable.
yylex()
function, called in the process_html_file
of main.c
.
Input and output files are supplied to yylex through the extern variables
yyin and yyout. This probably doesn't conform to the original AT&T lex but,
as far as I know, it does conform to the POSIX specification for lex and,
above all, it's almost the only method that can be used with flex.
yytext
as a char pointer, while other lex implementations may define it as a
char array. If this is your case, you must compile main.c with the -Darray
option, which can be done by setting the variable DEFINES in the Makefile.
If you discover a bug, e.g. an abnormal exit of the program with a Segmentation Fault error, please let me know. Send me an e-mail with a brief
description of the circumstances under which the error happened, the command
line options, and above all the core file generated by the program (compress it to keep the message small).
Shells let you decide whether you want a core dump after a program
terminates abnormally. With bash, see the command ulimit.
For the core file to be useful to me, it must be generated by a program
compiled with debug info: add the option -g3 to CFLAGS in the Makefile.
If you have libg installed, also add -lg to LDFLAGS.
However, before sending the core file, the plain output of gdb may be enough.
In case of problems, compile nolce with debug info, launch it from
the debugger, and when execution stops with the error, give the command bt
inside gdb and send me the information displayed.
How it works |
Many URLs, e.g. http://home.netscape.com, don't contain an HTML file name.
In this situation the server provides a default HTML file, usually
index.html, and nolce appends this same name to these URLs.
An HTML file may contain a link to such a URL with the file name written out
explicitly. If this name is different from index.html, the link doesn't
work.
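The default-name rule can be sketched in Python as follows. This is only an illustration of the behaviour described above, not nolce's C code; the function name with_default_name is hypothetical, and the sketch assumes that a URL lacks a file name exactly when its path is empty or ends with a slash.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_NAME = "index.html"  # the name the text says nolce appends


def with_default_name(url):
    """Append index.html to a URL whose path names no file."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: "no file name" means an empty path or a trailing slash.
    if path == "" or path.endswith("/"):
        if not path.endswith("/"):
            path += "/"
        parts = parts._replace(path=path + DEFAULT_NAME)
    return urlunsplit(parts)


print(with_default_name("http://home.netscape.com"))
print(with_default_name("http://www.aaaa.com/bbb/page.html"))
```

The first call yields http://home.netscape.com/index.html and the second leaves its URL unchanged. If the server's real default file had another name, a link spelling out that name would not match the stored index.html, which is the failure described above.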
ii. LINKS
The main work nolce does is changing links in HTML files to point to local
files.
There are various types of links (imagine you're browsing the document
http://www.aaaa.com/bbb/index.html
):
HREF="ccc/image.gif"
. In this case the browser loads
the file image.gif
from the directory ccc
under
bbb
.
HREF="http://www.aaaa.com/ccc/image.gif". In this
case Netscape will always try to obtain the document from the net, so
nolce transforms the link into something like "../ccc/image.gif".
HREF="/ccc/image.gif". These links must be
interpreted as http://www.aaaa.com/ccc/image.gif, regardless of the
directory in which the HTML file is located.
If a link points to a document present in the cache, it is changed to a relative link; otherwise it is turned into an absolute link.
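The three link forms and the final rule can be illustrated with a short Python sketch. It is not nolce's actual implementation (nolce does this work in a lex scanner); the function rewrite_link and the set cached_urls are hypothetical, and the sketch assumes the local layout scheme/host/path described in the Usage section.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def site_path(url):
    """Hypothetical local layout: /<scheme>/<host>/<path>."""
    parts = urlsplit(url)
    return "/" + parts.scheme + "/" + parts.netloc + parts.path


def rewrite_link(page_url, href, cached_urls):
    """Return a relative link if the target is cached, else an absolute URL."""
    # urljoin resolves relative, absolute and server-relative links alike.
    target = urljoin(page_url, href)
    if target not in cached_urls:
        return target
    page_dir = posixpath.dirname(site_path(page_url))
    return posixpath.relpath(site_path(target), start=page_dir)


page = "http://www.aaaa.com/bbb/index.html"
cached = {"http://www.aaaa.com/ccc/image.gif"}
print(rewrite_link(page, "http://www.aaaa.com/ccc/image.gif", cached))
print(rewrite_link(page, "/ccc/image.gif", cached))
print(rewrite_link(page, "/ddd/other.gif", cached))
```

The first two calls both give ../ccc/image.gif, and the third, whose target is not in the cache, gives the absolute URL http://www.aaaa.com/ddd/other.gif.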
iii. LEX
If your lex program is GNU flex, the flag -Cf may be given to it (put it in
the variable LFLAGS of the Makefile). This makes the program bigger, but
execution speeds up by 10-15%.
iv. MISCELLANEOUS
In nolce.h there are some defines which can be customized.
<h3>Link</h3>
, the italics isn't shown.
`?'
. Mainly for this reason, when creating
directories, unusual characters such as `?', `=', `(' and so on are replaced
with an underscore.
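The character substitution can be sketched in Python as below. The text names only `?', `=' and `(' followed by "and so on", so the exact set used by nolce is not known here; the closing parenthesis and `&' in this sketch are assumptions, as is the function name safe_dir_name.

```python
import re

# Assumption: only ?, = and ( are named in the text; ) and & are guesses.
UNSAFE_CHARS = re.compile(r"[?=()&]")


def safe_dir_name(name):
    """Replace each unusual character with an underscore."""
    return UNSAFE_CHARS.sub("_", name)


print(safe_dir_name("cgi-bin?id=3(a)"))
# prints cgi-bin_id_3_a_
```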
Contacting the author |
For any questions, bug reports or comments, send e-mail to g.trovato@usa.net
My home page is
https://members.tripod.com/~giustrov
Nolce web page is:
https://members.tripod.com/~giustrov/nolce.html