README for nolce

Nolce (Netscape's Off Line Cache Explorer) is a Linux program that allows off-line navigation of Netscape Navigator cache files by adjusting their names and links.

Introduction

Every Netscape Navigator user probably knows that the browser saves almost all files downloaded from the Internet on the local hard disk, unless the user has disabled this option. HTML files, images, and downloaded documents are normally stored under the directory $HOME/.netscape/cache.

One might like to view those downloaded documents off-line as well, read them at leisure, and possibly save them with their related images. But this isn't immediately possible, because files stored in the cache have their names changed: for example, an original main.html may become cache33BAD64001B0829.html. Besides, they are stored under the cache directory in subdirectories like 00, 01, ... with no regard for the relative positions of the original files.
So even if you could guess which cached file corresponds to the document you want, you would see it without any images and with none of its links working.

Saving a document from Netscape after receiving it doesn't save the related images and links, so you can see only the textual part of the document when you are off-line.
One might think that in this situation Netscape could retrieve the missing images from the cache, but it doesn't: before using a cached file, it tries to connect to the original site to check whether the remote file is more recent than the local one. When you're off-line this check isn't possible, so the local file isn't used.

Usage and what the program does

The file index.db under the Netscape cache directory contains the information needed to associate cached files with their original names, sizes, creation dates, file types and so on. It is created by Netscape when the first documents are cached.
Nolce must not run while Netscape is running, because the file index.db may be damaged if two programs open it at the same time. To avoid problems, nolce uses and recognizes the same lock file as Netscape, so when one of the two programs is running, the other knows that it can't use the cache.
The lock file is a symbolic link called lock, created in the directory $HOME/.netscape .
(However, from version 1.8-1 the option -l may be used to ignore the lock file; see below.)
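
The lock check described above can be sketched in a few lines of Python. This is only an illustration, not nolce's actual code (nolce is written in C); the function name cache_is_locked and its netscape_dir parameter are made up for the example.

```python
import os


def cache_is_locked(netscape_dir):
    """Return True if a symbolic link called lock exists in netscape_dir.

    Hypothetical helper that only mirrors the check described in the text.
    os.path.islink() is used rather than os.path.exists() so that the
    link is detected even if its target does not exist.
    """
    return os.path.islink(os.path.join(netscape_dir, "lock"))


if __name__ == "__main__":
    netscape_dir = os.path.join(os.path.expanduser("~"), ".netscape")
    if cache_is_locked(netscape_dir):
        print("lock found: the cache is in use")
    else:
        print("no lock: the cache can be processed")
```

A program following this convention would refuse to touch the cache while the link is present, unless told to ignore it as nolce is with -l.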

With this information nolce can copy those files into a new directory structure under dest_dir (default is $HOME/cached) which reflects the directory structure of the file's original site, restoring their real names.

For example, if 00/cache33BAD64001B0829.html corresponds to a URL like http://www.rai.it/raiuno/aree.html, the program creates the directory www.rai.it, then under it the directory raiuno, and finally copies cache33BAD64001B0829.html to aree.html under it.
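
The mapping from a URL to a destination path could be sketched in Python as follows. It illustrates the idea only and is not taken from the nolce sources; the function name local_path is hypothetical, and the http or ftp subdirectory follows the description given under Some considerations.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url, dest_dir):
    """Build the path under dest_dir where a cached URL would be copied.

    Hypothetical helper: the scheme (http or ftp), the host name and the
    URL path become nested directories; the last component is the file.
    """
    parts = urlsplit(url)
    return posixpath.join(dest_dir, parts.scheme, parts.netloc,
                          parts.path.lstrip("/"))


print(local_path("http://www.rai.it/raiuno/aree.html", "/home/user/cached"))
```

With the example URL above and /home/user/cached as dest_dir, this prints /home/user/cached/http/www.rai.it/raiuno/aree.html.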

A summary file is created as an HTML file, so once the program finishes one can easily see which HTML documents it retrieved and browse them.
When viewing retrieved documents, links shown in italics are links to other cached files, so you can view them off-line too.
Note that some fixed fonts may render italics as bold.

Copied HTML files are slightly modified when necessary; this is discussed in the section HOW IT WORKS.

Nolce doesn't change in any way the original Netscape cache, which continues to work normally.

From version 1.5, nolce can also process caches generated by Netscape for Windows, with the option -p.

Let's now talk about how to use nolce. First of all, you can get a short help text by launching it with --help; this is what you get:

  Usage: nolce [n_hours] [OPTIONS]...
  Reads Netscape Navigator (ver. 2 and above) cache files created in the
  last n_hours hours and copies them in a new directory adjusting file names
  and links to permit an off-line navigation of them.
  If n_hours isn't supplied, all cached files are processed.

  Options:
    -c cache_dir     directory where cache is, default $HOME/.netscape/cache
    -d dest_dir      directory where files are copied, default $HOME/cached
    -g sub_string    process only files whose URL contains sub_string
    -G sub_string    process only files whose URL doesn't contain sub_string
    -i summary_file  file name of summary, it will be created in dest_dir,
                     default is index.html
    -w               show pages in the same index window
    -W               show pages in the list frame
    -s               execute silently
    -m               don't eliminate missing images
    -t               put downloading date of documents in summary file
    -f               don't process links not satisfying initial conditions
    -p               cache is generated by Netscape for Windows
    -l               ignore lock files (use with attention: see docs)
    -k               make symbolic links for non html files
    --help           shows this help

Some considerations

Giving the n_hours parameter is very useful when you want to process only the files downloaded during the last connection.
dest_dir is the directory under which the directory structures will be created. The program distinguishes between http:// and ftp:// documents, putting the former under a subdirectory http of dest_dir and the latter under ftp.
summary_file is always created in dest_dir, even if you supply an absolute path. If the summary file already exists, it is not overwritten; new entries are appended to it.
summary_file contains an entry for every HTML file processed.
To avoid confusion, if a page contains frames, the individual frames are not listed in summary_file.
By default, missing images are removed entirely from the HTML file, so one doesn't see the Netscape icon indicating them. With the -m option, missing images are kept.
sub_string (options -g or -G) is case sensitive.
Option -p must be used if the cache to be processed was generated by Netscape for Windows. In this case the name of the index file is assumed to be fat.db and file names are all converted to lower case, as DOS file names appear when viewed from Linux.
With -l, the cache is processed even if there is a lock file in $HOME/.netscape . This is useful when the cache specified with -c isn't the one used by the running Netscape, or when Netscape isn't installed. Use with care, and don't launch more than one copy of nolce on the same directory.
Starting from version 1.7, command line switches, except n_hours, may be grouped: nolce -smc /cache is the same as nolce -s -m -c /cache or nolce smc/cache.

Important notes

i.
It seems that Netscape doesn't keep in the cache HTML files whose modification time it couldn't determine, even if the related images are saved. Sometimes the percentage of such files is low, but sometimes it's about 50% of the total, so this may be a serious problem, which, however, can be worked around with a small trick.
In fact, Netscape initially saves those files and registers them in the cache index, but when it exits, it checks whether there are HTML files whose modification time it doesn't know and deletes them. So the only way to keep these files is to kill Netscape abruptly when one finishes browsing.
When we close Netscape with Ctrl-C from the shell, or, worse, by choosing `Exit' from its menu, the browser has all the time it needs to do the cache cleaning we want to avoid; but if we kill it with the SIGKILL signal its execution ends immediately, because there is no way to catch and handle that signal.
The command to give is:

  kill -s 9 `pidof netscape`

where `pidof netscape` is a way to obtain the process identifier of Netscape (see also the command ps).
If more than one copy of Netscape is running, the above command will kill all of them, so it's better to use:

  kill -s 9 PID

where PID is the process ID of your Netscape.

When the browser is killed with SIGKILL it can't delete the lock file, so it's necessary to do a

  rm $HOME/.netscape/lock

A simple shell script can automate this procedure. For example, in a single-user environment, create, somewhere in your path, a file called (for example) nk with this content:

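  #!/bin/sh
  # Kill every running Netscape at once, so it can't clean the cache on exit.
  kill -s 9 `pidof netscape`
  # SIGKILL leaves the lock file behind: remove it by hand.
  rm -f "$HOME/.netscape/lock"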

then run chmod +x on it and you're ready.

Note that if you kill Netscape to retrieve at-risk documents, nolce must be launched before the next run of Netscape, at the end of which the browser will do the cache cleaning it couldn't do in the previous run.
For this reason, it's not advisable to use the -k switch, that is, to use symbolic links for non-HTML documents.

ii.
You may not find everything you expect in the cache, even using the previous tip. Certain documents or images sometimes aren't saved, for no apparent reason.
In any case, it's better to press the STOP button before leaving a page that hasn't loaded completely.
Some images, typically counters provided at run time by cgi-bin servers, aren't saved at all.

About parameters n_hours, -g and -f

When the n_hours option is given, only HTML files downloaded within the last n_hours hours are processed. Starting from version 1.5, the time check uses the information in index.db rather than the modification time of the file. This way is faster and better.
The time check is made only on HTML documents. Everything else, that is images, zip files and so on, is always valid.
If a document that satisfies the n_hours condition has a link to another document which is in the cache, but was downloaded more than n_hours hours ago, nolce processes this file also (that is, copies it under dest_dir and adjusts its links), even though it won't appear in the summary file. If one doesn't want this, the option -f may be used. This option has the same meaning in conjunction with -g and -G.
In the messages shown during nolce execution, files that satisfy the n_hours (or -g, or -G) condition are called main HTML files, the others related HTML files.
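
The time check can be pictured with the following Python sketch. It is an illustration under stated assumptions, not nolce's code: the function split_by_age, the (url, fetch_time) pairs and the use of seconds since the epoch are all invented for the example.

```python
import time


def split_by_age(entries, n_hours, now=None):
    """Split cached HTML entries into recent and older URLs.

    entries is a list of (url, fetch_time) pairs, with fetch_time in
    seconds since the epoch, as it might be read from the cache index.
    Hypothetical helper; n_hours=None means "no time limit".
    """
    if now is None:
        now = time.time()
    if n_hours is None:
        return [url for url, _ in entries], []
    cutoff = now - n_hours * 3600
    recent = [url for url, t in entries if t >= cutoff]
    older = [url for url, t in entries if t < cutoff]
    return recent, older


now = 1_000_000
entries = [("http://a/new.html", now - 1800),
           ("http://a/old.html", now - 7200)]
print(split_by_age(entries, 1, now))
```

In the terms used above, the first list holds the candidates for main HTML files; an entry in the second list would be handled only as a related file, when a main file links to it and -f is not given.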

About the summary file

Starting from version 1.5, the format of the summary file changed. Now it's a document divided into three areas (frames). The strip at the top is the status frame, the area on the left is the domains frame, and the remaining area is the list frame.
The domains frame contains all the different domains encountered during retrieval of pages. When you click on a domain name, the available documents related to that domain are displayed in the list frame.
To view a retrieved document, click on its icon; clicking on the URL instead downloads the page from the Internet (if the connection is active).

If neither the -w nor the -W option is given, pages are displayed in another browser window. Normally the other window is created once and then, if the user doesn't close it, reused every time a document is selected. With -W the document is viewed in the list frame, allowing easy selection of other domains and other documents. Finally, with -w, the entire index window is used for viewing pages.

By selecting Lists & domains or Simple List from the status frame, one can return immediately to the index of processed pages; in the first case the default layout (domains + list) is used, while in the second the list area takes all the space below the status frame.

Cache generated by Netscape for Windows

In this case one must use the -p option.
It's better to mount the DOS partition with type msdos rather than vfat, because with msdos access is faster and file names aren't case sensitive.

Installation

This software is available in a package containing both source and binary versions.
It can be obtained at
ftp://sunsite.unc.edu/pub/Linux/apps/www/plugins and at
https://members.tripod.com/~giustrov/download.html

To use this program, you must have the DB library installed. It is needed to read the records stored in the index.db file.
In practice you need libdb.so to run the compiled version, and also the db include files to compile the program.
On Linux, with the Slackware and Red Hat distributions, the library should be present by default.
For the include files, with Red Hat you must install a package called db-devel or similar. For Slackware, they are in libc.tgz, so they aren't a problem.

To compile, cd to the src subdirectory and run make.
Run make install to compile and copy the executable to /usr/bin, the man page to /usr/man/man1 and the documentation to /usr/doc/nolce-VERSION.
If the standard destinations don't suit you, modify them in the Makefile.

Compatibility

I have tested the program under Linux only, with Netscape Navigator 3.01, 4.0b5, and 4.03.
It probably works with version 2.0 as well, since the present format of the cache was introduced with that release.
It should also work on other Unix systems, if their Netscape indexes its cache in the same way as the Linux version, that is with a DB hash file named index.db under $HOME/.netscape/cache.
If the name is different, it's easy to change the value of CACHE_FILE, in the defines section of the source file.
As for the language, I use only code conforming to the ANSI C or POSIX standards, so if your system supports them, there should be no problems.

As far as I know, the following circumstances may cause problems or errors when compiling nolce:

The Makefile assumes that your make correctly defines the variable CC as your site's compiler name (e.g. cc or gcc). Every make should ensure this, but if yours doesn't, define it by hand.
The flex program used must be a real flex, not an emulation of the original lex, as happens when using flex -l. This is the case on some Slackware systems, where flex calls the real program flex.slk with the -l option. The result is a segmentation fault when nolce is executed.
In this situation, adding -Darray to DEFINES in the Makefile (see below) solves the problem.
A line LEX=flex is present in the Makefile. On non-Linux systems, this probably should be changed.
The behavior of the lex program may vary. Apart from program options, it often requires linking with some libraries. The standard Linux lex, that is GNU flex, requires the -lfl library, which is provided in the variable LDFLAGS of the Makefile.
If your site uses a different lex, read its documentation and change the Makefile accordingly.
Any options needed by the lex program may be given in the LFLAGS variable.
The program interfaces with the lexical analyzer through the usual yylex() function, called in process_html_file in main.c. Input and output files are supplied to yylex through the extern variables yyin and yyout. This probably doesn't conform to the original AT&T lex but, as far as I know, it conforms to the POSIX specification for lex and, above all, it's almost the only way that works with flex.
Flex defines yytext as a char pointer, while other lex implementations may define it as a char array. If this is your case, you must compile main.c with the -Darray option, which can be done by setting the variable DEFINES of the Makefile.

If problems persist, send me an e-mail describing, besides the problem, what system you're using, which lex and so on. But I don't have access to systems other than my Linux machine, so don't expect a guaranteed solution.

If you discover a bug, e.g. an abnormal exit of the program with a Segmentation Fault error, please let me know. Send me an e-mail with a brief description of the circumstances under which the error happened, the command line options, and above all the core file generated by the program (compress it to keep the mail message small).
Shells let you decide whether you get a core dump after an abnormal termination of a program. With bash, see the command ulimit.
For the core file to be useful to me, it must be generated by a program compiled with debug info: add the option -g3 to CFLAGS in the Makefile. If you have libg installed, also add -lg to LDFLAGS.

However, before sending the core file, the plain output of gdb may be useful. In case of problems, compile nolce with debug info, launch it from the debugger, and when execution stops with the error, give the command bt inside gdb and send me the information displayed.

How it works

i. INDEX.HTML

Many URLs, e.g. http://home.netscape.com, don't contain an HTML file name.
In this situation the server provides a default HTML file, usually index.html, and nolce appends this same name to these URLs.
It could happen that an HTML file contains a link to such a URL with the file name given explicitly. If this name is different from index.html, the link doesn't work.
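
A minimal Python sketch of the naming rule follows. How nolce itself decides that a URL has no file name is not documented here, so the test used below (an empty path, or a path ending with a slash) is an assumption, and with_default_name is a made-up name.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_NAME = "index.html"


def with_default_name(url):
    """Append index.html to a URL that doesn't name a file.

    Assumption for this sketch: a URL lacks a file name when its path
    is empty or ends with a slash.
    """
    parts = urlsplit(url)
    path = parts.path
    if path == "":
        path = "/"
    if path.endswith("/"):
        path += DEFAULT_NAME
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(with_default_name("http://home.netscape.com"))
```

For the example URL this prints http://home.netscape.com/index.html; a URL that already names a file is returned unchanged.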

ii. LINKS

The main work nolce does is changing links in HTML files to point to local files.
There are various types of links (imagine you're browsing the document http://www.aaaa.com/bbb/index.html):

Relative links, e.g. HREF="ccc/image.gif". In this case the browser loads the file image.gif from the directory ccc under bbb.
Absolute links, e.g. HREF="http://www.aaaa.com/ccc/image.gif". In this case Netscape will always try to obtain the document from the net, so nolce transforms the link into something like "../ccc/image.gif".
Base-related links, e.g. HREF="/ccc/image.gif". These links must be interpreted as http://www.aaaa.com/ccc/image.gif, regardless of the directory the HTML file is in.

If a link points to a document present in the cache, it is changed to a relative link; otherwise it's turned into an absolute link.
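
The rule above can be tried out with a short Python sketch. It is not how nolce is implemented (nolce does this in C with a lex scanner); rewrite_link, _local and the /scheme/host/path layout are assumptions made for the illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def _local(url):
    """Hypothetical local layout of a cached URL: /scheme/host/path."""
    parts = urlsplit(url)
    return "/" + parts.scheme + "/" + parts.netloc + parts.path


def rewrite_link(page_url, href, cached_urls):
    """Return the link text to use in the local copy of page_url."""
    # Resolve relative, absolute and base-related links to a full URL.
    absolute = urljoin(page_url, href)
    if absolute not in cached_urls:
        # Not in the cache: keep pointing at the original site.
        return absolute
    # In the cache: express the target relative to the page's directory.
    start = posixpath.dirname(_local(page_url))
    return posixpath.relpath(_local(absolute), start)


page = "http://www.aaaa.com/bbb/index.html"
cached = {"http://www.aaaa.com/ccc/image.gif"}
print(rewrite_link(page, "/ccc/image.gif", cached))
print(rewrite_link(page, "other.html", cached))
```

The first call prints ../ccc/image.gif, since the target is cached, and the second prints http://www.aaaa.com/bbb/other.html, since it is not.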

iii. LEX

If your lex program is GNU flex, the flag -Cf may be given to it (put it in the variable LFLAGS of the Makefile). This makes the program bigger, but speeds up execution by 10-15%.

iv. MISCELLANEOUS

In the file nolce.h there are some defines which can be customized.
Links pointing to documents which are present in the cache are shown in italics. Obviously the HTML document may contain links which were already in italics, and in this case they may point to non-local files.
Besides, if a link is presented as formatted text, e.g. <h3>Link</h3>, the italics aren't shown.
If two or more versions of a document are present in the cache, the most recent one keeps its original name; for the others a progressive number is appended to the URL.
Netscape seems to have problems following links to local files which contain characters like `?'. Mainly for this reason, when creating directories, unusual characters like `?', `=', `(' and so on are replaced with an underscore.
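
The character substitution might look like this in Python. The exact set of characters nolce replaces is not listed here, so the set used in the sketch (everything except letters, digits, and the characters . _ ~ / -) is an assumption, as is the function name sanitize.

```python
import re

# Assumption: any character outside this set counts as "strange".
_STRANGE = re.compile(r"[^A-Za-z0-9._~/-]")


def sanitize(path):
    """Replace characters such as ? = ( ) with an underscore."""
    return _STRANGE.sub("_", path)


print(sanitize("cgi-bin/search?q=a(b)"))
```

This prints cgi-bin/search_q_a_b_, while a plain path such as raiuno/aree.html is left as it is.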

Contacting the author

For any question, bug report or comment, send e-mail to g.trovato@usa.net
My home page is
https://members.tripod.com/~giustrov

The nolce web page is:
https://members.tripod.com/~giustrov/nolce.html