Web Page Management Software

John Hurst

Version 1.3.3

20091203:093814

This document defines and describes the suite of programs used to create my web page environment on the range of machines that I use.

1	Introduction
2	Literate Data
3	The Main Program index.py
	3.1	Define Various String Patterns
	3.2	Define Global Variables and Constants
	3.3	Handle JPEGs
		3.3.1	Get JPG File from Remote Server
	3.4	Collect HTTP Request
		3.4.1	Handle REDIRECT QUERY STRING
	3.5	Get Filename from Redirect
	3.6	Check for Abbreviated URL
	3.7	Make File and Dir Absolute
	3.8	Check for HTML Request
	3.9	Determine the Host and Server Environments
	3.10	Get Default XSLT File
	3.11	Scan for Locally Defined XSLT File
	3.12	Determine XSLT File
	3.13	Update Counter
	3.14	Process File
		3.14.1	Process an XML File
4	File Caching
	4.1	The File Cache Module
	4.2	Clearing the Cache
5	The Makefile
6	TODOs
7	Indices
	7.1	Files
	7.2	Chunks
	7.3	Identifiers

1. Introduction

This document describes the files used to manage delivery of my personal web pages, and those that I manage for other organisations. The general form of web page delivery is a) a source file written in XML, b) a translation file written in XSLT, and c) the program described here, a Python CGI script that calls the appropriate translator on the source file and delivers the result. It also handles straight HTML, and provides some debug and other maintenance options.

The program is invoked by commands in the .htaccess file associated with each web directory. Different .htaccess files can be used for different directories. If none exist in a given directory, the directory path is searched towards the root until one is found.

The XSLT files used can be specified either in the .htaccess file (default), or in the source XML file, through an explicit xml-stylesheet command. If a stylesheet XSLT file is specified, it overrides the default .htaccess one.

Permission is given to reuse this document, provided that the source is acknowledged, and that any changes are noted in the documentation.

The document is in the form of a literate program, and generates all files necessary to maintain the working environment, including a Makefile.

2. Literate Data

<edit warning 2.1> =

#
# DO NOT EDIT this file!
#   see $HOME/Computers/Sources/Web/web.xlp instead
#   this also gives further explanation of the program logic
#

Chunk referenced in 3.1

This message flags the fact that the source code is a derived document, and should not be directly edited.

3. The Main Program `index.py`

"index.py" 3.1 =

#!/usr/bin/python
<edit warning 2.1>
version="<current version .1>"

# This script processes all my web page XML files
# It requires apache to be configured:
#   a) to allow python files (this file) to run as cgi in user directories
#   b) to add a handler for XML files that calls this program
#   c) to pass as cgi parm the XSLT file that translates the XML file
# These are done in a .htaccess file for each directory (and its
# subdirectories) that require XML processing with a particular
# XSLT stylesheet.
#
# The script relies upon picking up the required file and its XSLT file
# from a) the REDIRECT environment variables, and b) the script parameter,
# respectively.
#
#

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

Start with a bit of explanatory comment in case direct access to this literate program is unavailable.

The interpreter required varies according to the target server. This detail is captured by the <Makefile 5.2> script, although not all systems have yet been encoded into the Makefile script.

"index.py" 3.2 =

import cgi ; import cgitb ; cgitb.enable()

import cgi
import commands
import datetime
import filecache
import os, os.path
import re
import sys
from subprocess import PIPE,Popen
import time
import urllib2
import urlparse
import xml.dom.minidom

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

Gather together all the module and library imports needed for this program.

"index.py" 3.3 =

now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
todayStr=now.strftime("%d %b %Y")
htmlmod=xmlmod=0

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

Start processing. Get a timestamp for recording key events in the log. Set the HTML and XML modification times to zero ("year dot").

"index.py" 3.4 =

<determine the host and server environments 3.22>
<define global variables and constants 3.9,3.10,3.11>
<check for and handle jpegs 3.12>

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

"index.py" 3.5 =

# start the html output
print "Content-type: text/html\n"
#print "<p>NEW TEST MESSAGE!</p>\n"
#print "*** server=%s ***" % server

<define various string patterns 3.8>

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

This fragment simply outputs the required header for flagging the generated content as HTML.

"index.py" 3.6 =

<collect HTTP request 3.15>
<get filename from redirect 3.17>

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

These two fragments explore the incoming parameters in order to find out what file is to be processed.

"index.py" 3.7 =

<check for abbreviated URL 3.18>
<make file and dir absolute 3.19>
<check for HTML request 3.20,3.21>
<get default XSLT file 3.23>
<scan for locally defined XSLT file 3.24>
<determine xslt file 3.26>
<update counter 3.27,3.28,3.29,3.30>

if debug:
  print "\n<p>\n"
  print "%s: server        = %s<br/>" % (tsstring,server)
  print "%s: host          = %s<br/>" % (tsstring,host)
  print "%s: dir           = %s<br/>" % (tsstring,dir)
  print "%s: requestedFile = %s<br/>" % (tsstring,requestedFile)
  print "%s: relcwd        = %s<br/>" % (tsstring,relcwd)
  print "%s: relfile       = %s<br/>" % (tsstring,relfile)
  print "%s: counter       = %s<br/>" % (tsstring,counterName)
  print "%s: alreadyHTML   = %s<br/>" % (tsstring,alreadyHTML)
  print "%s: cachedHTML    = %s<br/>" % (tsstring,cachedHTML)
  print "%s: os.environ    = %s\n</p>\n" % (tsstring,repr(os.environ))

<process file 3.31,3.32>

now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
#sys.stderr.write("%s: [client %s] request satisfied\n" % (tsstring,remoteAdr))

Chunk defined in 3.1,3.2,3.3,3.4,3.5,3.6,3.7

At this point, processing is complete, and the program falls through to exit.

3.1 Define Various String Patterns

<define various string patterns 3.8> =

#  - to extract directory and filename from request
filepat=re.compile('/~ajh/?(.*)/([^/]*)$')
filename='index.xml'

#  - to detect stylesheet request (optional)
stylesheet=re.compile('<\?xml-stylesheet.*href="(.*)"')

#  - to terminate file scanning
doctype=re.compile('<!DOCTYPE')

# to check for missing htmls
htmlpat=re.compile('(.*)\.html$')

# to check for xsl specification in htaccess
xslspec=re.compile('.*?xslfile=(.*)&')

Chunk referenced in 3.5

3.2 Define Global Variables and Constants

<define global variables and constants 3.9> =

debug=False
returnXML=False
convertXML=False
alreadyHTML=False
cachedHTML=False
xslfile=""

Chunk referenced in 3.4
Chunk defined in 3.9,3.10,3.11

returnXML is set True when the display of the raw untranslated XML is required.

convertXML is set True when a converted copy of the translated XML is required to be saved.

alreadyHTML is set True when the incoming file to be rendered is already in HTML and does not require conversion.

cachedHTML is set True when the incoming file to be rendered has been cached in the HTMLS directory, and does not require conversion.

<define global variables and constants 3.10> =

# BASE is the path to the web base directory - with no trailing slash!
if system==MacOSX:
  HOME="/home/ajh/"
  BASE="/home/ajh/www"
  host=re.sub('dyn-130-194-\d+-\d+','dyn',host)
  PRIVATE="/home/ajh/local/"+host
elif system==Solaris:
  HOME="/u/staff1/ajh"
  BASE="/u/web/homes/ajh"
  PRIVATE=BASE+"/local/"+host
elif system==Linux:
  if docRoot == '/var/www/cerg':
    HOME="/var/www/cerg/"
    BASE="/var/www/cerg/"
    PRIVATE="/var/www/cerg/local/"+host
  elif docRoot == '/home/ajh/public_html/parish/GWICC':
    HOME="/home/ajh/public_html/parish/GWICC"
    BASE=HOME
    PRIVATE="/home/ajh/local/"+host+"/parish/GWICC"
  else:
    HOME="/home/ajh/"
    BASE="/home/ajh/www"
    PRIVATE="/home/ajh/local/"+host
COUNTERS=PRIVATE+"/counters/"
HTMLS=PRIVATE+"/htmls/"
WEBDIR="file://"+BASE

Chunk referenced in 3.4
Chunk defined in 3.9,3.10,3.11

BASE is set to the path to the web root directory on this server. It should not have a trailing slash!

HOME has its usual Unix meaning.

PRIVATE is set to the path to a working directory on this particular server that is used to store accounting and audit information about this particular access. The path includes a specific reference to the server hostname to uniquely distinguish it.

HTMLS is the path to a local copy of HTML versions of the files. These are cached versions, and a mechanism to age and delete them has yet to be identified. If the corresponding XML file is older than the HTML file found in this subdirectory, the HTML version is used.

<define global variables and constants 3.11> =

# define the XSLTPROC
if system==MacOSX:
  XSLTPROC="/usr/bin/xsltproc"
elif system==Solaris:
  XSLTPROC="/usr/monash/bin/xsltproc"
elif system==Linux:
  XSLTPROC="/usr/bin/xsltproc"

Chunk referenced in 3.4
Chunk defined in 3.9,3.10,3.11

XSLTPROC is the path to the xsltproc processor. Without this processor, this entire script (as far as XML files are concerned) is meaningless!

3.3 Handle JPEGs

From version 1.2.0 onwards, this code implements a form of caching for jpg files. A local check for the requested file is made, and if it is not found, an attempt is made to retrieve it from the dimboola server. If that is not successful, the file is reported not found. If it is successful, the file is saved locally. No attempt is made to age files out of the cache.

<check for and handle jpegs 3.12> =

#sys.stderr.write("Just a check version 1.2.3\n")
cachetime=60*60*24*7 # one week
# check for jpgs
if os.environ.has_key('REQUEST_URI'):
  uri=os.environ['REQUEST_URI']
  (scheme,netloc,path,parms,query,fragment)=urlparse.urlparse(uri)
  #sys.stderr.write("path=%s\n" % path)
  filename=re.sub('/~ajh/','/home/ajh/www/',path)
  (root,ext)=os.path.splitext(filename)
  ext=ext.lower()
  if ext=='.jpg':
    basedir=os.path.dirname(filename)
    if os.path.exists(filename):
      #sys.stderr.write("Got file %s locally\n" % filename)
      f=open(filename,'rb').read() # binary mode: image data, not text
    else:
      <get jpg file from remote server 3.13,3.14>
    print "Content-Type: image/jpeg\n"
    print f # display the image
    sys.exit(0)
  else:
    pass

Chunk referenced in 3.4

This code checks to see if the request is for a jpg image file. These are cached, and if not present, are retrieved from the master jpg server for my jpeg images. This is still a bit experimental. The server URL is dimboola.infotech.monash.edu.au/~ajh/Pictures.

It requires that the .htaccess file be modified to refer .jpg requests to this cgi script.
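
The lookup order described above (local copy first, then the master server) can be sketched as follows. This is a Python 3 sketch rather than the Python 2 of index.py. The function name get_jpeg and the injected fetch callable are hypothetical, introduced here so that the logic can be exercised without network access.

```python
import os
import tempfile

def get_jpeg(local_path, fetch):
    """Return (data, source) for a JPEG, preferring the local copy.

    fetch is a hypothetical callable returning the remote bytes; in
    index.py that role is played by a request to the master server.
    """
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:   # binary mode: JPEGs are not text
            return f.read(), "local"
    data = fetch()                           # may raise if the remote copy is missing
    try:
        with open(local_path, "wb") as f:   # best-effort write to the cache
            f.write(data)
    except OSError:
        pass                                 # unwritable cache directory: still serve the data
    return data, "remote"

# Demonstration with a stand-in for the remote server.
cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "photo.jpg")
first = get_jpeg(path, lambda: b"\xff\xd8 fake jpeg bytes")
second = get_jpeg(path, lambda: b"never called")
print(first[1], second[1])   # remote local
```

The second call never invokes fetch, because the first call has already populated the cache.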

3.3.1 Get JPG File from Remote Server

<get jpg file from remote server 3.13> =

newurl="http://dimboola.infotech.monash.edu.au%s" % path
#sys.stderr.write("using url %s\n" % newurl)
urlobj=urllib2.urlopen(newurl)
f=urlobj.read()
modtimestr=urlobj.info()['Last-Modified']
modtime=time.strptime(modtimestr,"%a, %d %b %Y %H:%M:%S %Z")

Chunk referenced in 3.12
Chunk defined in 3.13,3.14

Generate the URL of the corresponding remote JPG file, and issue a read request. By using the urllib2 library, we also get the modification time, which we parse in order to set the correct modification time on the locally cached copy.

<get jpg file from remote server 3.14> =

try:
  fc=open(filename,'wb') # binary mode: image data, not text
  fc.write(f)
  fc.close()
  #touch filename -mt time.strftime("%Y%m%d%H%M.%S")
  import calendar # Last-Modified is GMT; time.mktime would assume local time
  mtime=calendar.timegm(modtime)
  imtime=int(mtime)
  nowtime=time.localtime()
  currtime=int(time.mktime(nowtime)) # local
  os.utime(filename,(currtime,imtime))
  #sys.stderr.write("%s: cached %s\n" % (tsstring,filename))
except (IOError,OSError):
  #errmsg=os.strerror(errcode)
  sys.stderr.write("%s: Cannot write cache file %s\n" % (tsstring,filename))

Chunk referenced in 3.12
Chunk defined in 3.13,3.14

Now try to cache a local copy. This can fail for several reasons, the main one being that the permissions on the local directory are likely to deny write access to the www user. More work is required to make this a bit more robust.

Note that we set the pair (access time, modification time) on the local file to be the current time and remote file modification time respectively. This ensures that attempts to synchronize the two file systems will see this file as the same file as the remote file, and not attempt to update one or the other (thus leading to spurious modification times).
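
The timestamp handling can be illustrated in isolation. The following Python 3 sketch parses a Last-Modified header and applies the (access time, modification time) pair to a file. The function name set_times_from_header and the sample header value are assumptions made for the example. Note that the sketch converts with calendar.timegm, because Last-Modified is expressed in GMT, whereas time.mktime interprets its argument as local time.

```python
import calendar
import os
import tempfile
import time

def set_times_from_header(path, last_modified):
    """Set (atime, mtime) of path to (now, remote modification time)."""
    parsed = time.strptime(last_modified, "%a, %d %b %Y %H:%M:%S %Z")
    mtime = calendar.timegm(parsed)          # interpret the parsed time as GMT
    os.utime(path, (int(time.time()), mtime))
    return mtime

# Demonstration on a scratch file standing in for the cached JPG.
handle, cached = tempfile.mkstemp(suffix=".jpg")
os.close(handle)
remote_mtime = set_times_from_header(cached, "Thu, 03 Dec 2009 09:38:14 GMT")
```

After the call, the modification time of the scratch file matches the remote timestamp, which is the property that file synchronisation relies upon.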

3.4 Collect HTTP Request

<collect HTTP request 3.15> =

# collect the original parameters from the redirect (if there is one!)
if os.environ.has_key('REDIRECT_QUERY_STRING'):
  <handle redirect query string 3.16>
else:
  form={}

requestedFile="" {Note 3.15.1}
  
remoteAdr=''
if os.environ.has_key('REMOTE_ADDR'):
  remoteAdr=os.environ['REMOTE_ADDR']

if debug:
  print "<p>%s: (server,host)=(%s,%s)<br/>\n" % (tsstring,server,host)
  print "%s: (system,PRIVATE)=(%s,%s)<br/>\n" % (tsstring,system,PRIVATE)
  print "%s: (BASE,HOME,PRIVATE)=(%s,%s,%s)</p>\n" % (tsstring,BASE,HOME,PRIVATE)

Chunk referenced in 3.6

{Note 3.15.1}: initialize the filename of the file to be rendered. Most of the work in computing the value of this variable is done in <get filename from redirect 3.17>

When this script is called, it has gained control by virtue of an .htaccess directive to Apache to use this program to render the source file. The name of that source file has to be recovered somehow, and different systems seem to handle this parameter in different ways. The first parameter to explore is the REDIRECT_QUERY_STRING, which, if it is present in the form request, contains secondary parameters to the rendering operation. If this parameter is not present, initialize the variable form to an empty value.

3.4.1 Handle REDIRECT QUERY STRING

<handle redirect query string 3.16> =

query_string=os.environ['REDIRECT_QUERY_STRING']
form=cgi.parse_qs(query_string)
if form.has_key('debug') and form['debug'][0]=='true':
  sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
  debug=True
  print "<h1>%s: INDEX.PY version %s</h1>\n" % (tsstring,version)
  print "<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ))
  print "<p>%s: form=%s</p>\n" % (tsstring,repr(form))
  sys.stderr.write("%s: redirect_query string=%s\n" % (tsstring,query_string))
if form.has_key('xml'):
  if form['xml'][0]=='true':
    sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
    returnXML=True
    if debug:
      print "<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ))
      print "<p>%s: form=%s</p>\n" % (tsstring,repr(form))
      sys.stderr.write("%s: redirect_query string=%s\n" % \
          (tsstring,query_string))
  elif form['xml'][0]=='convert':
    convertXML=True

Chunk referenced in 3.15

There are several possibilities for secondary parameters. The primary one is the debug parameter, which can be set to true, indicating that debugging information is to be printed along with the rendering. This is intended for administrator access only, but as it is harmless, there is no authentication required.

The other parameter that can be offered at this point is the xml parameter, with values of true or convert. The first of these forces no conversion of the XML, but simply copies it to the browser, substituting escape sequences for any special XML character sequences so that it appears as verbatim XML.

The second choice, convert, allows the use of the rendering engine as an XML-to-HTML converter, in which case a copy of the converted HTML is saved to a temporary file. This file can be used subsequently as a statically converted file as necessary.
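
The mapping from secondary parameters to the three flags can be sketched in a few lines. This is a Python 3 sketch using urllib.parse.parse_qs in place of the cgi.parse_qs call in the chunk above; the function name read_flags is hypothetical.

```python
from urllib.parse import parse_qs

def read_flags(query_string):
    """Map the debug and xml parameters onto the script's three flags."""
    form = parse_qs(query_string)            # values are lists, as in index.py
    xml = form.get("xml", [""])[0]
    return {
        "debug": form.get("debug", [""])[0] == "true",
        "returnXML": xml == "true",
        "convertXML": xml == "convert",
    }

print(read_flags("debug=true&xml=convert"))
```

Any value other than true or convert for the xml parameter leaves both XML flags unset.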

3.5 Get Filename from Redirect

<get filename from redirect 3.17> =

# get the file name from the redirect environment
if system==MacOSX:
  scriptURL='REQUEST_URI'
elif system==Solaris:
  scriptURL='REDIRECT_URL'
elif system==Linux:
  scriptURL='REDIRECT_URL'
if os.environ.has_key(scriptURL):
  requestedFile=os.environ[scriptURL]
  argpos=requestedFile.find('?')
  if argpos>=0:
    requestedFile=requestedFile[0:argpos]
  if debug:
    sys.stderr.write("%s: [client %s] requesting %s\n" % \
        (tsstring,remoteAdr,requestedFile))
orgfile=requestedFile
# analyse file request. If a bare directory, add 'index.xml'
if os.environ.has_key('REDIRECT_STATUS') and \
   os.environ['REDIRECT_STATUS']=='404':
  res=htmlpat.match(requestedFile)
  if res:
    filename=res.group(1)+'.xml'
    requestedFile=filename
dir=relcwd=""
res=filepat.match(requestedFile)
if res:
  dir=res.group(1)
  relcwd=dir
  # protocol for relcwd:
  #   no subdir     => relcwd = '' (empty)
  #   exists subdir => relcwd = subdir (no leading or trailing slash)
  if dir!="":
    requestedFile=dir+'/'+res.group(2)
  else:
    requestedFile=res.group(2)
  filename=res.group(2)
else:
  # not ajh (sub)directory, extract full directory path
  dir=os.path.dirname(requestedFile)
  relcwd=dir
  filename=os.path.basename(requestedFile)

if debug:
  print "<p>%s: dir,requestedFile,relcwd to process = %s,%s,%s</p>" % \
      (tsstring,dir,requestedFile,relcwd)

Chunk referenced in 3.6
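
The effect of filepat on typical requests is easier to see with a few examples. The following Python 3 sketch reuses the pattern from <define various string patterns 3.8>; the function name split_request and the sample paths are made up for illustration. It returns the triple (relcwd, requestedFile, filename).

```python
import os.path
import re

filepat = re.compile(r'/~ajh/?(.*)/([^/]*)$')

def split_request(requested):
    """Return (relcwd, requestedFile, filename) for a request path."""
    res = filepat.match(requested)
    if res:
        subdir, name = res.group(1), res.group(2)
        # no subdirectory => relcwd is empty, with no leading or trailing slash
        rel = subdir + '/' + name if subdir else name
        return subdir, rel, name
    # not an ajh (sub)directory: keep the full directory path
    return os.path.dirname(requested), requested, os.path.basename(requested)

print(split_request('/~ajh/research/index.xml'))
print(split_request('/~ajh/index.xml'))
```

A request for the bare directory /~ajh/ yields three empty strings, which is the case picked up by the abbreviated URL check in the next section.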

3.6 Check for Abbreviated URL

<check for abbreviated URL 3.18> =

if requestedFile=='' or requestedFile[-1]=='/':
  requestedFile+='index.xml'
  filename='index.xml'

Chunk referenced in 3.7

3.7 Make File and Dir Absolute

<make file and dir absolute 3.19> =

requestedFile=re.sub('^/','',requestedFile) # remove any leading /
relfile=requestedFile
requestedFile=BASE+'/'+requestedFile
dir=BASE+'/'+dir

Chunk referenced in 3.7

3.8 Check for HTML Request

We now have a requestedFile name for the document to be rendered. We need to investigate this file to see how it is to be rendered. In particular, it may be an HTML file (indicated by a .html extension), or it may be an XML file previously rendered and cached. In these cases, we do not need to do any XML conversion, and the flag alreadyHTML is set true if it is an HTML file, or the flag cachedHTML is set true if it is a cached converted XML to HTML file.

<check for HTML request 3.20> =

res=htmlpat.match(requestedFile)
if res:
  # we have an HTML request, check if it exists
  if os.path.exists(requestedFile):
    # exists, use that
    alreadyHTML=True
    if debug: print "requested file %s is already html<br/>" % (requestedFile)
  else:
    # doesn't exist, convert from XML
    filename=res.group(1)+'.xml'
    requestedFile=filename

Chunk referenced in 3.7
Chunk defined in 3.20,3.21

This code now also checks for a cached version of the XML file, as per the following fragment.

<check for HTML request 3.21> =

if not alreadyHTML:
  patn="(%s/)(.*)\.xml" % (BASE)
  if debug: 
    print "<p>matching xml=%s with pattern=%s<br/>" % (requestedFile,patn)
  res=re.match(patn,requestedFile)
  if res:
    base=res.group(1); path=res.group(2)
    if debug: print "matched BASE=%s,path=%s<br/>" % (base,path)
    htmlpath="%s%s.html" % (HTMLS,path)
    if os.path.exists(htmlpath):
      htmlstat=os.stat(htmlpath)
      xmlstat=os.stat(requestedFile)
      htmlmod=htmlstat.st_mtime
      xmlmod=xmlstat.st_mtime
      if xmlmod < htmlmod and not form:
        # cached version is newer use that
        if debug: print "using cached file %s<br/>" % (htmlpath)
        requestedFile=htmlpath
        cachedHTML=True
    else:
      if debug: print "no cached version of %s<br/>" % (requestedFile)
  else:
    if debug: print "requested file %s is not XML<br/>" % (requestedFile)

Chunk referenced in 3.7
Chunk defined in 3.20,3.21

Unless the file being retrieved is already an HTML file, check to see if we have a cached HTML version of this (XML) file. Note that any parameters to the HTTP request (indicated by a non-empty form value) will abort the caching process, and force a reload of the XML file.
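
The freshness rule reduces to a comparison of modification times, guarded by the check on request parameters. A minimal Python 3 sketch follows; the function name use_cached and the scratch files are assumptions made for the example.

```python
import os
import tempfile

def use_cached(xml_path, html_path, form):
    """True when the cached HTML may be served instead of translating."""
    if form or not os.path.exists(html_path):
        return False                         # parameters present, or nothing cached
    return os.stat(xml_path).st_mtime < os.stat(html_path).st_mtime

# Demonstration: an XML source older than its cached HTML rendering.
work = tempfile.mkdtemp()
xml_path = os.path.join(work, "page.xml")
html_path = os.path.join(work, "page.html")
for p in (xml_path, html_path):
    open(p, "w").close()
os.utime(xml_path, (100, 100))
os.utime(html_path, (200, 200))
print(use_cached(xml_path, html_path, {}))   # True
```

Passing a non-empty form, such as {'debug': ['true']}, makes the same call return False.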

3.9 Determine the Host and Server Environments

<determine the host and server environments 3.22> =

# determine which host/server environment
host=commands.getoutput('hostname')
host=re.split('\.',host)[0] # break off leading part before the '.' char

try:
  host=os.environ["HOSTNAME"]
except KeyError:
  cmd='/bin/hostname'
  pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True)
  host=pid.communicate()[0].strip()
try:
  server=os.environ["SERVER_NAME"]
except KeyError:
  server='localhost'
try:
  docRoot=os.environ["DOCUMENT_ROOT"]
except KeyError:
  docRoot='/Users/ajh/www'
if os.environ.has_key("SCRIPT_URL"):
  URL=os.environ["SCRIPT_URL"]
elif os.environ.has_key("REDIRECT_URL"):
  URL=os.environ["REDIRECT_URL"]
else:
  URL="We got a problem"

# determine the server and host names
#
# the server is the address to which this request was directed, and is
# useful in making decisions about what to render to the client.
# Examples are "localhost", "www.ajh.id.au", "chairsabs.org.au".
#
# the host is the machine upon which the server is running, and may be
# different from the server.  This name is used to determine where to
# store local data, such as logging information.  For example, the
# server may be "localhost", but this can run on a variety of hosts:
# "murtoa", "dimboola", "dyn-130-194-xx-xx", etc.  Incidentally, hosts
# of the form "dyn-130-194-xx-xx" are mashed down to the generic "dyn".

MacOSX='MacOSX' ; Solaris='Solaris' ; Linux="Linux"
system=MacOSX # unless told otherwise
if server in ["localhost"]:
  pass
elif server in ["www.csse.monash.edu.au"]:
  host='csse' ; ostype='Solaris'
  system=Solaris
elif server in ['www.ajhurst.org','ajhurst.org','eregnans.ajhurst.org',\
                'chairsabs.org.au','cahurst.org',\
               'glenwaverleychurches.org','www.glenwaverleychurches.org',\
                'njhurst.com','www.njhurst.com','regnans.njhurst.com']:
  host='eregnans' ; ostype='Linux'
  system=Linux
elif server in ['cerg.infotech.monash.edu.au','cerg.csse.monash.edu.au']:
  host='cerg' ; ostype='Linux'
  server='cerg.infotech.monash.edu.au' # repoint the server address
  system=Linux
elif server in ['murtoa.local','murtoa.infotech.monash.edu.au']:
  host='murtoa'
elif server in ['dimboola','dimboola.local','dimboola.ajh.id.au',
                'ajh.id.au','www.ajh.id.au']:
  host='dimboola'
elif server in ['ararat','ararat.local']:
  host='ararat'
elif server in ['10.0.0.110','ajh.id.au','www.ajh.id.au']:
  server='www.ajh.id.au'
  host='dimboola'
elif server[0:11]=='dyn-130-194-':
  server='localhost'
  host='dyn'
else:
  sys.stderr.write("Could not determine server/host values\n")
  sys.stderr.write("(supplied values=%s/%s)\n" % (server,host))
  sys.stderr.write("Using default values \"localhost/murtoa\"\n")
  server='localhost'
  host='murtoa'
  ostype='MacOSX'
  system=MacOSX

Chunk referenced in 3.4
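
The treatment of host names (keep the leading component, and mash dynamic addresses down to the generic "dyn") can be sketched as follows. This is a Python 3 sketch; the function name normalise_host and the sample host names are hypothetical.

```python
import re

def normalise_host(hostname):
    """Keep the part before the first '.', and collapse dynamic names to 'dyn'."""
    short = hostname.split('.')[0]
    return re.sub(r'^dyn-130-194-\d+-\d+$', 'dyn', short)

print(normalise_host('dyn-130-194-12-34.example.org'))   # dyn
print(normalise_host('murtoa.local'))                    # murtoa
```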

3.10 Get Default XSLT File

<get default XSLT file 3.23> =

# collect the XSLT file name from the .htaccess referent
if os.environ.has_key('QUERY_STRING'):
  query_string=os.environ['QUERY_STRING']
else:
  query_string='xslfile=%s/lib/xsl/ajhwebdoc.xsl&/~ajh/index.xml' % (BASE)
if debug:
  sys.stderr.write("%s: query string=%s\n" % (tsstring,query_string))
form2=cgi.parse_qs(query_string)
if debug:
  print("<p>%s: form2=%s</p>\n" % (tsstring,repr(form2)))
if form2.has_key('xslfile'):
  xslfile=form2['xslfile'][0]
  if debug:
    print "<p>%s: got xslfile=%s</p>\n" % (tsstring,xslfile)

Chunk referenced in 3.7

3.11 Scan for Locally Defined XSLT File

<scan for locally defined XSLT file 3.24> =

# Check the requested file for a local stylesheet.  We also scan the
# entire file, replacing any symbolic references to $WEBDIR with the
# full path for the current machine. Note that the DOCTYPE statement
# must start a line by itself.
try:
  filed=open(requestedFile)
  text='' ; linecount = 0
  trackXML=debug and not (alreadyHTML or cachedHTML)
  while 1: # keep scanning file until we find no more XML directives
    line=filed.readline()
    if line=='': # this is EOF, so quit
      if linecount==0:
        print "(empty file)"
      break
    linecount+=1
    line=line.strip() # remove NL
    text+=' '+line
    if trackXML:
      print "<p>read line='%s'" % (cgi.escape(line))
    # check if end of directives, indicated by normal element tag start
    res=re.match('<[^?!]',line)
    if res:
      break
  if trackXML:
    print "<p>text read='%s'" % (cgi.escape(text))
  res=re.match('.*(<\?xml-stylesheet)(.*?)(\?>)',text)
  if res:
    parms=res.group(2)
    # now we have the stylesheet parameters
    res=re.match('.*href="(.*?)"',parms)
    if res:
      # extract filename
      xslfile=res.group(1)
      xslfile=re.sub('(\$WEBDIR)',WEBDIR,xslfile)
      if debug:
        print "<p>%s: stylesheet in xml file, href=%s</p>" % (tsstring,xslfile)
      <check if xslfile more recent than cached version 3.25>
    else:
      if trackXML:
        print "<p>Did not find stylesheet href in %s" % (parms)
  else:
    if trackXML:
      print "<p>Did not find stylesheet reference in %s" % (cgi.escape(text))
  filed.close()
except:
  print """
    <h1>Sorry!! (Error 404)</h1>
    <p>While processing your request for file %s,<br/>
    it was found that the corresponding XML file %s does not exist</p>
    <p>Please check that the URL is correct</p>
    Exception values: (%s,%s,%s)
    """ % (orgfile,requestedFile,sys.exc_type,sys.exc_value,sys.exc_traceback)
  sys.exit(0)
#newfiled.close()

Chunk referenced in 3.7
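
The extraction of the stylesheet reference can be shown on its own. The following Python 3 sketch applies the same two-step match (first the xml-stylesheet directive, then its href) to a text already gathered from the head of the file. The function name find_stylesheet and the sample document are assumptions made for the example.

```python
import re

def find_stylesheet(text, webdir):
    """Return the stylesheet href with $WEBDIR expanded, or None."""
    res = re.search(r'<\?xml-stylesheet(.*?)\?>', text, re.S)
    if not res:
        return None                          # no stylesheet directive at all
    href = re.search(r'href="(.*?)"', res.group(1))
    if not href:
        return None                          # directive present, but no href
    return href.group(1).replace('$WEBDIR', webdir)

head = ('<?xml version="1.0"?> '
        '<?xml-stylesheet type="text/xsl" href="$WEBDIR/lib/xsl/ajhwebdoc.xsl"?> '
        '<!DOCTYPE page>')
print(find_stylesheet(head, 'file:///home/ajh/www'))
```

A file without the directive returns None, in which case the default from the .htaccess file remains in force.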

<check if xslfile more recent than cached version 3.25> =

localXSLfile=re.sub('file://','',xslfile)
try:
  xslmod=os.stat(localXSLfile).st_mtime # compare times, not the stat record
  if htmlmod < xslmod:
    cachedHTML=False
    if debug: print "<p>XSL newer than HTML, reloading</p>"
except: # ignore any errors from this
  pass

Chunk referenced in 3.24

Look at modification time of XSL file. If it is more recent than the cached HTML file, we must re-convert the XML file.

3.12 Determine XSLT File

<determine xslt file 3.26> =

# have we got an xslfile yet?
htacc=None
if xslfile=="":
  # no, so check all .htaccess
  # first grab directory
  while len(dir)>=len(BASE):
    if debug:
      print "<p>directory=%s</p>\n" % (dir)
    if os.path.isfile(dir+"/.htaccess"):
      htacc=open(dir+"/.htaccess")
      if debug:
        print "<p>found .htaccess in directory %s</p>" % (dir)
      break
    else:
      dir=os.path.dirname(dir)
if htacc:
  for line in htacc.readlines():
    res=xslspec.match(line)
    if res:
      xslfile=res.group(1)
      if xslfile[0] != '/': 
        xslfile=BASE+'/'+xslfile
      break
  if debug:
    print "<p>found xslfile %s in .htaccess</p>" % (xslfile)

if system==Solaris:
  xslfile=re.sub('/home/ajh'+'/www','/u/web/homes/ajh',xslfile)
  if xslfile[0]!='/' and not (xslfile[0:5]=='file:'):
    xslfile='/u/web/homes/ajh/'+xslfile

Chunk referenced in 3.7
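
The search towards the root for an .htaccess file, followed by the scan for an xslfile specification, can be sketched as below. This is a Python 3 sketch; the function names are hypothetical, and the Action line written to the demonstration .htaccess is an invented example of a line containing "xslfile=...&", not a copy of a real configuration.

```python
import os
import re
import tempfile

xslspec = re.compile(r'.*?xslfile=(.*)&')

def find_htaccess(start_dir, base):
    """Walk from start_dir up to base, returning the first .htaccess found."""
    d = start_dir
    while len(d) >= len(base):
        candidate = os.path.join(d, '.htaccess')
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:                      # reached the filesystem root
            break
        d = parent
    return None

def xslfile_from_htaccess(path, base):
    """Return the first xslfile named in the .htaccess file, made absolute."""
    with open(path) as f:
        for line in f:
            res = xslspec.match(line)
            if res:
                xslfile = res.group(1)
                return xslfile if xslfile.startswith('/') else base + '/' + xslfile
    return None

# Demonstration in a scratch directory tree.
base = tempfile.mkdtemp()
sub = os.path.join(base, 'a', 'b')
os.makedirs(sub)
with open(os.path.join(base, '.htaccess'), 'w') as f:
    f.write('Action xml-handler /cgi-bin/index.py?xslfile=lib/xsl/ajhwebdoc.xsl&\n')
found = find_htaccess(sub, base)
print(xslfile_from_htaccess(found, base))
```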

3.13 Update Counter

Compute the name of an XML counter file which contains a counter element with subelements value and date. The value element contains the current count value, and the date element is the date on which this XML file was initialised. We read the current count from that file, increment it, and update the file. This file is used by most xslt translations to output an access count in the footer. It is also used by the site map program to compute the intensity of accesses to this web page.

It was fortuitous, but this counter also keeps track of HTML accesses, both where an HTML file is the initial request, and where it is a cached version of the corresponding XML file. Since the XML files have their own counters included by the XSLT translator, the count attached to the HTML rendering allows a comparison of how many accesses are to the cached copy (the difference between the two).

For example, suppose the XML rendering gives 986 references, and the HTML rendering cites 993 references. Then the cached HTML page has itself been referenced 7 times since it was first cached.

<update counter 3.27> =

counterName=re.sub("/~ajh/",'',relfile)
counterName=re.sub("^/",'',counterName)
extnPattern=re.compile("(\.xml|\.html)$") # match a literal trailing extension only
counterName=re.sub(extnPattern,'',counterName)
counterName=COUNTERS+re.sub("/","-",counterName)

Chunk referenced in 3.7
Chunk defined in 3.27,3.28,3.29,3.30

First we process relfile to find the counter name. Remove any extension, and replace all slash path separators with minus signs.

(Strictly speaking, the first sub is not required, but I've left it in, as it does no harm.)
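
As a worked example of the naming rule, the following Python 3 sketch derives a counter name from a relative file name. The function name counter_name and the sample paths are assumptions made for the example, and the extension pattern is anchored to the end of the name in this sketch.

```python
import re

def counter_name(relfile, counters_dir):
    """Strip the extension and flatten path separators into minus signs."""
    name = re.sub(r'^/', '', relfile)
    name = re.sub(r'\.(xml|html)$', '', name)
    return counters_dir + name.replace('/', '-')

print(counter_name('research/papers/index.xml', '/home/ajh/local/dimboola/counters/'))
# /home/ajh/local/dimboola/counters/research-papers-index
```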

<update counter 3.28> =

newCounterStr='<?xml version="1.0"?>\n'
newCounterStr+='<counter><value>0</value><date>%s</date></counter>' % todayStr
try:
  counterFile=open(counterName,'r')
  dom=xml.dom.minidom.parse(counterFile)
  counterFile.close()
except IOError:
  dom=xml.dom.minidom.parseString(newCounterStr)
except xml.parsers.expat.ExpatError:
  dom=xml.dom.minidom.parseString(newCounterStr)
except:
  print "Unexpected error:", sys.exc_info()[0]
  raise

Chunk referenced in 3.7
Chunk defined in 3.27,3.28,3.29,3.30

Now try to read the counter XML file. The file may not exist if this is the first time we have accessed this page since this mechanism was set up, so we must capture that error, and any error arising from attempting to parse the XML, and create a new counter file, with value initialised to zero, and date initialised to today's date.

<update counter 3.29> =

# now extract count field and update it
countNode=dom.getElementsByTagName('value')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
  textNode=countNode.firstChild
  if textNode.nodeType == xml.dom.Node.TEXT_NODE:
    text=textNode.nodeValue.strip()
    countVal=int(text)
    countVal=countVal+1
    textNode.nodeValue="%d" % (countVal)
countDate='(unknown)'
countNode=dom.getElementsByTagName('date')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
  textNode=countNode.firstChild
  if textNode.nodeType == xml.dom.Node.TEXT_NODE:
    countDate=textNode.nodeValue.strip()

Chunk referenced in 3.7
Chunk defined in 3.27,3.28,3.29,3.30

<update counter 3.30> =

# write updated counter document
try:
  counterFile=open(counterName,'w')
except IOError:
  counterName='/home/ajh/local/localhost/counters/index'
  counterFile=open(counterName,'w')
domString=dom.toxml()
counterFile.write(domString)
counterFile.close()

Chunk referenced in 3.7
Chunk defined in 3.27,3.28,3.29,3.30
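
The read, increment and write cycle of the four counter chunks can be condensed into one function. This is a Python 3 sketch under the same file format; the function name bump_counter is hypothetical, and the fallback counter file used by the chunk above when the write fails is omitted.

```python
import os
import tempfile
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def bump_counter(path, today):
    """Increment the counter stored at path, creating it if need be."""
    fresh = ('<?xml version="1.0"?>\n'
             '<counter><value>0</value><date>%s</date></counter>' % today)
    try:
        dom = xml.dom.minidom.parse(path)
    except (IOError, ExpatError):            # missing or unparsable file: start afresh
        dom = xml.dom.minidom.parseString(fresh)
    value = dom.getElementsByTagName('value')[0].firstChild
    count = int(value.nodeValue.strip()) + 1
    value.nodeValue = str(count)
    date = dom.getElementsByTagName('date')[0].firstChild.nodeValue.strip()
    with open(path, 'w') as f:
        f.write(dom.toxml())
    return count, date

# Demonstration: the date recorded on first use is retained thereafter.
counter_path = os.path.join(tempfile.mkdtemp(), 'index')
first_hit = bump_counter(counter_path, '03 Dec 2009')
second_hit = bump_counter(counter_path, '04 Dec 2009')
print(first_hit, second_hit)
```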

3.14 Process File

<process file 3.31> =

# define the parameters to the translation
filestat=os.stat(requestedFile)
filemod=filestat.st_mtime
dtfilemod=datetime.datetime.fromtimestamp(filemod)
dtstring=dtfilemod.strftime("%Y%m%d:%H%M")
parms=""
parms+="--param xmltime  \"'%s'\" " % (dtstring)
parms+="--param htmltime \"'%s'\" " % (tsstring)
parms+="--param filename \"'%s'\" " % (filename)
parms+="--param relcwd   \"'%s'\" " % (relcwd)
parms+="--param URL      \"'%s'\" " % (URL)
parms+="--param today    \"'%s'\" " % (todayStr)
parms+="--param host     \"'%s'\" " % (host)
parms+="--param server   \"'%s'\" " % (server)
parms+="--param base     \"'%s'\" " % (BASE)
for key in form:
  value=form[key][0]
  parms+="--param "+key+" \"'%s'\" " % (value)
if debug:
  sys.stderr.write("%s: xml file modified at %s\n" % (tsstring,dtstring))

Chunk referenced in 3.7
Chunk defined in 3.31,3.32

<process file 3.32> =

if returnXML:
  rawxmlf=open(requestedFile,'r')
  print "<PRE>\n"
  for line in rawxmlf.readlines():
    print cgi.escape(line)
  print "</PRE>\n"
elif alreadyHTML or cachedHTML:
  <render the HTML file 3.33>
else:
  <process an XML file 3.34,3.35,3.36>

Chunk referenced in 3.7
Chunk defined in 3.31,3.32

Decide what to do with the file. There are three choices:

a) Return the raw XML. This means escaping all the active characters, and printing the file verbatim.
b) The file is HTML, either because of an explicit HTML request, or because a previously translated, cached HTML file has been found. Again, the file is rendered verbatim, this time without escaping the active characters.
c) It is an XML file, and it needs translation. Call the XSLT processor to do that (chunk <process an XML file 3.34,3.35,3.36>).
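
The three-way decision described above can be sketched as a small function. This is an illustration under assumptions, written for Python 3: the names `choose_action` and `escape_line` and the returned labels are hypothetical, and `html.escape` stands in for the `cgi.escape` call used in the Python 2 listing.

```python
import html

def choose_action(return_xml, already_html, cached_html):
    """Return which of the three branches applies, in the order they are tested."""
    if return_xml:
        return "raw-xml"          # show the source, with active characters escaped
    if already_html or cached_html:
        return "render-html"      # copy the HTML through unchanged
    return "translate-xml"        # hand the file to the XSLT processor

def escape_line(line):
    """Escape &, < and > so that markup is displayed rather than interpreted."""
    return html.escape(line, quote=False)

print(choose_action(False, False, True))
print(escape_line("<a href='x'>R&D</a>"))
```

Note that the raw XML request takes priority: it is tested first, so it applies even when a cached HTML version exists.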

<render the HTML file 3.33> =

rawHTMLf=open(requestedFile,'r')
for line in rawHTMLf.readlines():
  print line,
print '<P><SPAN STYLE="font-size:80%">'  # no % formatting on this line, so a single % is needed
print '%d accesses since %s, ' % (countVal,countDate)
print 'HTML cache rendered at %s</SPAN>' % (dtstring)
if cachedHTML:
  os.utime(requestedFile,None) # touch the file

Chunk referenced in 3.32

Note that each line from the HTML file is printed without additional line breaks.

3.14.1 Process an XML File

<process an XML file 3.34> =

# start a pipe to process the XSLT translation
cmd=XSLTPROC+" --xinclude %s%s %s " % (parms,xslfile,requestedFile)
#(pipein,pipeout,pipeerr)=os.popen3(cmd)
pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True)
(pipeout,pipeerr)=(pid.stdout,pid.stderr)
if debug:
  cwd=os.getcwd()
  print "<p>%s: (cwd:%s) %s</p>" % (tsstring,cwd,cmd)
  sys.stderr.write("(cwd:%s) %s: %s\n" % (cwd,tsstring,cmd))

# report the fact, and the context (debugging purposes)
if debug:
  print "%s: converting %s with %s\n" % (tsstring,requestedFile,xslfile)

Chunk referenced in 3.32
Chunk defined in 3.34,3.35,3.36

Run the pipe to perform the translation. This step takes an inordinate amount of time on some servers (sequoia in particular), which is what prompted the inclusion of the caching mechanism.
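
The pipe pattern can be sketched on its own as follows. This is a hedged illustration written for Python 3, not the code of index.py: the function name `run_translator` is hypothetical, and a child Python process stands in for xsltproc so that the example runs anywhere. It uses `communicate()`, which drains stdout and stderr together and so avoids blocking if the translator writes a large amount to one stream while the other is being read.

```python
import subprocess
import sys

def run_translator(argv):
    """Run a command and return (stdout_lines, stderr_lines, exit_status)."""
    proc = subprocess.Popen(argv,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, err = proc.communicate()   # waits for exit, reading both pipes
    # splitlines(True) keeps each line's own newline, as readlines() does
    return out.splitlines(True), err.splitlines(True), proc.returncode

# Stand-in command (an assumption): a child process writing to both streams
child = "import sys; sys.stdout.write('<p>ok</p>\\n'); sys.stderr.write('warning: demo\\n')"
out_lines, err_lines, status = run_translator([sys.executable, "-c", child])
print(out_lines, err_lines, status)
```

Passing the command as a list rather than a single shell string also means that no shell quoting of the arguments is needed.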

<process an XML file 3.35> =

# process the converted HTML
convertfn="/home/ajh/www/tmp/convert.html"
if convertXML:
  try:
    htmlfile=open(convertfn,'w')
  except:
    msg="couldn't open HTML conversion file %s" % convertfn
    sys.stderr.write("%s: %s\n" % (tsstring,msg))
    convertXML=False

Chunk referenced in 3.32
Chunk defined in 3.34,3.35,3.36

<process an XML file 3.36> =

# check that directory exists
dirpath=os.path.dirname(htmlpath)
if not os.path.isdir(dirpath):
  os.makedirs(dirpath,0777)
htmlfile2=open(htmlpath,'w')
for line in pid.stdout.readlines():
  print line,
  htmlfile2.write(line)
  if convertXML:
    htmlfile.write(line)  # line already ends with its own newline
if convertXML:
  htmlfile.close()
htmlfile2.close()
os.chmod(htmlpath,0666)
<deal with any conversion errors 3.37>  
pipeout.close(); pipeerr.close()

Chunk referenced in 3.32
Chunk defined in 3.34,3.35,3.36

Note that in copying the rendered HTML version, we retain the lines as is, and make sure that they are rendered without any additional (or deleted) new lines.
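
A minimal sketch of this copying rule, written for Python 3 with in-memory files standing in for the pipe, the browser output and the cache file (the name `copy_lines` is hypothetical):

```python
import io

def copy_lines(source, *sinks):
    """Write every line of source to each sink exactly as it was read."""
    count = 0
    for line in source:
        for sink in sinks:
            sink.write(line)    # the line carries its own newline; add nothing
        count += 1
    return count

source = io.StringIO("<html>\n<body>hello</body>\n</html>\n")
page, cache = io.StringIO(), io.StringIO()
copied = copy_lines(source, page, cache)
print(copied, page.getvalue() == cache.getvalue())
```

In the Python 2 listing the same effect is obtained with the trailing comma in `print line,`, which suppresses the newline that print would otherwise append.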

3.14.1.1 Deal with any Conversion Errors

<deal with any conversion errors 3.37> =

errs=[]
for line in pipeerr.readlines():
  errs.append(line)
logfile=PRIVATE+'/xmlerror.log'
logfiled=open(logfile,'a')
if errs:
  logfiled.write("%s: ERROR IN REQUEST: %s\n" % (tsstring,requestedFile))
  print "<HR/>\n"
  print "<H3>%s: MESSAGES GENERATED BY: %s</H3>\n" % (tsstring,requestedFile)
  print "<PRE>"
  for errline in errs:
    #logfiled.write("%s: %s" % (tsstring,errline))
    errline=cgi.escape(errline)
    errline=errline.rstrip()
    print "%s: %s" % (tsstring,errline)
  print "</PRE>"
  print "<p>Please forward these details to "
  print "<a href='mailto:ajh@csse.monash.edu.au'>John Hurst</a>"
else:
  logfiled.write("%s: NO ERRORS IN %s\n" % (tsstring,requestedFile))
logfiled.close()

Chunk referenced in 3.36

4. File Caching

4.1 The File Cache Module

"filecache.py" 4.1 =

"""A module that writes a webpage to a file so it can be restored at a later time
Interface:
filecache.write(...)
filecache.read(...)
"""

import time
import os
import md5
import urllib

def key(url):
	k = md5.new()
	k.update(url)
	return k.hexdigest()

def filename(basedir, url):
	return "%s/%s.txt"%(basedir, key(url))


def write(url, basedir, content):
	"""Write content to cache file in basedir for url"""
	cachefilen=filename(basedir, url)
	fh = file(cachefilen, mode="w")
	fh.write(content)
	fh.close()
	return cachefilen

def read(url, basedir, timeout):
	"""Read cached content for url in basedir if it is fresher
	than timeout (in seconds)"""
	cache=0
	fname = filename(basedir, url)
	content = ""
	if os.path.exists(fname) and \
	   (os.stat(fname).st_mtime > time.time() - timeout):
		fh = open(fname, "r")
		content = fh.read()
		fh.close()
		cache=1
	return (content,cache)

This code was adapted from an example given on a web page. Sorry, I have forgotten the reference.
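
For readers on a current Python, the same idea can be sketched as below. This is an adaptation under assumptions, not the module above: the `md5` module used in the listing no longer exists in Python 3, so `hashlib` is used instead, and the function names `cache_filename`, `cache_write` and `cache_read` and the example URL are hypothetical.

```python
import hashlib
import os
import tempfile
import time

def cache_filename(basedir, url):
    """Map a URL to a cache file named after the MD5 hex digest of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(basedir, digest + ".txt")

def cache_write(url, basedir, content):
    """Write content to the cache file for url and return its path."""
    path = cache_filename(basedir, url)
    with open(path, "w") as fh:
        fh.write(content)
    return path

def cache_read(url, basedir, timeout):
    """Return (content, 1) if the cache file is fresher than timeout seconds,
    otherwise ("", 0)."""
    path = cache_filename(basedir, url)
    if os.path.exists(path) and os.stat(path).st_mtime > time.time() - timeout:
        with open(path, "r") as fh:
            return (fh.read(), 1)
    return ("", 0)

basedir = tempfile.mkdtemp()
url = "http://example.org/page.xml"          # illustrative URL only
path = cache_write(url, basedir, "<p>cached</p>")
fresh = cache_read(url, basedir, 60)         # just written, so still fresh
old = time.time() - 3600
os.utime(path, (old, old))                   # pretend it was written an hour ago
stale = cache_read(url, basedir, 60)
print(fresh, stale)
```

Freshness is judged purely from the file's modification time, so touching a cache file extends its life without rewriting it.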

4.2 Clearing the Cache

This program clears the HTML caches created by the previous module. It is called independently, and can clear either the entire cache, or subdirectories of it.

The cache is maintained on a per-machine basis, and the machine being used is identified by a hostname call.
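
The listing for clearWebCache.py is not reproduced in this document, so the following is not that program. It is only a rough sketch of the behaviour described above, written for Python 3 and assuming a hypothetical layout in which each machine's cache lives in a directory named after its hostname under a common cache root. The name `clear_cache` and its parameters are likewise assumptions.

```python
import os
import socket
import tempfile

def clear_cache(cache_root, subdir="", host=None):
    """Remove cached files for this host, either everything or one subdirectory.

    Assumed layout (hypothetical): cache_root/<hostname>/<subdir>/...
    Returns the number of files removed; directories are left in place.
    """
    host = host or socket.gethostname()
    target = os.path.join(cache_root, host, subdir)
    removed = 0
    for dirpath, dirnames, filenames in os.walk(target):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
            removed += 1
    return removed

# Build a throwaway cache with two subdirectories, then clear it in two steps
root = tempfile.mkdtemp()
for sub in ("research", "teaching"):
    os.makedirs(os.path.join(root, "demo-host", sub))
    with open(os.path.join(root, "demo-host", sub, "page.html"), "w") as fh:
        fh.write("<p>cached</p>")
removed_sub = clear_cache(root, "research", host="demo-host")
removed_all = clear_cache(root, host="demo-host")
print(removed_sub, removed_all)
```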

"clearWebCache.py" 4.2 =

5. The Makefile

The Makefile handles the nitty-gritty of copying files to the right places, and setting permissions, etc.

<install python file 5.1> =

install.machine: /tmp/index-machine.py
	rsync -auv /tmp/index-machine.py address:homedir/www/cgi-bin/index.py
/tmp/index-machine.py: index.py
	sed -e 's#/sw/bin/python#interpreter#' <index.py >/tmp/index-machine.py

Chunk referenced in 5.2

"Makefile" 5.2 =

RELCWD       = /cgi-bin/
WEBPAGE      = /home/ajh/www/research/literate
FILES        = $(EMPTY)
XSLLIB       = /home/ajh/lib/xsl
XSLFILES     = $(XSLLIB)/lit2html.xsl $(XSLLIB)/tables2html.xsl
INSTALLFILES = index.py
CGIS         = $(INSTALLFILES)
XMLS         = $(EMPTY)
DIRS         = $(EMPTY)

include $(HOME)/etc/MakeXLP
include $(HOME)/etc/MakeWeb

index.py: web.tangle
	chmod 755 index.py
	touch index.py

web.tangle web.xml: web.xlp
	xsltproc --xinclude -o web.xml $(XSLLIB)/litprog.xsl web.xlp
	touch web.tangle
web.html: web.xml $(XSLFILES)
	xsltproc --xinclude $(XSLLIB)/lit2html.xsl web.xml >web.html

install: install.murtoa

web:    $(WEBPAGE)/web.html
$(WEBPAGE)/web.html: web.html
	cp -p web.html $(WEBPAGE)/web.html

Makefile: web.tangle

install.murtoa: index.py
	sed -e 's#/sw/bin/python#/sw/bin/python2.6#' <index.py >/Users/ajh/www/cgi-bin/index.py

<install python file 5.1>(machine='dimboola', address='dimboola', interpreter='/sw/bin/python2.6', homedir='/Users/ajh')
<install python file 5.1>(machine='bendigo', address='bendigo', interpreter='/sw/bin/python', homedir='/Users/ajh')
<install python file 5.1>(machine='sequoia', address='sequoia', interpreter='/usr/X11R6/bin/python', homedir='/home/ajh')
<install python file 5.1>(machine='eregnans', address='eregnans', interpreter='/usr/bin/python', homedir='/home/ajh')
<install python file 5.1>(machine='rainbow', address='rainbow', interpreter='/sw/bin/python', homedir='/home/ajh')
<install python file 5.1>(machine='csse', address='nexus', interpreter='/usr/monash/bin/python', homedir='/home/ajh')

The install.machine targets, one for each expansion of <install python file 5.1>, cater for the different interpreter paths required on each of the servers installed by the Makefile.
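
The effect of the sed command in each install target can be sketched in Python as follows. This is an illustration only: the function name `retarget_interpreter` is hypothetical, and the two-line script is an assumed stand-in for index.py, whose first line is taken here to name `/sw/bin/python`.

```python
def retarget_interpreter(lines, interpreter, placeholder="/sw/bin/python"):
    """Replace the first occurrence of placeholder on each line,
    which is what sed's s#old#new# does without the g flag."""
    return [line.replace(placeholder, interpreter, 1) for line in lines]

# Assumed stand-in for the start of index.py
script = ["#!/sw/bin/python\n", "import os\n"]
retargeted = retarget_interpreter(script, "/usr/bin/python")
print("".join(retargeted), end="")
```

Each machine then receives a copy of the script whose interpreter path matches its own installation, while the master copy of index.py is left unchanged.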

Note that the Makefile has not yet been updated to install the filecache module.

6. TODOs

20081111:103026: footer links filename is incorrect when a .html URL is used. Need to check calculation of filename.

7. Indices

7.1 Files

File Name	Defined in
Makefile	5.2
clearWebCache.py	4.2
filecache.py	4.1
index.py	3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7

7.2 Chunks

Chunk Name	Defined in	Used in
check for HTML request	3.20, 3.21	3.7
check for abbreviated URL	3.18	3.7
check for and handle jpegs	3.12	3.4
check if xslfile more recent than cached version	3.25	3.24
collect HTTP request	3.15	3.6
current date	.2
current version	.1	3.1
deal with any conversion errors	3.37	3.36
define global variables and constants	3.9, 3.10, 3.11	3.4
define various string patterns	3.8	3.5
determine the host and server environments	3.22	3.4
determine xslt file	3.26	3.7
get default XSLT file	3.23	3.7
get filename from redirect	3.17	3.6
get jpg file from remote server	3.13, 3.14	3.12
handle redirect query string	3.16	3.15
install python file	5.1	5.2, 5.2, 5.2, 5.2, 5.2, 5.2
make file and dir absolute	3.19	3.7
process an XML file	3.34, 3.35, 3.36	3.32
process file	3.31, 3.32	3.7
render the HTML file	3.33	3.32
scan for locally defined XSLT file	3.24	3.7
update counter	3.27, 3.28, 3.29, 3.30	3.7

7.3 Identifiers

Identifier	Defined in	Used in
BASE
HOME
HTMLS	3.10	3.21
HTMLS		3.21
PRIVATE
alreadyHTML	3.9	3.20, 3.21, 3.24, 3.32
cachedHTML	3.9	3.21, 3.24, 3.32, 3.33
convertXML	3.9	3.16, 3.35, 3.35, 3.36, 3.36
debug	3.9	3.7, 3.15, 3.16, 3.16, 3.17, 3.23, 3.23, 3.23, 3.24, 3.24, 3.26, 3.26, 3.26, 3.31, 3.34, 3.34
htmlpat	3.8	3.17, 3.20
requestedFile	3.15	3.7, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.18, 3.18, 3.18, 3.19, 3.19, 3.19, 3.20, 3.20, 3.20, 3.21, 3.21, 3.21, 3.21, 3.21, 3.21, 3.24, 3.24, 3.31, 3.32, 3.33, 3.33, 3.34, 3.34, 3.37, 3.37, 3.37
returnXML	3.9	3.16, 3.32

Document History

20080816:144135	ajh	1.0.0	first version under literate programming
20080817:131040	ajh	1.0.1	general restructuring
20080822:162138	ajh	1.0.2	more restructuring
20081102:164507	ajh	1.1.0	added jpg handling and caching
20081106:134033	ajh	1.1.1	added exception handling
20090507:160328	ajh	1.2.0	bug relating to non-~ajh directories fixed; caching of files implemented.
20090701:175934	ajh	1.3.0	added code to cache the converted HTML file. Still to do: creation of subdirectories for cached files.
20090702:182341	ajh	1.3.1	Subdirectories now created. Cached file is touched on each access
20090703:105343	ajh	1.3.2	some literate tidy ups, and renamed variable `file` to requestedFile to disambiguate it.
20091203:093814	ajh	1.3.3	updated Makefile to install python interpreter dependent files

<current version .1> = 1.3.3

Chunk referenced in 3.1

<current date .2> = 20091203:093814
