Module Weblogs


module Weblogs: sig .. end
Import and parse log files, detect search engines, etc. Copyright (C) 2005-2006 Merjis Ltd (http://www.merjis.com/). $Id: Weblogs.html,v 1.4 2006/07/19 09:44:38 rich Exp $


type row = {
   src_ip : string; (*Remote IP address.*)
   remote_username : string option; (*HTTP authentication username.*)
   username : string option; (*Username field.*)
   t : Calendar.t; (*Date and time of record.*)
   http_method : http_method; (*HTTP method.*)
   full_url : string; (*Complete URL and query string.*)
   url : string; (*URL only.*)
   qs : string option; (*Query string, unparsed.*)
   args : (string * string) list; (*Query string, parsed.*)
   http_version : string option; (*"HTTP/x.y"*)
   rcode : int option; (*HTTP response code.*)
   size : int option; (*Response size.*)
   referer : string option; (*HTTP Referer field.*)
   user_agent : string option; (*HTTP User-Agent field.*)
   clickstream : string; (*User-tracking cookie - see note.*)
   server_sitename : string option; (*Server sitename (IIS).*)
   server_computername : string option; (*Server name (IIS).*)
   server_ip : string option; (*Server IP (IIS).*)
   server_port : int option; (*Server port (IIS).*)
   time_taken : int option; (*Time taken (IIS).*)
   proxy : string option; (*Proxy - see note.*)
   original : string; (*Original webserver log line.*)
   filename : string; (*Original source filename.*)
   lineno : int; (*Original source line.*)
}
One row/record from a log file.

The clickstream field either comes from the user-tracking cookie if you have one, or else is synthesized by hashing the user-agent and the proxy or source IP addresses together. The clickstream field can be used to provide a semi-reliable method of detecting visitors.

Some large proxy services (notably AOL) generate requests for a single user from a round robin of different IP addresses. To accommodate these users, we map the actual src_ip to a proxy name, stored in the proxy field. If the src_ip doesn't match any of the known proxies, then proxy will be None. The proxy, if known, is used for clickstream creation instead of the src_ip. See also: http://webmaster.info.aol.com/proxyinfo.html and the file weblogs_proxies.ml.
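As a rough illustration of the synthesized case, the sketch below hashes the user agent together with the proxy name (if known) or else the source IP. The function name synth_clickstream, the separator byte and the choice of the standard Digest module are assumptions made for this example; the library's actual hashing scheme is not described here.

```ocaml
(* Hypothetical sketch, not the library's real implementation.
   Prefer the proxy name over the source IP, so that requests arriving
   through a round robin of proxy addresses share one clickstream key. *)
let synth_clickstream ~(user_agent : string option)
    ~(proxy : string option) ~(src_ip : string) : string =
  let origin = match proxy with Some p -> p | None -> src_ip in
  let ua = match user_agent with Some u -> u | None -> "" in
  (* The NUL separator keeps ("ab", "c") distinct from ("a", "bc"). *)
  Digest.to_hex (Digest.string (ua ^ "\000" ^ origin))
```

With this scheme, two requests with the same user agent and the same proxy name get the same key even when their src_ip values differ.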


type http_method =
| GET
| POST
| HEAD
| OTHER of string (*HTTP methods.*)
val string_of_row : row -> string
Display the row (the same as accessing the row.original field).
type t = row list 
List of records read from, e.g., a file.
val import_file : ?filter:(row -> bool) -> string -> t
Import a log file into the internal format. This function tries to guess whether the format is Apache "combined format" or IIS, and also uncompresses gzip or bzip2 files on the fly (but not ZIP files). Lines which cannot be parsed produce a warning and are ignored. If the file cannot be parsed at all, raises Invalid_argument. Returns a list of records in the same order that they were read from the file (which is NOT necessarily time order).

The optional ~filter parameter may be given to choose which rows are loaded into memory, for example to load only rows between certain dates.
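A filter is just a predicate on rows. The sketch below shows the shape of a date-range predicate using a cut-down stand-in record: the type, the float timestamp and the name between are assumptions for illustration only, since the real row stores its time as a Calendar.t.

```ocaml
(* Cut-down stand-in for Weblogs.row; the real record has many more
   fields and stores the time as a Calendar.t, not a float. *)
type row = { url : string; time : float }

(* Keep rows whose timestamp lies in the half-open interval [lo, hi). *)
let between (lo : float) (hi : float) (r : row) : bool =
  lo <= r.time && r.time < hi

let rows =
  [ { url = "/a"; time = 10. };
    { url = "/b"; time = 20. };
    { url = "/c"; time = 30. } ]

(* List.filter plays the role that ~filter plays during import. *)
let kept = List.filter (between 10. 30.) rows
```

With the real library, a predicate of type row -> bool written along these lines would be passed as the ~filter argument of import_file.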

val sort : t -> t
Sort the rows into time order.
val exclude_local : ?filename:string -> t -> t
This is a convenience function which can be used to exclude known/test/local IP addresses. To use it, first create a file called $HOME/.weblogs.exclude which should contain one IP address per line. (It's often a good idea to put 127.0.0.1 in this file.) Then call this function on your logfile rows. Rows whose IP address appears in the file are removed from the rows returned.

You can override the default filename ($HOME/.weblogs.exclude) by supplying an optional ~filename parameter.
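The effect can be pictured with the following sketch, which parses exclusion-file contents (one IP address per line) and drops matching rows. The stand-in row type and the names parse_exclusions and exclude are hypothetical, and skipping blank lines is an assumption rather than documented behaviour.

```ocaml
(* Cut-down stand-in for Weblogs.row, keeping only the field used here. *)
type row = { src_ip : string }

(* Split the file contents into trimmed, non-empty lines. *)
let parse_exclusions (contents : string) : string list =
  String.split_on_char '\n' contents
  |> List.map String.trim
  |> List.filter (fun line -> line <> "")

(* Drop every row whose source IP is in the exclusion list. *)
let exclude (excluded : string list) (rows : row list) : row list =
  List.filter (fun r -> not (List.mem r.src_ip excluded)) rows
```

For long exclusion lists a Hashtbl or Set lookup would scale better than List.mem; the list version is kept here for brevity.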


type referer_class =
| KnownSearchEngine of string * string
| KnownDirectory of string
| KnownEmailService of string
| Other (*Class of a referer:

KnownSearchEngine (s, q) means the referer field is a known search engine s where the user typed query terms q. The query terms are converted to UTF-8 encoding for you (where possible).

KnownDirectory d is a known directory named d.

KnownEmailService s means the referer field is a known email service s.

Other means none of the above.

See also: Weblogs.referer_class, Weblogs.normalise_query

*)
val referer_class : row -> referer_class
Attempt to classify the referer field. This classification is necessarily incomplete for three main reasons: (1) The internal lists of known search engines, etc., are far from complete and, since there are many thousands of such services on the Internet, may never be. (2) It relies on the HTTP Referer field, which is set by the client and may be spoofed. (3) A given record might conceivably fall into several different classes.

See also: Weblogs.normalise_query
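A result of this type is normally consumed by pattern matching. The sketch below repeats the type definition so that it stands alone; the function describe and the example strings are hypothetical and not part of the library.

```ocaml
(* Local copy of the variant documented above, so the sketch is
   self-contained. *)
type referer_class =
  | KnownSearchEngine of string * string
  | KnownDirectory of string
  | KnownEmailService of string
  | Other

(* Turn a classification into a short human-readable label. *)
let describe (c : referer_class) : string =
  match c with
  | KnownSearchEngine (engine, query) ->
      Printf.sprintf "search on %s for '%s'" engine query
  | KnownDirectory d -> "directory " ^ d
  | KnownEmailService s -> "email service " ^ s
  | Other -> "unclassified"
```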

val is_known_bot : row -> (string * float) option
Using the (possibly spoofed) HTTP User-Agent field and the IP address, see if this is a known bot.

Return None if not, or Some (name, confidence) where name is the name of the bot.

The float value returned is the degree of confidence in the positive result, between 0. (no confidence at all) and 1. (certainty). The currently defined levels are:

0.9 - User-Agent and IP address are both consistent with a known bot.

0.5 - User-Agent indicates a known bot, but from an unexpected IP address.
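Callers will usually want to compare the confidence against a threshold of their own choosing. A minimal sketch, in which the name is_probably_bot and the default threshold of 0.9 are assumptions for this example:

```ocaml
(* Decide whether to treat a row as a bot, given the result of a call
   with the shape returned by is_known_bot.  The threshold is a
   caller-side policy choice, not something the library prescribes. *)
let is_probably_bot ?(threshold = 0.9)
    (result : (string * float) option) : bool =
  match result with
  | Some (_name, confidence) -> confidence >= threshold
  | None -> false
```

Lowering the threshold to 0.5 would also treat rows with a bot User-Agent from an unexpected IP address as bots.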

val is_web_browser : row -> string option
Using the (possibly spoofed) HTTP User-Agent field, see if the record corresponds to a known web browser.

Return None if not, or Some name where name is the name of the browser.

The current list of web browsers is very incomplete. See web_browsers.csv.

val normalise_query : string -> string
This "normalises" a search query, so that two search queries which search engines would consider identical are actually identical.

What this currently does:

Converts the string to lowercase in a UTF-8-sensitive way.

Removes whitespace at the beginning and end of the query.

Converts any whitespace in the middle of the query into a single space.
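The three steps above can be sketched as follows. This is an ASCII-only approximation under a hypothetical name, normalise_ascii: the real function lowercases in a UTF-8-sensitive way, whereas this sketch leaves bytes outside A-Z unchanged and treats only space, tab, newline and carriage return as whitespace.

```ocaml
(* ASCII-only approximation of the normalisation steps; not the
   library's implementation. *)
let normalise_ascii (q : string) : string =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let buf = Buffer.create (String.length q) in
  (* True when whitespace was seen after some output and a single
     space is owed before the next non-space character. *)
  let pending = ref false in
  String.iter
    (fun c ->
      if is_space c then begin
        (* Leading whitespace is dropped: nothing is owed yet. *)
        if Buffer.length buf > 0 then pending := true
      end else begin
        if !pending then Buffer.add_char buf ' ';
        pending := false;
        Buffer.add_char buf (Char.lowercase_ascii c)
      end)
    q;
  (* A trailing owed space is never flushed, which trims the end. *)
  Buffer.contents buf
```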

What it should do in future, but doesn't do right now:

UTF-8 normalisation.

Map alternate codepoints to normal forms (e.g. Japanese half-width katakana to ordinary katakana).

See also: Weblogs.referer_class

type visitor = row list 
A "visitor" is a list of entries which have probably come from the same human / web browser source, sorted into time order.
val detect_visitors : t -> visitor list
Attempt to detect human visitors from the log file. This uses several heuristics to remove bots (see the source code). Note that what you're really seeing here (if you can trust the results at all) are web browsers. A single human might use several different browsers to see the same site, and be counted as more than one visitor. This does not take into account humans who log in and could be detected as coming from more than one browser.

Large logfiles

val import_rows : string -> (row -> unit) -> unit
import_rows filename f imports the logfile one row at a time, passing each row to function f for processing.

The advantage is that, because the file is not loaded into memory, this function can handle very large logfiles. The disadvantage is that sorting and visitor detection become difficult or impossible, but see Weblogs.start_visitors below.

type visitors_handle 
val start_visitors : ?directory:string -> unit -> visitors_handle
val open_visitors : ?directory:string -> unit -> visitors_handle
val import_visitor_row : visitors_handle -> row -> unit
val finalise_visitors : visitors_handle -> unit
val count_visitors : visitors_handle -> int
val iter_visitors : visitors_handle -> (visitor -> unit) -> unit
For really large logfiles where you want to do visitor analysis, use this interface. It is roughly equivalent to Weblogs.import_file followed by Weblogs.detect_visitors, but it can handle practically unlimited amounts of data because it uses the disk as an intermediate store.

To import log files and perform visitor analysis on them at the same time, do:

 let vh = start_visitors () in
 import_rows filename (import_visitor_row vh);
 finalise_visitors vh;

(You only need to do this import step once for a given log file. After that you can use open_visitors () to get the visitors_handle, even from another program.)

You can then use count_visitors vh to find the number of visitors or iter_visitors vh f to iterate function f over the visitors.

It is implemented by creating lots of disk files, one per visitor. These are placed in a directory, normally $TMPDIR/visitors or /tmp/visitors, but which you can control by passing the ~directory option to start_visitors or open_visitors.

Note that the data stored in visitors_handle doesn't give a complete picture of the contents of the original log files. In particular bot entries and some other bot-like rows will be discarded.

Calling finalise_visitors is essential. Do not call open_visitors on a directory for which finalise_visitors has not yet been called.

The implementation uses Marshal, and so all the usual provisos for that module apply here.
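The main proviso is that Marshal is not type-safe: the type given when reading data back is trusted, not checked. The sketch below illustrates this with a stand-in record; the type entry and the functions save and load are hypothetical and say nothing about the library's actual on-disk layout.

```ocaml
(* Stand-in record; the library's stored data is not documented here. *)
type entry = { url : string; size : int option }

(* Serialise a list of entries to a string with default flags. *)
let save (v : entry list) : string = Marshal.to_string v []

(* Read the value back.  The return type annotation is NOT verified at
   run time: reading data that was written with a different type
   definition would still compile, and may crash or misbehave. *)
let load (s : string) : entry list = Marshal.from_string s 0
```

In practice this means a visitors directory should be read back by a program built against the same type definitions that were used to write it.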