module Weblogs: sig .. end
Import and parse log files, detect search engines, etc.
Copyright (C) 2005-2006 Merjis Ltd (http://www.merjis.com/).
$Id: Weblogs.html,v 1.4 2006/07/19 09:44:38 rich Exp $
type row = {
  src_ip : (* Remote IP address. *)
  remote_username : (* HTTP authentication username. *)
  username : (* Username field. *)
  t : (* Date and time of record. *)
  http_method : (* HTTP method. *)
  full_url : (* Complete URL and query string. *)
  url : (* URL only. *)
  qs : (* Query string, unparsed. *)
  args : (* Query string, parsed. *)
  http_version : (* "HTTP/x.y" *)
  rcode : (* HTTP response code. *)
  size : (* Response size. *)
  referer : (* HTTP Referer field. *)
  user_agent : (* HTTP User-Agent field. *)
  clickstream : (* User-tracking cookie - see note. *)
  server_sitename : (* Server sitename (IIS). *)
  server_computername : (* Server name (IIS). *)
  server_ip : (* Server IP (IIS). *)
  server_port : (* Server port (IIS). *)
  time_taken : (* Time taken (IIS). *)
  proxy : (* Proxy - see note. *)
  original : (* Original webserver log line. *)
  filename : (* Original source filename. *)
  lineno : (* Original source line. *)
}
The clickstream field either comes from the user-tracking cookie
if you have one, or else is synthesized by hashing the user-agent
and the proxy or source IP addresses together. The clickstream
field can be used to provide a semi-reliable method of detecting
visitors.
Some large proxy services (notably AOL) generate requests for a single
user from a round robin of different IP addresses. To accommodate these
users, we map the actual src_ip to a proxy name, stored in the proxy
field. If the src_ip doesn't match any of the known proxies, then proxy
will be None. The proxy, if known, will be used for clickstream
creation, instead of the src_ip.
See also: http://webmaster.info.aol.com/proxyinfo.html and the
file weblogs_proxies.ml.
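The fallback described in the two notes above can be sketched as follows. This is only an illustration, not the library's implementation: the mini_row type, its cookie field, the clickstream_of function and the choice of MD5 (via the standard Digest module) are all assumptions made for the example.

```ocaml
(* Hypothetical, simplified stand-in for the fields of [row] used here.
   The [cookie] field is an assumption; the real record has no such field. *)
type mini_row = {
  src_ip : string;
  user_agent : string;
  proxy : string option;
  cookie : string option;  (* user-tracking cookie, if the site sets one *)
}

(* Use the cookie when present; otherwise hash the user-agent together
   with the proxy name if known, or else the source IP address. *)
let clickstream_of r =
  match r.cookie with
  | Some c -> c
  | None ->
    let origin = match r.proxy with Some p -> p | None -> r.src_ip in
    (* The NUL separator keeps the two parts from running together. *)
    Digest.to_hex (Digest.string (r.user_agent ^ "\000" ^ origin))
```

Two requests from the same browser that arrive through the same named proxy get the same identifier even when their source IP addresses differ, which is the point of the proxy mapping.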
type http_method =
| GET
| POST
| HEAD
| OTHER of
(* HTTP methods. *)
val string_of_row : row -> string
row.original field).
type t = row list
val import_file : ?filter:(row -> bool) -> string -> t
The optional ~filter parameter may be given to apply arbitrary
filtering to rows which are loaded into memory - for example to
only load rows between certain dates.
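A date-range predicate of the kind mentioned above might look like the sketch below. The type of the t field is not shown on this page, so a float Unix timestamp is assumed; mini_row, between and load_filtered are hypothetical names, and List.filter merely stands in for the filtering that import_file performs while loading.

```ocaml
(* Hypothetical stand-in for [row]; a float timestamp is assumed for [t]. *)
type mini_row = { t : float; url : string }

(* Build a predicate that accepts rows with lo <= t < hi. *)
let between lo hi = fun r -> r.t >= lo && r.t < hi

(* Stand-in for the effect of ~filter: keep only the accepted rows. *)
let load_filtered ~filter rows = List.filter filter rows
```

With the real module, a predicate of type row -> bool would be passed as the ~filter argument of import_file.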
val sort : t -> t
val exclude_local : ?filename:string -> t -> t
$HOME/.weblogs.exclude which should contain one IP
address per line. (It's often a good idea to put 127.0.0.1
in this file). Then call this function on your logfile rows.
All IP addresses which appear in the file are deleted from the
rows returned.
You can override the default filename ($HOME/.weblogs.exclude)
by supplying an optional ~filename parameter.
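The behaviour described above can be approximated in a few lines. This is a sketch under stated assumptions rather than the library's code: read_excluded and exclude are hypothetical helpers, blank lines are skipped as a convenience, and rows are modelled as (src_ip, original) pairs instead of the full row record.

```ocaml
(* Read one IP address per line from [filename], skipping blank lines. *)
let read_excluded filename =
  let ic = open_in filename in
  let rec loop acc =
    match input_line ic with
    | line ->
      let ip = String.trim line in
      loop (if ip = "" then acc else ip :: acc)
    | exception End_of_file -> close_in ic; acc
  in
  loop []

(* Drop every row whose source IP appears in [ips].
   Rows are modelled here as (src_ip, original) pairs. *)
let exclude ips rows =
  List.filter (fun (ip, _) -> not (List.mem ip ips)) rows
```

For a long exclusion list, a Hashtbl or Set lookup would avoid the linear List.mem scan on every row.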
type referer_class =
| KnownSearchEngine of
| KnownDirectory of
| KnownEmailService of
| Other
(* Class of a referer. *)
val referer_class : row -> referer_class
Referer field which is
set by the client and may be spoofed. (3) A given record might
conceivably fall into several different classes.
See also: Weblogs.normalise_query
val is_known_bot : row -> (string * float) option
User-Agent field and the IP
address, see if this is a known bot.
Return None if not, or Some (name, confidence) where
name is the name of the bot.
The float value returned is the degree of confidence in the
positive result, between 0. (no confidence at all) and 1.
(certainty). The currently defined levels are:
0.9 - User-Agent and IP address are both consistent with a
known bot.
0.5 - User-Agent indicates a known bot, but from an unexpected
IP address.
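A caller will typically compare the returned confidence against a threshold of its own choosing. The sketch below works on any value of the shape (name, confidence) option; treat_as_bot, describe and the default threshold of 0.5 are hypothetical and not part of the module.

```ocaml
(* Decide whether to treat a row as bot traffic, given a result of the
   shape returned by is_known_bot and a caller-chosen threshold. *)
let treat_as_bot ?(threshold = 0.5) result =
  match result with
  | Some (_name, confidence) -> confidence >= threshold
  | None -> false

(* Render the result for a report line. *)
let describe = function
  | None -> "not a known bot"
  | Some (name, c) -> Printf.sprintf "%s (confidence %.1f)" name c
```

Raising the threshold to 0.8, for instance, would accept only matches where both the User-Agent and the IP address are consistent with a known bot.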
val is_web_browser : row -> string option
User-Agent field, see if the
record corresponds to a known web browser.
Return None if not, or Some name where name is the name of
the browser.
The current list of web browsers is very incomplete. See
web_browsers.csv.
val normalise_query : string -> string
What this currently does:
Converts the string to lowercase in a UTF-8-sensitive way.
Removes whitespace at the beginning and end of the query.
Converts any run of whitespace in the middle of the query into a single space.
What it should do in future, but doesn't do right now:
UTF-8 normalisation.
Mapping of alternate codepoints to normal forms (e.g. Japanese single-width katakana to ordinary katakana).
See also: Weblogs.referer_class
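The three steps that normalise_query currently performs can be illustrated with the sketch below. It is ASCII-only, which is a deliberate simplification: the real function lowercases in a UTF-8-sensitive way, whereas this version passes non-ASCII bytes through unchanged. The names is_space and normalise_ascii are hypothetical.

```ocaml
(* Whitespace characters recognised by this sketch. *)
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Lowercase ASCII letters, trim both ends, and collapse each interior
   run of whitespace into a single space. *)
let normalise_ascii s =
  let buf = Buffer.create (String.length s) in
  let pending_space = ref false in
  String.iter
    (fun c ->
      if is_space c then begin
        (* Remember a separator only once some text has been emitted,
           so leading whitespace is dropped. *)
        if Buffer.length buf > 0 then pending_space := true
      end else begin
        (* Emit the separator lazily, so trailing whitespace is dropped. *)
        if !pending_space then Buffer.add_char buf ' ';
        pending_space := false;
        Buffer.add_char buf (Char.lowercase_ascii c)
      end)
    s;
  Buffer.contents buf
```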
type visitor = row list
val detect_visitors : t -> visitor list
val import_rows : string -> (row -> unit) -> unit
import_rows filename f imports the logfile one row at a time,
passing each row to function f for processing.
The advantage is that because the file is not loaded into memory
this function can handle very large logfiles. The disadvantage
is that sorting and visitor detection are hard-to-impossible,
but see Weblogs.start_visitors below.
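Because import_rows only hands each row to a callback, any result has to be accumulated by the callback itself. One common pattern is a closure over a hash table, sketched below; make_counter is a hypothetical helper, and in a real program the callback would be invoked by import_rows rather than by hand.

```ocaml
(* Return a hash table together with a callback that tallies how many
   times each key has been seen. Memory use grows with the number of
   distinct keys, not with the number of rows processed. *)
let make_counter () =
  let counts = Hashtbl.create 16 in
  let f key =
    let n = try Hashtbl.find counts key with Not_found -> 0 in
    Hashtbl.replace counts key (n + 1)
  in
  (counts, f)
```

With the real module, the callback passed to import_rows would extract a field from each row (for example its url) and hand it to f.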
type visitors_handle
val start_visitors : ?directory:string -> unit -> visitors_handle
val open_visitors : ?directory:string -> unit -> visitors_handle
val import_visitor_row : visitors_handle -> row -> unit
val finalise_visitors : visitors_handle -> unit
val count_visitors : visitors_handle -> int
val iter_visitors : visitors_handle -> (visitor -> unit) -> unit
Weblogs.import_file
followed by Weblogs.detect_visitors, but it can handle
pretty much unlimited amounts of data because it uses the disk
as an intermediate store.
To import log files and perform visitor analysis on them at the same time, do:
  let vh = start_visitors () in
  import_rows filename (import_visitor_row vh);
  finalise_visitors vh;
(You only need to do this import step once for a given log file.
After that you can use open_visitors () to get the visitors_handle
even from another program).
You can then use count_visitors vh to find the number of visitors
or iter_visitors vh f to iterate function f over the visitors.
It is implemented by creating lots of disk files, one per visitor.
These are placed in a directory, normally $TMPDIR/visitors
or /tmp/visitors, but which you can control by passing the
~directory option to start_visitors or open_visitors.
Note that the data stored in visitors_handle doesn't give a complete
picture of the contents of the original log files. In particular
bot entries and some other bot-like rows will be discarded.
finalise_visitors is important. Do not try to open_visitors on
a directory which has not yet had finalise_visitors called.
The implementation uses Marshal, and so all the usual provisos
for that module apply here.
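To make the Marshal provisos concrete, here is a minimal round trip through a disk file. It is a sketch only and says nothing about the library's actual file layout: mini_row, save and load are hypothetical. The main points are that Marshal is not type-safe, so the reader must state the expected type, and that the data should be read by a program whose type definitions match those of the writer.

```ocaml
(* Hypothetical, simplified stand-in for [row]. *)
type mini_row = { src_ip : string; url : string }

(* Write a list of rows to [filename] in Marshal's binary format. *)
let save filename (v : mini_row list) =
  let oc = open_out_bin filename in
  Marshal.to_channel oc v [];
  close_out oc

(* Read the rows back. The type annotation is essential: Marshal cannot
   check it, and a wrong annotation can crash the program. *)
let load filename : mini_row list =
  let ic = open_in_bin filename in
  let v = (Marshal.from_channel ic : mini_row list) in
  close_in ic;
  v
```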