I needed to install awstats into an existing web installation recently, and finding the info needed for that was a bit annoying. The documentation I could find gets into the nitty gritty without giving you the big picture.
So here is the big picture for awstats. Because it is meant to be a “big picture”, I’m putting the configuration discusson last. I want to cover the overall view of how the system works before getting into configuration specifics.
Overview for awstats
awstats is a script for analyzing web server logs (it has been extended to analyze other types of logs like mail logs). It analyzes the logs, and stores the statistics, and you can see the results as graphs and charts on a web page. It is a venerable old tool (meaning it doesn’t quite fit into modern ways of handling log files, init scripts, script parameters or whatever), and also is designed to be lean so it can analyze quite large logfiles without bogging down the whole system (so the parser for the log lines is a bit simple and can get confused — this just means that line is thrown away but the rest of the file does get processed).
awstats.pl is a perl script. On my Debian system it got installed into /usr/lib/cgi-bin/awstats.pl. It can run as a cgi-bin script, but doesn’t have to.
After configuring, you use it in two stages:
- analyze the web server logs
- generate the results page.
In stage 1, you run
awstats.pl -update on the log file. This will produce a bunch of .txt files. There will be a .txt file for each time period (usually a month, but could be a year). There will generally be a .txt file series for each set of logfiles for a domain or virtualhost. If one log file spans two calendar months (say, covers Jan 28 — Feb 3), then it will produce two .txt files — one for January and one for February. When you process the next logfile (that might span Feb 3 — Feb 10), then no new .txt files will be created but the existing one for February will be updated.
Generally, the documentation assumes you will not be trying to “catch up” with your old log files. If you want to run your old log files through awstats, you will need to analyze them in chronological order, as awstats.pl is meant to run on the same logfiles over and over, and only process the new items since last session. It does this by storing a date and comparing each log record to the date to find out if it is old or new. I wrote a script that processes all the old log files in order (catchup.py).
Also, as far as I know, awstats doesn’t understand compressed files, so you will have to uncompress the logfiles before analyzing them. My script handles that too, but for that it needs write permission in the logfile directory.
The “chronological ordering” requirement implies that all the things that log to that log file better agree on the time. If one app is logging in local time (say -0500) and another is logging in UTC time, then generally only the records that are 5 hours later will be picked up by awstats.pl. The other records will be regarded as “corrupted” and ignored.
You can run this stage as a cgi script — but it can also be run by a cron job. Running it as a cron job means you don’t have to give your web server user permission to write to its DocumentRoot. Running it as a cgi script means you can see the very latest statistics (right up to the moment before you run the update) — but if you don’t do it often enough, you may miss analyzing some of the web server logs (eg, if they get rotated before you run awstats.pl on them). If that happens you have the relatively painful task of trying to fix the mess, or just abandoning stats for those months. You could run it as a cron job and still allow web users to run it as well, to avoid losing info when logs are rotated.
Once you have the web server logs digested into statistics in .txt files, then you can view the results. There are two ways to view the results:
- dynamically, via a cgi script
- statically, as pre-generated static html pages
To see the results dynamically, you need to configure your web server to call the cgi script.
To see the results statically, you need to make a place for the generated html, and then call
awstats.pl -output for each report you might want to have available. There are quite a few reports, and you need to do it for each time period as well. awstats supplies a script (in my Debian system it went here:
/usr/share/awstats/tools/awstats_buildstaticpages.pl) that will generate all the reports for a given time period (i.e. month) so you just have to loop over the months. And virtualhosts, if you’re doing it for more than one web server/domain name.
There are two things to configure with awstats: one is awstats itself (a config file for each “web site”) and one is the web server that you will use to view the results (if that is how you are going to view the results). Below, I discuss only configuration of awstats itself.
The awstats.pl script is configured with files in /etc/awstats/awstats.domainname.conf (again, this is for my Debian system). You would copy the awstats example conf file to a file with your domain name in the middle, eg:
cp /etc/awstats/awstats.conf /etc/awstats/awstats.sourcerer.ca.conf
And then edit the file to have the configuration you want.
awstats works best if you have a separate series of web server logfiles for each host for which you want graphs. If you have some virtualhosts, you might want to configure them each to have their own log files.
On my Debian system using apache2 for a web server, all the log files go into the same directory
/var/log/apache2. The catchup.py script can handle this — and it would be easy to make a set of cron commands that will each update a different virtual host. At the moment, I have all the stats files and static html files going into one directory — one for stats, one for all the static html files. Maybe I should have a directory per virtualhost for the html files, though — they are getting quite numerous. A directory per virtualhost means you can more easily apply different access policies to the different domains.
The things I changed in the awstats.conf file for my purposes were:
LogFile LogFormat SiteDomain HostAliases DirData
There are lots of other options, but customizing those was enough to get some charts to start wtih.
LogFile is used if you don’t specify -LogFile on the command line. The catchup script uses the -LogFile argument on the command line, but the cron jobs that keep the stats updated can probably use the most recent logfile name domain-access.log.
LogFormat — it’s important to match the LogFormat to the actual format that your logfiles are written in, or every line will be classified as corrupted. I used format 1 for my apache2 logfiles. awstats has 4 predefined log formats, or you can specify a custom log format in exquisite detail field by field.
SiteDomain is the name of your site as your web server knows it
HostAliases is a list of other names for “self” for the web server (for the domain being analyzed).
DirData is the directory where the statistical output will go (all the .txt files).
The web sites I administer have hired a service to monitor themselves, and I added those user agents to the robots file (
/usr/share/awstats/lib/robots.pm) in order to count them. Adding them to SkipHosts just meant they weren’t counted and didn’t show up in the stats at all.
Hopefully that will give you an idea of what you’re aiming for as you follow the other, more detailed explanations of how to set up awstats. Remember, when you lose the data in a logfile and have to leave it behind — it’s only stats. You’ll manage without them. The stats are approximate anyway — very little attempt is made in the program to give an exact account of the activity. Records are thrown away and not counted almost every time awstats is run — so don’t sweat it if you lose a log file or two on the way. Once the cron jobs are set up and time passes, you’ll get fairly good coverage of the activity of the web server. Keep tuning your web app (eg, ensure that times logged use the same timezone across all apps), look at the config options for awstats and tune your .conf files for more interesting reports, and eventually you’ll have a great resource for security monitoring, marketing analysis and for web site usability and effectiveness reviews.
In fact, you probably can just set up awstats and let it accumulate statistics over time — don’t bother with the catchup script. I did it because I did have old logs, and I wanted to see what a year’s worth of web log statistics looked like in awstats and other packages. It did help me with choosing a web log analysis package, and in choosing among the various extra options for awstats not discussed above — but it was also time-consuming.