Log analysis of my website
I write these essays in part as a promotional activity. I'm a consultant, and expect people to find out more about what I do through reading what I've written.
I've wondered if it's been useful, but have put off doing the analysis of my website. At first it was because I didn't have enough essays to do interpretable analysis. And then I just put it off. At the German Chemoinformatics Conference I talked to quite a few people, mostly grad students, who had gotten information from my site. That was enough to make me finally do some analysis.
I used awstats, chosen after some web searching. I wanted something that could analyze my Apache logs and generate static pages. There are other tools, but since awstats did what I wanted I didn't try anything else.
So far this year I've had 1.1 million "hits", which correspond to 330,000 page views. A "hit" includes embedded content, so a single page view can generate multiple hits for CSS, images, and the like. Another nearly 500,000 page views come from web spiders and other identifiably non-human requests - that is, more page requests come from robots than from people. All told, I use less than 20 GB of bandwidth per year. I use pair Networks for my hosting, and my basic account allows 400 GB/month of transfer, so I'm not even close.
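To make the hit/page-view/robot split concrete, here's a minimal sketch of classifying Apache "combined" format log lines. The regex, asset suffixes, and robot heuristics are my own illustrative choices, not what awstats actually does:

```python
import re
from collections import Counter

# Apache "combined" log format (a simplified regex; awstats supports many formats)
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

# Illustrative heuristics -- not the actual rules awstats uses
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")
ROBOT_HINTS = ("slurp", "msnbot", "googlebot", "bot", "spider", "crawler")

def classify(line):
    """Label a log line as 'robot', 'asset hit', 'page view', or 'unparsed'."""
    m = LOG_RE.match(line)
    if m is None:
        return "unparsed"
    if any(hint in m.group("agent").lower() for hint in ROBOT_HINTS):
        return "robot"
    if m.group("path").lower().endswith(ASSET_SUFFIXES):
        return "asset hit"
    return "page view"

SAMPLE_LINES = [
    '1.2.3.4 - - [10/Nov/2007:12:00:00 +0100] "GET /writings/diary/ HTTP/1.1" 200 10240 "-" "Mozilla/5.0 (Windows NT 5.1) Firefox/2.0"',
    '1.2.3.4 - - [10/Nov/2007:12:00:01 +0100] "GET /style.css HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 5.1) Firefox/2.0"',
    '5.6.7.8 - - [10/Nov/2007:12:00:02 +0100] "GET /writings/diary/ HTTP/1.1" 200 10240 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
]
counts = Counter(classify(line) for line in SAMPLE_LINES)
```

Run over a real access.log, the Counter gives the hits-vs-views-vs-robots breakdown directly.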
Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB, and Googlebot 290 MB. Google's RSS reader accounted for 80 MB, Bloglines for 7 MB, and UniversalFeedParser for 5 MB. Of the users, 64.5% use Windows, 17% use Linux, 11.5% use Macs, and, jumping over the BSD and Solaris users, a full 88 requests came from an IRIX machine. The browser stats: 45% Firefox, 33.5% IE, 4% each for Mozilla and Safari, and 3% Opera.
Top hit (no surprise) is my RSS feed, viewed 82,000 times this year - including by aggregators, so interpret that number as you wish. Next was my LOLPython page, which wasn't a surprise: I wrote it deliberately to ride the then-high popularity of lolcats and lolcode. It got 17,500 views, and about 1,200 downloads from people who weren't me.
Going further down the list:
- naming molecules is the first chemistry page, at 4,300 hits. I think that's because it uses the word "vodka".
- my wide finder commentary is only a few months old and is #11 position with 4,200 hits. Basking in Tim Bray's shadow.
- 3,200 people viewed this slide. Why? People searching for "sample use case". But it's an image - how do the search engines know about it?
- the ANTLR work I did is also popular. Only 50 days old and 2,500 hits. Well, it was on the ANTLR home page for a while.
You can easily see that most people who come to my pages are there because of popular topics of the day (LOLPython, wide-finder) or general computing questions (threading, validation, HTML templates, Python, ANTLR). Very few came to my pages for cheminformatics reasons. Then again, there are very few people doing cheminformatics.
The top search phrases were:
- python basics - 2,200
- screen scraping - 1,600
- python trace - 1,000
- naming molecules - 1,000
- sample use case - 809
- use case sample - 610
- pyrssgen - 600
- sample use cases - 580
- boa constructor - 510 (that's a very old review of mine)
- lolpython - 500
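Log analyzers recover phrases like these from the referrer URLs: each search engine carries the query in a known parameter. A minimal sketch of the idea, with an illustrative engine-to-parameter table of my own (not the actual list awstats uses):

```python
from urllib.parse import urlsplit, parse_qs

# Which query parameter holds the search phrase, per engine
# (an illustrative subset, not awstats's real table)
QUERY_PARAM = {
    "www.google.com": "q",
    "search.yahoo.com": "p",
    "search.msn.com": "q",
}

def search_phrase(referrer):
    """Return the search phrase from a search-engine referrer URL, or None."""
    parts = urlsplit(referrer)
    param = QUERY_PARAM.get(parts.netloc)
    if param is None:
        return None  # not a known search engine
    values = parse_qs(parts.query).get(param)
    return values[0].lower() if values else None
```

Feeding every referrer through this and tallying the results gives a table like the one above.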
60% of the page views come from "direct address or bookmarks", 31% from search engines, and 10% from referrers. The top referrers were lolcode.com, then Pythonware's Daily-URL (probably lolpython), along with the already-mentioned wide-finder (via the effbot) and the ANTLR home page. programming.reddit.com linked to my lolpython page, and the matplotlib cookbook links to my page showing how to use matplotlib without a GUI.
Lastly, hostname analysis. Who is 220.127.116.11? That's registered to the RCN Corporation and resolved at 207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked down 780 MB of my 20 GB. All to read my RSS file every hour. Whoever it is doesn't know how to ask with an If-Modified-Since header, so they are downloading the entire thing (usually unchanged) every time. How do I complain?
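Reverse-resolving an address like that is a one-liner with Python's standard library - a small sketch; the function name is mine, it needs working reverse DNS, and not every address has a PTR record:

```python
import socket

def reverse_lookup(ip):
    """Return the registered PTR hostname for an IP address, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return None  # no reverse DNS entry for this address
```

Mapping this over the heaviest hosts in the log turns bare IPs into names you can actually recognize.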
The next hog is NewsAlloy through 18.104.22.168, which has downloaded 450 MB and makes full requests every 20 minutes. I emailed them this:
> Your RSS reader at 22.214.171.124, identified as "NewsAlloy/1.1 (http://www.NewsAlloy.com; 1 subscribers)", is taking up 5% of my upload bandwidth. While that's only 400 MB/year, the underlying reason is that your service doesn't send the tags needed to handle HTTP conditional GET. My server should only need to return a 304 Not Modified in most cases, rather than a 200 Ok (along with over 100K of content). You poll every 20 minutes, so that adds up.
>
> You would decrease your bandwidth use by quite a bit - perhaps an order of magnitude - by adding support for conditional GET requests. See for example: http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .

I admit: I did this partly to see what would happen. I got an answer within a few hours. They said it shouldn't have happened and asked for more details. Looking into it further, I see that whoever subscribed via their service unsubscribed a few months ago. NewsAlloy hadn't made a request since then.
I don't know who uses NewsAlloy. I will say that they had very responsive service.
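For the curious, the client side of conditional GET is small. Here's a minimal sketch using Python's urllib - the function names are mine, and a real feed reader would also persist the validators between polls:

```python
import urllib.request
import urllib.error

def build_conditional_request(url, etag=None, last_modified=None):
    """Attach the validator headers that make a GET conditional."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

def poll(url, etag=None, last_modified=None):
    """Fetch url; return (body, etag, last_modified).

    On a 304 Not Modified the body is None and the cached validators are
    returned unchanged, so the unchanged feed is never re-downloaded."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag", etag),
                    resp.headers.get("Last-Modified", last_modified))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # nothing changed since the last poll
            return None, etag, last_modified
        raise
```

A poller that threads the returned validators back into the next call pays the full download cost only when the feed actually changes.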
Next on the list, at only 6 MB, is my ISP. That's me checking things on my server; my browser's home page is my own web site. After that is a friend (I recognized the domain name) at 4 MB. He's configured his RSS reader to poll every 30 minutes.
Looking for hosts in my field, I see 2,000 requests from a biotech in England. Ah-ha, it's one person, reading this from a machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!
There are 700 page requests from the rest of pharma. 200 from one site (all through Google searches finding my PyDaylight work) and 100 from another site.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB