Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2007/12/23/navel_gazing

Log analysis of my website

I write these essays in part as a promotional activity. I'm a consultant, and expect people to find out more about what I do through reading what I've written.

I've wondered if it's been useful, but have put off doing the analysis of my website. At first it was because I didn't have enough essays to do interpretable analysis. And then I just put it off. At the German Chemoinformatics Conference I talked to quite a few people, mostly grad students, who had gotten information from my site. That was enough to make me finally do some analysis.

I used awstats, chosen based on doing some web searches. I wanted something that could analyze my Apache logs and could generate static pages. There are other tools but since awstats did what I wanted I didn't try anything else.

So far this year I've had 1.1 million "hits", which corresponds to 330,000 page views. A "hit" counts every requested file, so a single page view can produce multiple hits because of CSS, images, and other embedded content. Another nearly 500,000 page views come from web spiders and other identifiably non-human requests. That's more page requests from robots than from people. All told, I use less than 20GB of bandwidth per year. I use pair Networks for my hosting. My basic account allows 400GB/month of transfer. I'm not even close.
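To make the hit versus page view distinction concrete, here is a minimal Python sketch that tallies both from Apache log lines. It is not how awstats does it. The regular expression assumes the common or combined log format, and the `ASSET_SUFFIXES` and `ROBOT_HINTS` lists are my own illustrative guesses, far cruder than a real analyzer's rules.

```python
import re
from collections import Counter

# Assumes Apache common/combined log format; other formats need a new pattern.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Illustrative assumptions, not awstats' actual classification rules.
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")
ROBOT_HINTS = ("bot", "slurp", "spider", "crawler")


def summarize(lines):
    """Count hits, page views, robot requests and unparsed lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            counts["unparsed"] += 1
            continue
        agent = (m.group("agent") or "").lower()
        if any(hint in agent for hint in ROBOT_HINTS):
            counts["robot_requests"] += 1
            continue
        # Every non-robot request is a hit, whatever file it asked for.
        counts["hits"] += 1
        # Only requests for something other than embedded assets count as page views.
        path = m.group("path").split("?", 1)[0].lower()
        if not path.endswith(ASSET_SUFFIXES):
            counts["page_views"] += 1
    return counts
```

Calling `summarize(open("access.log"))` (a hypothetical file name) returns a `Counter` with the four tallies.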

Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB and Googlebot 290MB. 80MB for Google's RSS reader, 7MB from Bloglines and 5MB from UniversalFeedParser. Of the users, 64.5% use Windows, 17% use Linux, 11.5% use Macs, and jumping over the BSD and Solaris users, a full 88 requests came from an IRIX machine. The browser stats are 45% Firefox, 33.5% IE, 4% each Mozilla and Firefox, 3% Opera.

Top hit (no surprise) is my RSS feed, viewed 82,000 times this year. Including by aggregators so translate as you wish. Next was my LOLPython page, which wasn't a surprise. I wrote it deliberately because of the then high popularity of lolcats and lolcode. It got 17,500 views. About 1,200 downloads from people who weren't me.

The next two were surprising. I did a series of lectures for the NBN. These were for the most part graduate students in biology, going into computational biology, who needed more programming training. The page on Javascript validation got 7,300 hits and the one on threads in Python got 5,800. My screen scraping page was also popular, at 5,600 views.

Going further down the list:

I do a lot of work with cheminformatics, but that's the details. In most cases my topic is more general, like how to write a C extension for Python (that just happens to use a chemistry toolkit). The highest cheminformatics-specific hit is my article on SMILES tokenization, with 1,500 hits. Most of the links come from Wikipedia's SMILES page. My most popular bioinformatics page is on BLAST parsing, at just under 1,400 hits.

You can easily see that most people who come to my pages are there because of popular topics of the day (LOLPython, wide-finder) or general computing questions (threading, validation, HTML templates, Python, ANTLR). Very few came to my pages for cheminformatics reasons. Then again, there are very few people doing cheminformatics.

The top search phrases were:

Yes folks, 2,000 people came to my site for one image I have of a use case, from a 10 minute presentation I gave at a bioinformatics conference trying to convince people that usability analysis is important. I don't think it had any effect. No one came to my site searching for information on OEChem.

60% of the page views came from "direct address or bookmarks", 31% from search engines, and 10% from referrers. The top referrer was lolcode.com, then Pythonware's Daily-URL (probably lolpython), followed by the already mentioned wide-finder (via the effbot) and the ANTLR home page. programming.reddit.com linked to my lolpython page, and the matplotlib cookbook links to my page showing how to use matplotlib without a GUI.

Lastly, hostname analysis. Who is 207.172.151.225? That's registered to the RCN Corporation and resolves to 207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked down 780 MB of my 20GB. All to read my RSS file every hour. Whoever it is doesn't know how to send an If-Modified-Since header, as they are downloading the entire thing (usually unchanged) every time. How do I complain?
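A per-host bandwidth tally like this one can also be pulled straight from the logs. Here is a rough Python sketch, not what awstats does internally. It assumes Apache common or combined log format, where the response size is the second field after the quoted request line, and it assumes the request line itself contains no embedded double quotes.

```python
from collections import defaultdict


def bytes_by_host(lines):
    """Return (host, total_bytes) pairs, biggest consumer first."""
    totals = defaultdict(int)
    for line in lines:
        # Splitting on '"' gives: prefix, request line, ' status size ', ...
        fields = line.split('"')
        if len(fields) < 3:
            continue  # not a recognizable log line
        host = line.split(" ", 1)[0]
        status_and_size = fields[2].split()
        # Apache logs "-" instead of a number when no body was sent.
        if len(status_and_size) < 2 or not status_and_size[1].isdigit():
            continue
        totals[host] += int(status_and_size[1])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The first few entries of the returned list are the hosts worth looking up by hand.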

The next hog is NewsAlloy through 207.230.13.10 which has downloaded 450 MB, and makes full requests every 20 minutes. I emailed them this:

Your RSS reader at 207.230.13.10 , identified as "NewsAlloy/1.1 (http://www.NewsAlloy.com; 1 subscribers)" is taking up 5% of my upload bandwidth. While that's only 400MB/year, the underlying reason is because your service doesn't send the tags needed to handle HTTP conditional get. My server should only need to return a 304 Not Modified for most cases, rather than the 200 Ok (along with over 100K of content). You poll every 20 minutes, so that adds up.

You would decrease your bandwidth use by quite a bit - perhaps an order of magnitude - by adding support for conditional GET requests. See for example: http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .

I admit: I do this partially to see what happens. I got an answer within a few hours. They said it shouldn't have happened and asked for more details. Looking into it further, I see that whoever subscribed via their service unsubscribed a few months ago. NewsAlloy hadn't made a request since then.

I don't know who uses NewsAlloy. I will say that they had very responsive service.
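For anyone writing a feed poller, the conditional GET mentioned in that email looks roughly like this on the client side. This is only a sketch using Python's standard `urllib`. The function names and the `cache` dictionary are made up for illustration, and a real reader would also want timeouts and persistent storage for the validators.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whatever validators were saved last time."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_feed(url, cache):
    """Fetch url, reusing cache['body'] when the server answers 304.

    cache is a plain dict (hypothetical layout) with optional
    'etag', 'last_modified' and 'body' keys; it is updated in place.
    """
    req = build_conditional_request(
        url, cache.get("etag"), cache.get("last_modified"))
    try:
        with urllib.request.urlopen(req) as resp:
            # 200 OK: remember the new validators and the new content.
            cache["etag"] = resp.headers.get("ETag")
            cache["last_modified"] = resp.headers.get("Last-Modified")
            cache["body"] = resp.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code != 304:
            raise
        # Nothing changed, so the cached body is still current.
    return cache.get("body")
```

Polling with the same `cache` dict each time means an unchanged feed costs a few hundred bytes of headers instead of the full document.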

Next on the list, at only 6MB, is my ISP. This is me checking things on my server, and my home page is my web site. After that is a friend (I recognized the domain name) at 4MB. He's configured his RSS reader to poll every 30 minutes.

Looking for hosts in my field, I see 2,000 hits from a biotech in England. Ah-ha, it's one person, reading this from a machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!

There are 700 page requests from the rest of pharma. 200 from one site (all through Google searches finding my PyDaylight work) and 100 from another site.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB