Fine Grained Log Analysis

Related

What (summary)

Analyze Apache logs and generate statistics in near real time. We would like to know the most visited pages, the number of edits made by registered users, the number of anonymous edits, and so on, so that we can change how the site works and quickly see how those changes affect our AdoptionMap.

  • I feel that for the future, it would be wise to develop a logging solution that doesn't rely on Apache log files. I'd be willing to discuss some proposed solutions, along with the experience I have had so far analyzing Apache's log files. Naturally, for retroactive data the Apache logs are our only choice. -Stephen Judkins

Why this is important

We are developing a model of how users progress to deeper and deeper levels of engagement with AboutUs. As we make changes to how the site works, we want to be able to predict and then measure the effects of those changes so that we can become wizards at optimizing how engaging and sticky the site is. This is called A/B testing: some users are presented with the new thing (the A group) and some are presented with the original thing (the B group), so that we can isolate and measure the exact effect of even very small tweaks.
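As a rough illustration of the grouping described above, the sketch below assigns each user to the A or B group by hashing a user identifier together with an experiment name, so the same user always sees the same variant. The function name `ab_group`, the experiment label, and the 50/50 split are assumptions made for this example, not something the project has settled on.

```python
import hashlib


def ab_group(user_id, experiment):
    """Return "A" (new thing) or "B" (original thing) for this user.

    Hypothetical helper: hashing keeps the assignment stable across visits
    without storing any per-user state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 8 hex digits as an integer and split evenly in two.
    return "A" if int(digest[:8], 16) % 2 == 0 else "B"


# The experiment name is made up for the example.
groups = [ab_group(f"user-{n}", "new-edit-button") for n in range(1000)]
```

Including the experiment name in the hash means a user who lands in group A for one experiment is not automatically in group A for the next one.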

Many of the interactions within the model are difficult to see with standard log analysis, so the ability to run custom queries over our log files is essential.

DoneDone

  • There is a page that reports "Unique Visitors" over some time period. The page takes start-time and end-time arguments and provides a report that is at most 1 day stale.
  • There is a page that reports "Anonymous Editors" over some time period. The page takes start-time and end-time arguments and provides a report that is at most 1 day stale.
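A minimal sketch of the "Unique Visitors" computation behind such a report, assuming the logs are in Apache's common or combined format and that one client address counts as one visitor (a rough proxy, since proxies and shared addresses blur this). The names `unique_visitors`, `SAMPLE_LINES`, and `PDT` are invented for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Start of an Apache common/combined log line: host ident user [timestamp] ...
LOG_PREFIX = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def unique_visitors(lines, start, end):
    """Count distinct client hosts with a request in [start, end).

    start and end must be timezone-aware datetimes.
    """
    hosts = set()
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        when = datetime.strptime(match.group(2), TIME_FORMAT)
        if start <= when < end:
            hosts.add(match.group(1))
    return len(hosts)


# Made-up log lines for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [04/Sep/2007:10:00:00 -0700] "GET /LASIK HTTP/1.1" 200 512',
    '10.0.0.2 - - [04/Sep/2007:11:00:00 -0700] "GET /Wiki HTTP/1.1" 200 512',
    '10.0.0.1 - - [04/Sep/2007:12:00:00 -0700] "GET /Wiki HTTP/1.1" 200 512',
    '10.0.0.3 - - [05/Sep/2007:09:00:00 -0700] "GET /Wiki HTTP/1.1" 200 512',
]
PDT = timezone(timedelta(hours=-7))
visitors_on_sept_4 = unique_visitors(
    SAMPLE_LINES,
    datetime(2007, 9, 4, tzinfo=PDT),
    datetime(2007, 9, 5, tzinfo=PDT),
)
```

The start and end arguments map directly onto the start-time and end-time arguments of the report page.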


  • Analyzing Apache log files would be an extraordinarily difficult and roundabout way to measure "anonymous editors" or any statistic about how our Wiki was modified. What we need, in my opinion, is a method to easily cross-reference page visits with page_ids from our database.
  • Also, we should talk to Julia regarding the information that Madrona and other VCs want.

-Stephen Judkins

  • I made this work using logsnarf; it wasn't so bad. - Jason Parmer

NeedsAcceptanceTest

Steps

  • Create a script that can instantiate an EC2 machine and configure it for log analysis and reporting. Use ssh to install git, Rails, etc. Set a timer to shut the machine down if it is unused for a week.
  • Create a Rails script that can manage feature extraction from logs. Feature computations will be pulled from a well-known git repository. The computations that are run will be logged in S3. It is not required at this time to distribute work among additional EC2 instances, though that is an eventual goal. Output extracted features (Unique Visitors, Anonymous Editors, etc.) to appropriately labeled S3 objects.
  • Create a Rails/JavaScript page that can browse precomputed features from S3 objects. Assume the data is in some well-known format such as JSON.
  • Review the log analysis process for compliance with AboutUs privacy policy.
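To make "appropriately labeled objects in a well-known format" concrete, here is one possible shape for a stored feature, sketched in Python rather than Rails for brevity. The key layout (`features/<name>/<start>_<end>.json`) and the field names are assumptions for illustration, and the sketch only builds the key and JSON body; it does not talk to S3.

```python
import json
from datetime import date


def feature_object(name, start, end, value):
    """Build a (key, body) pair for one computed feature.

    Hypothetical layout: the key encodes the feature name and date range so
    a browsing page can find objects by listing a prefix.
    """
    key = f"features/{name}/{start.isoformat()}_{end.isoformat()}.json"
    body = json.dumps(
        {
            "feature": name,
            "start": start.isoformat(),
            "end": end.isoformat(),
            "value": value,
        },
        sort_keys=True,
    )
    return key, body


# The value 1234 is a placeholder, not a real measurement.
key, body = feature_object("unique_visitors", date(2007, 9, 4), date(2007, 9, 5), 1234)
```

Keeping the date range in both the key and the body lets the browsing page list what exists without downloading every object.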

Query, and Sysop Visible Info

The ability to run keyword queries and get a list of the transitions people make between pages, for example:

FROM                               TO                                 COUNT
---------------------------------  ---------------------------------  ---------
Wiki                               LASIK                              7884
LASIK                              Portal:LASIK/QualifyingForSurgery   254
LASIK                              exit                                650
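A table like the one above could be produced roughly as follows, assuming the logs have already been reduced to (visitor, timestamp, page) tuples. The sketch treats each visitor's whole history as one visit and labels the step after the last page as "exit"; a real version would probably also split visits on long idle gaps. The function name and sample data are invented for the example.

```python
from collections import Counter, defaultdict


def transition_counts(events, keyword=None):
    """Count page-to-page transitions, optionally filtered by keyword.

    events: iterable of (visitor, timestamp, page) tuples.
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in events:
        by_visitor[visitor].append((timestamp, page))

    counts = Counter()
    for visits in by_visitor.values():
        visits.sort()  # order each visitor's page views by time
        pages = [page for _, page in visits] + ["exit"]
        for source, target in zip(pages, pages[1:]):
            counts[(source, target)] += 1

    if keyword is not None:
        needle = keyword.lower()
        counts = Counter({
            pair: n for pair, n in counts.items()
            if needle in pair[0].lower() or needle in pair[1].lower()
        })
    return counts


# Made-up events; timestamps are plain integers to keep the example short.
EVENTS = [
    ("a", 1, "Wiki"), ("a", 2, "LASIK"), ("a", 3, "Portal:LASIK/QualifyingForSurgery"),
    ("b", 1, "Wiki"), ("b", 2, "LASIK"),
    ("c", 5, "AboutUs"),
]
for (source, target), n in transition_counts(EVENTS, keyword="lasik").most_common():
    print(f"{source:<35}{target:<35}{n}")
```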

Eyeball Time

Track how long users stay at particular pages so that we can tell how well we are doing at delivering the content they are looking for.
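Server logs do not record when a reader leaves a page, so one common approximation is to use the gap between a visitor's consecutive requests as the time spent on the earlier page. The sketch below does that, assuming (visitor, timestamp in seconds, page) tuples and a hypothetical 30-minute cutoff beyond which a gap is treated as the visitor having left. The last page of each visit gets no measurement at all, which is a known limitation of this approach.

```python
from collections import defaultdict


def average_dwell_seconds(events, max_gap=1800):
    """Estimate average seconds spent per page from request gaps.

    events: iterable of (visitor, timestamp_seconds, page) tuples.
    Gaps longer than max_gap are discarded as idle time.
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in events:
        by_visitor[visitor].append((timestamp, page))

    totals = defaultdict(float)
    samples = defaultdict(int)
    for visits in by_visitor.values():
        visits.sort()
        for (t0, page), (t1, _next_page) in zip(visits, visits[1:]):
            gap = t1 - t0
            if gap <= max_gap:
                totals[page] += gap
                samples[page] += 1
    return {page: totals[page] / samples[page] for page in totals}


# Made-up events for illustration.
DWELL_EVENTS = [
    ("a", 0, "Wiki"), ("a", 30, "LASIK"), ("a", 150, "Portal:LASIK/QualifyingForSurgery"),
    ("b", 0, "Wiki"), ("b", 90, "LASIK"), ("b", 5000, "Wiki"),
]
dwell = average_dwell_seconds(DWELL_EVENTS)
```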

Discussion

I gotta say that, having written logsnarf, I think a lot of this stuff is way overkill. We can get the answers we want in 6 hours right now. I think a great first step would be splitting up logsnarf runs on the cluster and getting the time down to 45 minutes. At that point we should revisit the cost/benefit for this project. Jason Parmer 16:57, 4 September 2007 (PDT)

Saw this tool recently and thought I'd add it here for future reference. -- Ward 09:41, 3 April 2008 (PDT)