Ticket #340 (closed enhancement: fixed)
Metrics gathering
Reported by: | jdreed | Owned by: | broder
---|---|---|---
Priority: | normal | Milestone: | Fall 2009 Release
Component: | -- | Keywords: |
Cc: | | Fixed in version: |
Upstream bug: | | |
Description
Per discussion at release team, we need a way to identify what Athena is being used for. To start with, we need to be able to differentiate between "Email", "Web", "Academic Software" (defined as some popular third-party packages), and "Other". Possibly also "Writing Code".
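A first cut at the bucketing described above could be a simple lookup on the executable's basename. This is only a sketch: the specific program names below are illustrative assumptions, not the list Athena actually used.

```c
#include <stdio.h>
#include <string.h>

/* Map an executable path to one of the ticket's categories.
 * The program-name-to-category mapping here is hypothetical. */
const char *categorize(const char *path)
{
    /* Strip the directory so "/usr/bin/firefox" matches "firefox". */
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;

    if (strcmp(base, "evolution") == 0 || strcmp(base, "pine") == 0)
        return "Email";
    if (strcmp(base, "firefox") == 0)
        return "Web";
    if (strcmp(base, "matlab") == 0 || strcmp(base, "maple") == 0)
        return "Academic Software";
    if (strcmp(base, "gcc") == 0 || strcmp(base, "emacs") == 0)
        return "Writing Code";
    return "Other";
}
```

Anything not recognized falls into "Other", which keeps the categories mutually exclusive.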
Attachments
Change History
comment:2 Changed 15 years ago by broder
I'm coming around to using the connector instead of the other interfaces we've looked at. It allows us to see what programs are being run out of AFS as well as off the local system, which may be useful since alexp has expressed skepticism of the wrapper scripts' stats in the past.
Attached is the version of geofft's connector.c that I've been working with. It should do a good job of batching large numbers of fork/exec pairs happening in a short period of time, while still not allowing events to queue up for seconds' worth of processing time. It also catches only exec calls, and prints only the executable path. It probably wants better error handling. We may also want to consider filtering based on non-system UIDs (e.g. only print out programs with a UID > 1000 or so).
The intent is to run this under a script in a higher-level language, which reads the connector's stdout and handles the collection, batching, and submission.
The performance hit seems pretty minuscule - about 0.4% on Anders' degenerate fork bomb:
kid-icarus:~/src/moira broder$ time seq 5000 | xargs -I% true

real	0m10.474s
user	0m4.044s
sys	0m5.684s

kid-icarus:~/src/moira broder$ time seq 5000 | xargs -I% true

real	0m10.518s
user	0m4.788s
sys	0m4.852s
comment:3 Changed 15 years ago by broder
geofft pointed out that I apparently need to use recvfrom, or else I'll get spurious wakeups; new version attached.
comment:4 Changed 15 years ago by broder
- Status changed from new to proposed
I've added debathena-metrics to debathena-cluster and uploaded both to proposed.
I've also been in touch with Jonathon to verify that none of the various public syslog reports will include metrics data.
comment:5 Changed 15 years ago by broder
- Status changed from proposed to closed
- Resolution set to fixed
This has been moved into production, along with documentation on the privacy concerns: http://kb.mit.edu/confluence/x/TQlS
From the technical side, I have two attempts at an implementation of this, using two Linux APIs: inotify and cn_proc. They're in /mit/geofft/debathena/inotify.c and /mit/geofft/debathena/connector.c respectively.
inotify watches a small number of directories we give it and reports back events on open, access, and close, so watching /bin and /usr/bin would be a first-order way to find which applications are used.
cn_proc, via the kernel's less-than-clearly-named "connector" netlink interface, reports on every process creation (fork), exec, and exit, so we can fairly accurately track which processes exist and for how long. I prefer this method because it's the "right" API for the job; it gives us better accuracy and better resilience to lost events (which can happen with either method).
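The cn_proc setup the connector relies on looks roughly like this: open a NETLINK_CONNECTOR socket, bind to the CN_IDX_PROC multicast group, and send a PROC_CN_MCAST_LISTEN control message, after which the socket yields struct proc_event payloads. This is a compilable sketch under those assumptions, not connector.c itself; subscribing requires root, so only the message builder is exercised here, and subscribe() is shown for shape.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Fill `buf` with the netlink + connector control message that
 * toggles proc-event delivery (PROC_CN_MCAST_LISTEN/IGNORE).
 * Returns the message length to pass to send(). */
size_t build_mcast_op(char *buf, enum proc_cn_mcast_op op)
{
    struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
    struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);

    memset(buf, 0, NLMSG_SPACE(sizeof(*cn) + sizeof(op)));
    nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*cn) + sizeof(op));
    nlh->nlmsg_type = NLMSG_DONE;

    cn->id.idx = CN_IDX_PROC;   /* the proc-events connector */
    cn->id.val = CN_VAL_PROC;
    cn->len = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));
    return nlh->nlmsg_len;
}

/* Root-only half: bind to the proc-events group and ask the kernel
 * to start sending fork/exec/exit events. Returns the socket fd. */
int subscribe(void)
{
    int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (sock < 0)
        return -1;

    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_groups = CN_IDX_PROC,   /* join the proc-events group */
    };
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }

    char buf[NLMSG_SPACE(sizeof(struct cn_msg) + sizeof(enum proc_cn_mcast_op))];
    size_t len = build_mcast_op(buf, PROC_CN_MCAST_LISTEN);
    if (send(sock, buf, len, 0) < 0) {
        close(sock);
        return -1;
    }
    /* From here, reading the socket yields struct proc_event payloads
     * (PROC_EVENT_FORK, PROC_EVENT_EXEC, PROC_EVENT_EXIT, ...); using
     * recvfrom rather than recv relates to the fix in comment:3. */
    return sock;
}
```

Because delivery is multicast with no per-subscriber queue guarantees, events can still be dropped under load, which is the "lost events" caveat mentioned above.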
I haven't yet looked at the significantly less technical side of this, which is to modify either of these programs to look for the processes we care about, categorize them, add up how long each one was run, and then submit this to some central server.