Posts Tagged ‘graph’

NJC: Day 3

Saturday, June 25th, 2011

I got an early start this day because I woke early for no real reason. So I headed into the office, had some breakfast there (grilled cheese with fried egg and coffee, thanks for asking) and got down to work.

That meant installing Evernote for OS X so I could attach a PDF to a note, then using the information I’d gathered there to write up a proposal concerning alerting. I am trying to be thorough in documenting what I do and why, so I tried to capture my thinking, rationalize my decision, and foreshadow future developments. As part of the research for the write-up, I think I noticed something odd about Pingdom’s pricing.

I’m probably misunderstanding something. But if the costs per check aren’t different at the Business plan level, and I don’t care about SMS notifies, what is my incentive to ever leave the Basic plan? My efficient frontier seems like it’s up and to the left and with a linear progression, it’s a Basic ballgame.

If the cost per check on the Business plan is less, then the graph is wrong and there is a break-even point on the expense of checks. But it sure doesn’t look like it.

I ran my proposal for additional monitoring past my boss and got his buy-in and then started deploying it. So that’s my first operational task which is not entirely reactive in nature; there had been an issue earlier with a system going away and no one noticing, but it wasn’t a production system and I was more interested in getting some kind of alerting going for those systems as they come online.

I fully expect to be iterating on the deployed monitoring solution. It was a trade-off between results and costs (financial and my time/effort/brain), and there are arguably better solutions that I didn’t feel I could invest enough in right now to get the most out of. It’s a starting point, an incremental improvement over what the company had in place before me.

This wasn’t quite a No Changes Friday (the only religious holiday I observe) but it seemed worth it to push through that sabbath to get additional awareness of the environment. Arguably it didn’t impact anything production-related, beyond the tiny additional load of the monitoring checks themselves, all of which just tickle listening network daemons.

Then I spent the rest of my day researching options for my next proposal, which will be SSL related.

The IO of Sauron

Tuesday, March 22nd, 2011

I needed to know what the disk utilization on a system was, essentially at all times, with a granularity of one second. Asking for the current iostat every five minutes via munin wasn’t sufficient. So I wrote this munin plugin. It tries to read a file and report the average and maximum disk utilization (the %util column from iostat) logged in that file. Then it unlinks the file and starts iostat sampling once a second for 250 seconds, with the output going into a fresh copy of that file, setting it up for the next time it’s polled.

It looks like this. Unlike the #devops people, I don’t think I have any reason to obfuscate the vertical axis of this picture. It’s the graph from an uninteresting system doing some periodic nightly batch work.

[Image: munin graphs of the data from iosmart]

It has not escaped me that it reports average averages, maximum averages, average maximums, and maximum maximums.

Here’s the script.


#!/usr/bin/perl

# iosmart by Shannon Prickett <shannon.prickett@gmail.com>
# get iostat data and provide max and average values since the last check.

use strict;
use warnings;

use File::stat;

use vars qw{ $argument $iostat_file $iostat_runs };
use subs qw{ print_header };

$iostat_file = '/var/tmp/iosmart';
$iostat_runs = 250; # normally 250, reduce when testing

$argument = $ARGV[0] || 'NONE';

if ($argument =~ /config/) { # hint to munin
  # munin-run calls this to set up graphs, check limits
  print <<EOM;
graph_title Extended iostat coverage
graph_vlabel utilization
graph_category disk
graph_info Collect iostat data every second
sda_max.label sda max util
sda_avg.label sda avg util
sdb_max.label sdb max util
sdb_avg.label sdb avg util
EOM
}
else { # do the work

  if (( -e $iostat_file) &&  # it exists
    ( -r $iostat_file) &&  # we can read it
    ( -s $iostat_file) ) {  # it has something in it

    my $st = stat( $iostat_file ) or
      die "can't stat $iostat_file: $!\n";

    my $mtime = $st->mtime;
    my $now = time( );
    my $delta = $now - $mtime;

    if ( $delta > 300 ) {  # it's been >5 minutes
      # warn goes to STDERR, so it stays out of the values munin parses
      warn "${iostat_file} is stale, using it anyway\n";
    }

    open( my $fh, '<', $iostat_file ) or
      die "failed to open $iostat_file: $!\n";
  
    my %devices;
    my ($skip_until, $stop_skipping);
    READLOOP: while ( my $line = <$fh> ) {
      next READLOOP unless ( $line =~ /^(sd\w).*?(\d+\.\d+)$/ ); # we only care about the disk rows

      # we want to skip the first block of iostat output
      # that's output since boot. we only care about now.
      unless ( defined $skip_until ) {
        $skip_until = $1;
        next READLOOP;
      }

      my $device = $1;  # per device values
      my $util = $2;  # keep the number from the last column

      if (( ! defined $stop_skipping ) && # if we are still skipping
        ( $device ne $skip_until ) ) {  # this isn't what we're looking for
        next READLOOP;
      }
      else {
        $stop_skipping = 1;
      }
    
      chomp $util;  # we hates filthy newlines forever
    
      $devices{$device}{'count'}++;  # autovivifies, so the first time gives 1

      $devices{$device}{'sum'} += $util;

      if (( ! exists $devices{$device}{'max'} ) or  # we've never set it before
        ( ! defined $devices{$device}{'max'} ) or  # the whole world is crazy
        ( $devices{$device}{'max'} < $util ) ) {  # it's smaller than current
        $devices{$device}{'max'} = $util;
      }

    }

    for my $device (keys %devices) {
      $devices{$device}{'average'} = $devices{$device}{'sum'} / $devices{$device}{'count'};

      print "${device}_max.value $devices{$device}{'max'}\n";
      print "${device}_avg.value $devices{$device}{'average'}\n";
    }

    close( $fh ) or die "can't close ${iostat_file}? wtf? $!\n";

    unlink( $iostat_file ) or die "can't rm ${iostat_file}: $!\n";

    print_header( );

    # suppress an error about inappropriate ioctl by not testing exit. :(
    system( "iostat -xd 1 $iostat_runs >> $iostat_file &" );
  }
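  else {
    # warn goes to STDERR, so it stays out of the values munin parses
    warn "can't use ${iostat_file}; making new\n";
    open( my $fresh, '>', $iostat_file ) or
      die "can't make new ${iostat_file}: $!\n";
    close( $fresh ) or die "can't close new ${iostat_file}: $!\n";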
    print_header( );
  }
}

sub print_header {
  open( my $header, '>>', $iostat_file ) or
    die "failed to start ${iostat_file}: $!\n";

  print $header "# this file is created by the iosmart munin-plugin\n";
  print $header "# if it's not updating, check the munin-node logs\n";

  close( $header ) or die "can't stop heading ${iostat_file}? HMPH $!\n";
}
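
The parsing and aggregation in the middle of the script can be tried out without a live iostat. What follows is a minimal sketch, not the plugin itself: `summarize_util` and the sample rows are made up for illustration, and the rows are trimmed to three numeric columns rather than being real `iostat -xd` output. It uses the same regex and the same "skip the since-boot block" idea as the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of iosmart: summarize the last column
# (%util) per device, ignoring the first report (totals since boot).
sub summarize_util {
    my @lines = @_;
    my (%stats, $first_device, $in_second_report);

    for my $line (@lines) {
        # disk rows only: device name first, %util as the last column
        next unless $line =~ /^(sd\w).*?(\d+\.\d+)$/;
        my ($device, $util) = ($1, $2);

        # the first disk row tells us which device opens each report
        unless (defined $first_device) { $first_device = $device; next; }

        # seeing that device again means the since-boot report is over
        $in_second_report ||= ($device eq $first_device);
        next unless $in_second_report;

        my $s = $stats{$device} ||= { count => 0, sum => 0, max => 0 };
        $s->{count}++;
        $s->{sum} += $util;
        $s->{max} = $util if $util > $s->{max};
    }

    $_->{avg} = $_->{sum} / $_->{count} for values %stats;
    return \%stats;
}

# Made-up sample: one since-boot report, then two one-second reports.
my @sample = (
    "Device: rrqm/s wrqm/s %util",
    "sda 0.10 0.20 5.00",
    "sdb 0.10 0.20 6.00",
    "sda 0.00 1.00 10.00",
    "sdb 0.00 1.00 0.50",
    "sda 0.00 3.00 30.00",
    "sdb 0.00 3.00 1.50",
);

my $stats = summarize_util(@sample);
for my $device (sort keys %$stats) {
    printf "%s_max.value %.2f\n", $device, $stats->{$device}{max};
    printf "%s_avg.value %.2f\n", $device, $stats->{$device}{avg};
}
```

With the sample above, the since-boot rows (5.00 and 6.00) are dropped, so sda comes out at a max of 30.00 and an average of 20.00, and sdb at 1.50 and 1.00.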

ETA: using a new code formatting plugin
