All Projects → siebenmann → cks-dtrace

siebenmann / cks-dtrace

Licence: other
DTrace scripts we use at CSLab

Programming Languages

DTrace
51 projects
== Our DTrace scripts

These are DTrace scripts that we've found useful to troubleshoot things
in our Solaris + ZFS + iSCSI fileserver environment; you can find some
details of that environment at

    http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFileserverSetup

These scripts are for OmniOS r151008j or so (as of 2014-05-15) and thus
more or less current Illumos, and may not work on any later or earlier
version. They work for us but as usual there are no warranties. I make
no claims that they are general about what they report; we wrote them
to find out the information that we needed at the time.

Please note that some of these scripts contain local assumptions and
things that are very specific to our environment here. You definitely
want to read the start of the scripts before using them blindly.

=== Using the scripts

There are four different things you can use these scripts for, depending
on what you need.

nfs3-mon.d and zfs-mon.d provide top-like periodic reports on NFS
v3 server activity and ZFS activity, suitable for running either when
there's a load problem or you just want to know generally what sort of
things are going on. nfs3-mon.d is focused on the sources of client
activity so you can zero in on where load is coming from. zfs-mon.d
focuses on what pools are active and how.

nfs3-stats.d and iscsi-stats.d gather aggregate timing and activity
information over however long you run them for NFS v3 server and iSCSI
initiator activity. Their goal is to characterize how your IO looks like
and how things seem to be performing; what activities are fast or slow,
and how things break down. These two scripts were developed first and
are the most primitive and peculiar.

nfs3-long.d and iscsi-long.d provide immediate reports of long operations
at the NFS v3 server or iSCSI initiator level. You would run these if
you think (or know) that you have unusually slow operations and want to
know what they look like and how often they happen. Right now the NFS
stuff is focused on reads, writes, and commits (although it reports on
most ops); the iSCSI initiator stuff looks at all SCSI iSCSI operations.

zfstrace.d reports every ZFS, ZIO, and iSCSI initiator operation, with
duration and other information. This isn't the kind of thing you read
directly, live; instead it's the kind of thing you feed to detailed
analysis that needs full information about each event. For example,
working out your 99% performance point or graphing the distribution
of IO times.

In addition, fdrwmon.d will report per-FD read/write IO activity for
a particular process, with a bunch of limitations based on what system
calls it currently monitors.

=== Potential errors when using the scripts

[NOTE: this information is probably obsolete for OmniOS. It's retained
for historical notes and because I don't have the time to thoroughly
revise this README right now.]

Suppose that you get an error like:
	nfs3-mon.d: line 63: probe description fbt::rfs3_read:entry does not match any probes

If you're on a Solaris 10 machine, what's probably going is that you
haven't exported any filesystems via NFS (or perhaps set up any iSCSI
stuff). Solaris does many things through loadable kernel modules,
including both NFS serving and iSCSI. These modules are loaded on
demand, when needed by things you're doing, and if the modules aren't
loaded any trace points they define aren't available (both fbt function
tracepoints and any general providers).

You can see a list of what kernel modules are loaded with modinfo.
In Solaris 10 update 8, the NFS server module is 'nfssrv (NFS server
module)' and the iSCSI initiator is 'iscsi (Sun iSCSI Initiator
v20100524-0)'.

Next, suppose you get an error like:
	dtrace: failed to compile script nfs3-mon.d: line 118: os is not a member of struct objset

What's happening here is that the contents of 'struct objset' changed
between Solaris 10 update 8 (what we're using and what these scripts
are written for) and later versions of Solaris. Look in the scripts for
comments about the ZNODE_TO_POOLNAME() macro and how to change it.

=== Script details

The current DTrace scripts here:

- fdrwmon.d: reports on read/write IO per-second bandwidth for a
  particular process on a per-file-descriptor basis. See the comments
  at the start of the script for how to run it and what information
  it prints. Currently only read(), write(), readv(), and writev()
  activity is tracked.

  (A thorough script would probably also look at least at the recv*()
  and send*() families of socket calls. We haven't needed this ourselves
  so far.)

- nfs3-stats.d: gathers timing information on NFS v3 server performance
  and ZFS performance (both high level and low level). This is mostly
  presented as distributions.

  - rfs3_* operations are NFS v3 server operations.
  - zfs_read and zfs_write are the high-level interface into ZFS, used
    by eg the NFS v3 server. They can hit the ARC instead of doing real
    IO.
  - zio_* operations are low level ZFS ZIO IO operations; as far as I
    know they come after the ARC. I don't really understand what zio_free
    does. In our environment, zio_ioctl operations are only used for
    sending disk flushes to the disks and this script ignores them for
    reasons beyond the scope of this README.

- iscsi-stats.d: gathers relatively low-level timing information on
  iSCSI initiator performance. This accumulates not just averages
  (which iostat will sort of tell you) but also distributions and
  breaks down the numbers in various ways.

- nfs3-long.d: report on long NFS v3 operations, primarily on read, write,
  and commit operations. Other operations have less details reported
  about them and the nominal filename and inode number associated with
  the operation needs operation-specific knowledge to interpret (eg,
  sometimes it is the directory the operation was done in, sometimes it
  is the thing the operation was working on, and so on).

- iscsi-long.d: report on long iSCSI SCSI operations.

- nfs3-mon.d: provides an every-10-seconds report of NFS v3 activity,
  primarily of read and write activity. The activity is broken down in
  various ways so that we/you can see who is causing it. See the comments
  at the start of the script for some information on what it reports.

- zfs-mon.d: provides an every-10-seconds report of ZFS activity, again
  primarily of read and write activity. See the comments at the start of
  the script for information on what the reports mean.

- zfstrace.d: provide timing traces of zfs_read()/zfs_write(), ZIO
  operations, and iSCSI operations; every operation is reported with
  the time it took and various other details. The output is cryptic
  and compact due to volume and speed concerns.

Understanding some of the details of this information requires some
knowledge of (Open)Solaris kernel internals, but I think that much of it
will be relatively obvious.

Right now we ignore the IO sizes and make no attempt to, eg, normalize
the operation durations by the amount of data they were working on. In
our environment this works out fine, but it may not in yours.

Note that one zfs_read() or zfs_write() will generally result in
multiple zio_* operations. Some of these operations are synchronous and
some are not (eg, readahead or writeback). nfs3-dtrace.d makes some
attempt to only really track synchronous operations that delay the
completion of the higher-level IO but is probably not perfect about it.

ZIO operations seem to bubble down (and up) through the layers of vdevs
involved in any particular ZFS pool. We attempt to only follow ZIO
operations directed against physical disks instead of the intermediate
vdevs (because this is really the useful level to report ZIOs) but our
handling of this may not be complete. Also, this has only been tested
in a setup with mirror vdevs and there may be things I don't know about
ZIO against raidz vdevs.

(Chris Siebenmann, November 23 2012/May 15 2014 for OmniOS revisions)
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].