szaydel
6/10/2015 - 7:59 PM

Netatalk performance bottleneck observations

Netatalk performance bottleneck observations

dtrace -qn '
  inline int FXATTR = 0x4000;
  
  fbt::copen:entry /pid == 20595 / {
    self->p = copyinstr(arg1);
    self->fmode = (args[2] & FXATTR) == FXATTR ? "xattr" : "normal";
  }
  fbt::copen:return /self->p == "com.sam:testing" / {
    printf("%s open type: %s retcode: %d\n", self->p, self->fmode, args[1]);
    self->fmode = 0;
  }'
pwrite64 -- IO happens in 4K and 128K sizes 
Should determine if 4K IO should be made larger, reducing number of calls to OS


Time from start to end of this routine varies a lot. It would be great to figure out why we have such variation. It looks like we have a binomial distribution with peaks at 1 microsecond and 32 microseconds, with a distant third at ~260 microseconds. I would have expected this to be closer, since the buffersize is unchanging at 1MB.

# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize(timestamp - self->ts); self->ts = 0;}' -p 1645
dtrace: description 'pid$target::dsi_write:entry ' matched 2 probes
^C

          1048576
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@                          11539
            2048 |@@@@                                     3320
            4096 |                                         40
            8192 |                                         31
           16384 |@@                                       1271
           32768 |@@@@@@@@@@@@@@@                          11178
           65536 |@                                        697
          131072 |                                         61
          262144 |@@                                       1860
          524288 |                                         14
         1048576 |                                         2
         2097152 |                                         1
         4194304 |                                         0
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         2
       134217728 |                                         0


pwrite calls occur in loops which result in constantly varying size of IO to the filesystem. If we could issue IO in more standardized units, like 8K, 16K, etc., we would likely see some performance improvement and fewer system calls to do IO. Most IO issued appears to be an odd size: 253988. We can see below that pwrite64 is being called with all sorts of sizes, some of which I am sure are not optimal with ZFS underneath.

# dtrace -n 'syscall::pwrite64:entry /execname=="afpd"/{@[arg2] = sum(arg2);}'
dtrace: description 'syscall::pwrite64:entry ' matched 1 probe
^C

              436              436
             1000             1000
             1076             1076
             1336             1336
             1024             2048
             8136            24408
             8103            48618
           260808           260808
           261068           261068
           261144           261144
           261708           261708
           261120           522240
           254008           762024
           254041          1524246
             8156         15773704
           253988        490958804
    
dsi_stream_read is basically a wrapper around buf_read, which puts buf_read into a while loop. The sizes of IO will vary here as well. We are likely reading from RAM in some cases, but not all. Trying to limit IO to some fixed unit may improve this function as well. This appears to be one of the core routines.

dsi_write seems to take varying amout of time to complete which may just be a product of the test system being busier sometimes than at other times. This should be looked into though. In all cases dsi_write seems to do IO in 1MB chunks, but the time it takes to complete it seems to go from 1 to 512. There appear to be two peaks, one around 1-2 microseconds, and the other around 32-64 microseconds.

# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887
dtrace: description 'pid$target::dsi_write:entry ' matched 2 probes

^C

          1048576
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@                               25609
               2 |@@@@@@@@@                                21831
               4 |                                         1209
               8 |                                         67
              16 |@@                                       4397
              32 |@@@@@@@@                                 20451
              64 |@@@@                                     10933
             128 |@@@@                                     10191
             256 |                                         854
             512 |@                                        1464
            1024 |                                         741
            2048 |                                         277
            4096 |                                         41
            8192 |                                         2
           16384 |                                         0

dsi_stream_receive seems to mostly take under 10 microseconds, but there are instances where it takes a good bit longer than that, perhaps 10 X longer, or more. This may be expected behavior which has to do with the amount of data being transmitted.

# dtrace -n 'pid$target::dsi_stream_receive:entry { self->ts = timestamp; } pid$target::dsi_stream_receive:return {@ = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887

dtrace: description 'pid$target::dsi_stream_receive:entry ' matched 2 probes

^C


           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         4
               2 |                                         17
               4 |@@@@@@@@@@@@@@@@@@                       10443
               8 |@@@@@@@@@@@@@@@@@                        10269
              16 |@                                        443
              32 |                                         238
              64 |                                         277
             128 |                                         259
             256 |@                                        368
             512 |@                                        364
            1024 |@                                        880
            2048 |                                         176
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         3
          524288 |                                         0

dtrace -n 'pid$target::dsi_stream_receive:entry { self->ts = timestamp; } pid$target::dsi_stream_receive:return {@ = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887

# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887

# dtrace -n 'pid$target::pwrite64:entry { @[arg2] = count(); }' -p 1645
dtrace: description 'pid$target::pwrite64:entry ' matched 1 probe

# dtrace -n 'pid$target::pwritev:entry { this->x = (struct uio *)copyin(arg1, sizeof (struct uio)); @ = quantize(this->x->uio_resid); }' -p 1645

dtrace -n 'syscall::writev:return {this->x = ((struct iovec *)(arg1))->iov_len ; @ = quantize(this->x); }'