Netatalk performance bottleneck observations
dtrace -qn '
inline int FXATTR = 0x4000;
fbt::copen:entry /pid == 20595 / {
self->p = copyinstr(arg1);
self->fmode = (args[2] & FXATTR) == FXATTR ? "xattr" : "normal";
}
fbt::copen:return /self->p == "com.sam:testing" / {
printf("%s open type: %s retcode: %d\n", self->p, self->fmode, args[1]);
self->fmode = 0;
}'
pwrite64 -- IO happens in 4K and 128K sizes
Should determine if 4K IO should be made larger, reducing number of calls to OS
Time from start to end of this routine varies a lot. It would be great to figure out why we have such variation. It looks like we have a binomial distribution with peaks at 1 microsecond and 32 microseconds, with a distant third at ~260 microseconds. I would have expected this to be closer, since the buffersize is unchanging at 1MB.
# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize(timestamp - self->ts); self->ts = 0;}' -p 1645
dtrace: description 'pid$target::dsi_write:entry ' matched 2 probes
^C
1048576
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@ 11539
2048 |@@@@ 3320
4096 | 40
8192 | 31
16384 |@@ 1271
32768 |@@@@@@@@@@@@@@@ 11178
65536 |@ 697
131072 | 61
262144 |@@ 1860
524288 | 14
1048576 | 2
2097152 | 1
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 2
134217728 | 0
pwrite calls occur in loops which result in constantly varying size of IO to the filesystem. If we could issue IO in more standardized units, like 8K, 16K, etc., we would likely see some performance improvement and fewer system calls to do IO. Most IO issued appears to be an odd size: 253988. We can see below that pwrite64 is being called with all sorts of sizes, some of which I am sure are not optimal with ZFS underneath.
# dtrace -n 'syscall::pwrite64:entry /execname=="afpd"/{@[arg2] = sum(arg2);}'
dtrace: description 'syscall::pwrite64:entry ' matched 1 probe
^C
436 436
1000 1000
1076 1076
1336 1336
1024 2048
8136 24408
8103 48618
260808 260808
261068 261068
261144 261144
261708 261708
261120 522240
254008 762024
254041 1524246
8156 15773704
253988 490958804
dsi_stream_read is basically a wrapper around buf_read, which puts buf_read into a while loop. The sizes of IO will vary here as well. We are likely reading from RAM in some cases, but not all. Trying to limit IO to some fixed unit may improve this function as well. This appears to be one of the core routines.
dsi_write seems to take varying amout of time to complete which may just be a product of the test system being busier sometimes than at other times. This should be looked into though. In all cases dsi_write seems to do IO in 1MB chunks, but the time it takes to complete it seems to go from 1 to 512. There appear to be two peaks, one around 1-2 microseconds, and the other around 32-64 microseconds.
# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887
dtrace: description 'pid$target::dsi_write:entry ' matched 2 probes
^C
1048576
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@ 25609
2 |@@@@@@@@@ 21831
4 | 1209
8 | 67
16 |@@ 4397
32 |@@@@@@@@ 20451
64 |@@@@ 10933
128 |@@@@ 10191
256 | 854
512 |@ 1464
1024 | 741
2048 | 277
4096 | 41
8192 | 2
16384 | 0
dsi_stream_receive seems to mostly take under 10 microseconds, but there are instances where it takes a good bit longer than that, perhaps 10 X longer, or more. This may be expected behavior which has to do with the amount of data being transmitted.
# dtrace -n 'pid$target::dsi_stream_receive:entry { self->ts = timestamp; } pid$target::dsi_stream_receive:return {@ = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887
dtrace: description 'pid$target::dsi_stream_receive:entry ' matched 2 probes
^C
value ------------- Distribution ------------- count
0 | 0
1 | 4
2 | 17
4 |@@@@@@@@@@@@@@@@@@ 10443
8 |@@@@@@@@@@@@@@@@@ 10269
16 |@ 443
32 | 238
64 | 277
128 | 259
256 |@ 368
512 |@ 364
1024 |@ 880
2048 | 176
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 | 0
262144 | 3
524288 | 0
dtrace -n 'pid$target::dsi_stream_receive:entry { self->ts = timestamp; } pid$target::dsi_stream_receive:return {@ = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887
# dtrace -n 'pid$target::dsi_write:entry { self->bs = arg2; self->ts = timestamp; } pid$target::dsi_write:return {@[self->bs] = quantize((timestamp - self->ts)/1000); self->ts = 0;}' -p 1887
# dtrace -n 'pid$target::pwrite64:entry { @[arg2] = count(); }' -p 1645
dtrace: description 'pid$target::pwrite64:entry ' matched 1 probe
# dtrace -n 'pid$target::pwritev:entry { this->x = (struct uio *)copyin(arg1, sizeof (struct uio)); @ = quantize(this->x->uio_resid); }' -p 1645
dtrace -n 'syscall::writev:return {this->x = ((struct iovec *)(arg1))->iov_len ; @ = quantize(this->x); }'