The table at the end of this page compares the performance of the original reiser4 key comparison functions (and znode_contains_key_strict) against the new versions. The numbers were compiled from five runs; each run exercises the function under test 100,000 times, and the five run times are averaged.
Here is the code: keytest.zip.
What Was Changed
Key Comparison Functions
All reiser4 key comparisons are built upon the keycmp function, which returns -1 (less than), 0 (equal to) or 1 (greater than).
Here is the function as currently written:

/*
   compare `k1' and `k2'.  This function is a heart of "key allocation
   policy".  All you need to implement new policy is to add yet another
   clause here.
*/
static inline cmp_t keycmp(const reiser4_key * k1 /* first key to compare */ ,
                           const reiser4_key * k2 /* second key to compare */ )
{
        cmp_t result;
        /*
         * This function is the heart of reiser4 tree-routines. Key comparison
         * is among most heavily used operations in the file system.
         */
        /* there is no actual branch here: condition is compile time constant
         * and constant folding and propagation ensures that only one branch
         * is actually compiled in. */
        if (REISER4_PLANA_KEY_ALLOCATION) {
                /* if physical order of fields in a key is identical with
                   logical order, we can implement key comparison as three
                   64-bit comparisons. */
                /* logical order of fields in plan-a:
                   locality->type->objectid->offset. */
                /* compare locality and type at once */
                result = KEY_DIFF_EL(k1, k2, 0);
                if (result == EQUAL_TO) {
                        /* compare objectid (and band if it's there) */
                        result = KEY_DIFF_EL(k1, k2, 1);
                        /* compare offset */
                        if (result == EQUAL_TO) {
                                result = KEY_DIFF_EL(k1, k2, 2);
                                if (REISER4_LARGE_KEY && result == EQUAL_TO) {
                                        result = KEY_DIFF_EL(k1, k2, 3);
                                }
                        }
                }
        }
        return result;
}
The macro KEY_DIFF_EL is defined as:

/* helper macro for keycmp() */
#define KEY_DIFF_EL(k1, k2, off)                                \
({                                                              \
        __u64 e1;                                               \
        __u64 e2;                                               \
                                                                \
        e1 = get_key_el(k1, off);                               \
        e2 = get_key_el(k2, off);                               \
                                                                \
        (e1 < e2) ? LESS_THAN : ((e1 == e2) ? EQUAL_TO : GREATER_THAN); \
})

static inline __u64 get_key_el(const reiser4_key * key,
                               reiser4_key_field_index off)
{
        return d64tocpu(&key->el[off]);
}

So, under a little-endian architecture, the above code would expand to something like:
result = (k1.el[0].datum < k2.el[0].datum) ? LESS_THAN :
         ((k1.el[0].datum == k2.el[0].datum) ? EQUAL_TO : GREATER_THAN);
if (result == EQUAL_TO)
        result = (k1.el[1].datum < k2.el[1].datum) ? LESS_THAN :
                 ((k1.el[1].datum == k2.el[1].datum) ? EQUAL_TO : GREATER_THAN);
if (result == EQUAL_TO)
        /* test next key element... */
return result;

Note that functions built upon keycmp, keylt for instance, simply test the cmp_t result from keycmp:
keylt( k1, k2 )
{
        return ( keycmp(k1, k2) == LESS_THAN );
}
The new code eliminates conditionals in keycmp as well as in the functions that wrap it.
Here are the new keycmp and keylt functions:
/* return type from keycmp */
typedef enum { LESS_THAN_NEW=0, GREATER_THAN_NEW=1, EQUAL_TO_NEW=2 } cmp_t_new;
static inline cmp_t_new
keycmp_new( const reiser4_key * k1 /* first key to compare */ ,
const reiser4_key * k2 /* second key to compare */ )
{
__u64 v1, v2;
/*
Calling this version with
k1[0] < k2[0], will make 1 64-bit compare and one branch.
k1[0] = k2[0], will make 1 64-bit compare and one branch (and so on)
k1[0] > k2[0], will make 1 64-bit compare and one branch.
Calling the original function with
k1[0] < k2[0], will make 1 64-bit compare, 1 int compare and two branches.
k1[0] = k2[0], will make 1 64-bit compare, 1 int compare and three branches (and so on)
k1[0] > k2[0], will make 1 64-bit compare, 1 int compare and three branches.
per key element compared.
*/
if ((v1= get_key_el(k1,0)) != (v2=get_key_el(k2,0)))
return (v1 > v2);
if ((v1=get_key_el(k1,1)) != (v2=get_key_el(k2,1)))
return (v1 > v2);
if ((v1= get_key_el(k1,2)) != (v2=get_key_el(k2,2)))
return (v1 > v2);
if ((v1=get_key_el(k1,3)) != (v2=get_key_el(k2,3)))
return (v1 > v2);
return EQUAL_TO_NEW;
}
/*
keylt
Calling this version with
k1[0] < k2[0], will make 1 64-bit compare and one branch.
k1[0] = k2[0], will make 1 64-bit compare and one branch.
k1[0] > k2[0], will make 1 64-bit compare and one branch.
Calling the original function with
k1[0] < k2[0], will make 1 64-bit compare, 2 int compares and two branches.
k1[0] = k2[0], will make 1 64-bit compare, 2 int compares and three branches.
k1[0] > k2[0], will make 1 64-bit compare, 2 int compares and three branches.
*/
static inline int
keylt_new( const reiser4_key * k1 /* first key to compare */ ,
const reiser4_key * k2 /* second key to compare */ )
{
__u64 v1, v2;
if ((v1= get_key_el(k1,0)) != (v2=get_key_el(k2,0)))
return (v1 < v2);
if ((v1=get_key_el(k1,1)) != (v2=get_key_el(k2,1)))
return (v1 < v2);
if ((v1= get_key_el(k1,2)) != (v2=get_key_el(k2,2)))
return (v1 < v2);
return (get_key_el(k1,3) < get_key_el(k2,3));
}

The cmp_t enum is redefined because, when a non-equal key element is encountered, keycmp_new compares with >. This makes LESS_THAN and GREATER_THAN 0 and 1 respectively (the boolean comparison result), with EQUAL_TO defined as 2. The other comparison functions are no longer defined "on top of" keycmp; each uses the comparison operator for which it is named.
Here is a comparison of the assembler instructions for keycmp vs the new version.
znode_contains_key_strict
This function is currently built with calls to keyge and keycmp. The new, purpose-built version uses the new keyge and an inline version of the new keylt code in order to accommodate the isunique parameter. Here are the function definitions:
/*
true if @key is strictly within @node. We are looking for a
possibly non-unique key, and its item may be at the edge of @node;
maybe it is in the neighbor.
*/
static int
znode_contains_key_strict( znode * node /* node to check key against */ ,
const reiser4_key * key /* key to check */,
int isunique)
{
int answer;
if (keyge(key, &node->rd_key))
return 0;
answer = keycmp(&node->ld_key, key);
if (isunique)
return answer != GREATER_THAN;
else
return answer == LESS_THAN;
}
/* new version */
static int
znode_contains_key_strict_new( znode* node /* node to check key against */,
const reiser4_key * sk /* key to check */,
const int isunique )
{
__u64 v1, v2;
if (keyge_new(sk, &node->rd_key))
return FALSE;
do {
if((v1=get_key_el(&node->ld_key,0))!=(v2=get_key_el(sk,0)))
break;
if((v1=get_key_el(&node->ld_key,1))!=(v2=get_key_el(sk,1)))
break;
if((v1=get_key_el(&node->ld_key,2))!=(v2=get_key_el(sk,2)))
break;
v1 = get_key_el(&node->ld_key,3);
v2 = get_key_el(sk,3);
} while(0);
/*
The 'trick' with the isunique addition removes a branch.
The isunique argument must be initialized to 0 or 1, i.e. the
boolean result of (h->flags & CBK_UNIQUE), not the raw flag value.
*/
return (v1 < (v2 + isunique));
}
Further znode_contains_key_strict Improvements
There is a second version that leverages an as-yet-unimplemented znode structure member: delim_key_diff_el. This unsigned char holds the index of the first key element at which ld_key is smaller than rd_key. With this "hint," znode_contains_key_strict needs to make only one 64-bit key element comparison between the search key and ld_key, and only when the search key differs from rd_key at the delim_key_diff_el element. Otherwise, the result is less than or greater than depending on whether the element at which the search key differs from rd_key comes before or after delim_key_diff_el. Feedback is welcome as to whether the apparent cache-scanning performance increase justifies the storage and maintenance costs of the new znode member.
static int
znode_contains_key_strict_hint( znode* node /* node to check key against */,
const reiser4_key * sk /* key to check */,
const int isunique )
{
__u64 v1, v2;
unsigned char el=0;
do {
if ((v1=get_key_el(&node->rd_key,el)) != (v2=get_key_el(sk,el)))
break;
/*
could be explicit el=1 etc.,
but gcc was smart enough to detect this.
*/
el++;
if ((v1=get_key_el(&node->rd_key,el)) != (v2=get_key_el(sk,el)))
break;
el++;
if ((v1=get_key_el(&node->rd_key,el)) != (v2=get_key_el(sk,el)))
break;
el++;
v1 = get_key_el(&node->rd_key,el);
v2 = get_key_el(sk,el);
} while(0);
/*
this conditional and the next were originally in one
|| test, but the performance was not what I expected. I
looked at the assembler and gcc decided to test the 'easier'
char comparison first, but this is not what the logic requires...
*/
if (v2 >= v1)
return FALSE;
/*
the addition with isunique bypasses the extra conditional in the
original code and appears to give correct results. isunique should
be changed to 0 or 1 instead of the result of (h->flags & CBK_UNIQUE).
there's an extra bonus in this purpose-built function for
big endian systems. v2 contains the swab()ed value from the
last comparison with the rd_key, thus saving swab() call(s) in
addition to the comparison(s).
*/
if(el == node->delim_key_diff_el)
return get_key_el(&node->ld_key, el) < (v2+isunique);
return (el > node->delim_key_diff_el);
}
Impact on the System
- Calls to znode_contains_key_strict remain unchanged.
- Calls to keylt, keygt, etc. remain unchanged.
- Calls to keycmp that test the return value in a switch should perhaps have their switch cases reordered, since the enum values have changed. There are only three such calls: in plugin/item/internal.c, plugin/node/node40.c and search.c. The call in search.c is in znode_contains_key_strict, which would get the new implementation, which does not use keycmp.
- Calls to keycmp that only test for EQUAL_TO can use a truth-table array to bypass the conditional. There are only three such calls, all in kassign.c: key_id_cmp(), key_id_key_cmp() and de_id_cmp(). These functions do not test the return value themselves, but their callers check it against the cmp_t EQUAL_TO. These functions have, respectively, 0, 0, and 2 callers (the two in plugin/dir/dir.c), at least in the code snapshot I have.
Caveats
- Testbed compiled and tested under Cygwin (we're moving and my Linux box is not available, so I'm working on my laptop).
Test Notes
- Cache scanning uses an array of size CBK_CACHE_SLOTS (16) not a linked list. Using the list walking did not seem necessary to test function performance.
- The cache array nodes get progressively 'smaller.' The first rd_key starts with a copy of MAXIMAL_KEY with el[3].datum - 1. Each rd_key is equal to the prior ld_key and each cache[n].ld_key is a copy of the rd_key with el[3].datum - 50000.
- Note that in the search for MAXIMAL_KEY the 'hint' version is slower. This is because the function must determine the diff_el between the search_key and the rd_key. Since most cache searches should be for nodes already in the cache, this should not prove to be an issue, correct?
- Only tested for little endian config on Intel.
Further Opportunities
- If the new functions are valid, does the performance increase mean that the number of cache slots should be reexamined? I suspect the answer is 'no,' as the cache size has been tuned, but I thought I'd ask.
Data values are CPU time per run; "Orig/New" is the ratio of the original function's average CPU time to the new version's (higher is better).

| Search For         | Function                       | Data             | Total   | Orig/New |
| cache[15].ld_key+1 | znode_contains_key_strict      | Average CPU time | 0.27960 |          |
|                    |                                | Min CPU Time     | 0.27840 |          |
|                    |                                | Max CPU Time     | 0.28040 |          |
|                    | znode_contains_key_strict_hint | Average CPU time | 0.05816 | 4.8074   |
|                    |                                | Min CPU Time     | 0.05800 |          |
|                    |                                | Max CPU Time     | 0.05820 |          |
|                    | znode_contains_key_strict_new  | Average CPU time | 0.09444 | 2.9606   |
|                    |                                | Min CPU Time     | 0.09400 |          |
|                    |                                | Max CPU Time     | 0.09620 |          |
| MAXIMAL_KEY        | znode_contains_key_strict      | Average CPU time | 0.14140 |          |
|                    |                                | Min CPU Time     | 0.14020 |          |
|                    |                                | Max CPU Time     | 0.14420 |          |
|                    | znode_contains_key_strict_hint | Average CPU time | 0.06620 | 2.1360   |
|                    |                                | Min CPU Time     | 0.06420 |          |
|                    |                                | Max CPU Time     | 0.07020 |          |
|                    | znode_contains_key_strict_new  | Average CPU time | 0.03720 | 3.8011   |
|                    |                                | Min CPU Time     | 0.03600 |          |
|                    |                                | Max CPU Time     | 0.03800 |          |
| MINIMAL_KEY        | znode_contains_key_strict      | Average CPU time | 0.09720 |          |
|                    |                                | Min CPU Time     | 0.09600 |          |
|                    |                                | Max CPU Time     | 0.09800 |          |
|                    | znode_contains_key_strict_hint | Average CPU time | 0.02560 | 3.7969   |
|                    |                                | Min CPU Time     | 0.02400 |          |
|                    |                                | Max CPU Time     | 0.02600 |          |
|                    | znode_contains_key_strict_new  | Average CPU time | 0.03420 | 2.8421   |
|                    |                                | Min CPU Time     | 0.03420 |          |
|                    |                                | Max CPU Time     | 0.03420 |          |
| n/a                | keycmp                         | Average CPU time | 0.13660 |          |
|                    |                                | Min CPU Time     | 0.13620 |          |
|                    |                                | Max CPU Time     | 0.13820 |          |
|                    | keycmp_new                     | Average CPU time | 0.06660 | 2.0511   |
|                    |                                | Min CPU Time     | 0.06620 |          |
|                    |                                | Max CPU Time     | 0.06820 |          |
|                    | keyeq                          | Average CPU time | 0.21880 |          |
|                    |                                | Min CPU Time     | 0.21640 |          |
|                    |                                | Max CPU Time     | 0.22840 |          |
|                    | keyeq_new                      | Average CPU time | 0.04720 | 4.6356   |
|                    |                                | Min CPU Time     | 0.04600 |          |
|                    |                                | Max CPU Time     | 0.04800 |          |
|                    | keyge                          | Average CPU time | 0.13500 |          |
|                    |                                | Min CPU Time     | 0.13420 |          |
|                    |                                | Max CPU Time     | 0.13620 |          |
|                    | keyge_new                      | Average CPU time | 0.05840 | 2.3116   |
|                    |                                | Min CPU Time     | 0.05800 |          |
|                    |                                | Max CPU Time     | 0.06000 |          |
|                    | keygt                          | Average CPU time | 0.13460 |          |
|                    |                                | Min CPU Time     | 0.13220 |          |
|                    |                                | Max CPU Time     | 0.13620 |          |
|                    | keygt_new                      | Average CPU time | 0.05400 | 2.4926   |
|                    |                                | Min CPU Time     | 0.05400 |          |
|                    |                                | Max CPU Time     | 0.05400 |          |
|                    | keyle                          | Average CPU time | 0.14944 |          |
|                    |                                | Min CPU Time     | 0.14820 |          |
|                    |                                | Max CPU Time     | 0.15440 |          |
|                    | keyle_new                      | Average CPU time | 0.04256 | 3.5113   |
|                    |                                | Min CPU Time     | 0.04020 |          |
|                    |                                | Max CPU Time     | 0.04420 |          |
|                    | keylt                          | Average CPU time | 0.14820 |          |
|                    |                                | Min CPU Time     | 0.14620 |          |
|                    |                                | Max CPU Time     | 0.15020 |          |
|                    | keylt_new                      | Average CPU time | 0.03840 | 3.8594   |
|                    |                                | Min CPU Time     | 0.03800 |          |
|                    |                                | Max CPU Time     | 0.04000 |          |
(C) 2004 David Dabbs david at dabbs dot net, in accordance with reiser4/README.