Content area
In a large-scale distributed network, a naming service is used to achieve location transparency and provide effective content discovery. However, fast and accurate name retrieval in the massive name set is laborious. Approximate set membership data structures, such as Bloom filter and Cuckoo filter, are very popular in distributed information systems. They obtain high query performance and reduce memory requirements through the abstract representation of information, but at the cost of introducing query error rates, which will ultimately affect content service quality. In this paper, in order to obtain higher space utilization and a lower query false positive rate, we propose a flexible fingerprint cuckoo filter (FFCF) for information storage and retrieval, which can change the length and type of fingerprints adaptively. In our scheme, FFCF uses longer fingerprints under low occupancy and has the ability to correct errors by changing the type of stored fingerprints. Moreover, we give a theoretical proof and evaluate the performance of FFCF by experimental simulations with synthetic data sets and real network packets. The results demonstrate that FFCF can improve memory utilization, significantly reduce false positive errors by nearly 90
Introduction
In order to improve transmission efficiency and enhance content security, large-scale distributed networks often provide a mechanism for automatically locating content by mapping its name to its storage location [1]. The performance of this naming service directly affects the overall quality of network data services. In recent years, several projects have been conducted for the research and development of content-centric network architectures [2, 3, 4–5]. These projects use hierarchical or flat naming methods to name content, and communication is based on the content name rather than the location and address of the content source. However, longer and unbounded content names make the longest prefix match [6] no longer applicable to name lookup. How to find content name quickly and accurately from the massive set of names becomes an important issue of distributed network.
To speed up processing, many information systems and network applications use approximate set membership data structure (ASMDS) for membership testing, such as packet classification [7], IP search [8], cache sharing [9], network security monitoring [10], inter-domain routing [11] and many other applications [12], recent work [13] has also applied it to name lookup. Membership testing determines whether an item x belongs to the set S. Generally, the accurate representation of the set S is stored in external storage or slower-access memory, such as a disk-based database; subsequent access to external storage is performed only if it is determined that item x belongs to the set S, otherwise no further processing is required. The most popular ASMDS is the Bloom filter (BF) [14]. And there has been a lot of work [15, 16–17] to improve against the standard BF to support deletion, but at the cost of a significant sacrifice in space efficiency or retrieval performance. The Cuckoo filter (CF) [18] is a deletion-enabled ASMDS proposed to address the limitations of BF, which stores the fingerprints of items in the cuckoo hash table [19]. CF has even better space efficiency and higher query performance than BF for the same configuration of false positives, because CF accesses up to two contiguous memory blocks in each query.
Nonetheless, CF still has some limitations in practice. On the one hand, the advantages of CF need to be realized with high occupancy, which is an ideal situation. In general, CF does not run at full load for a long time, and since the number of items stored can be arbitrary, it typically requires 25 additional memory, and in the worst case, 50 extra memory [20]. This not only wastes space, but also brings undesirable false alarm rate performance, which has been confirmed in the literature [21] by comparing false positive rates of CF and BF at different occupancy. On the other hand, the incidence of false positives in practice is usually not equal to the theoretical false positive rate of CF [22]. Taking the packet classification application as an example, these packets are not independent of each other in practice and the queries are often duplicated. When a large number of packets come from the same sender, a false positive means that the packets are all incorrectly classified, resulting in costly subsequent processing.
In order to obtain higher space utilization and fewer query errors, in this paper, we expect to improve the existing Cuckoo filter for information retrieval in distributed network, which should be able to support insertion and deletion, and use more flexible fingerprints to overcome the shortcomings of the Cuckoo filter in practice, mainly for the space wasting problem in CF and the false positive problem in the continuous query process. We name this filter the Flexible fingerprint cuckoo filter (FFCF), the main contributions of our paper are summarized as follows:
(1) We design a data structure for information storage and name lookup in a distributed network, FFCF, which uses flexible fingerprints to represent the elements stored in the name set. The corresponding relationship between fingerprints and elements ensures fast lookup and reliable deletion.
(2) FFCF allows fingerprints of different lengths to appear simultaneously in a bucket and adapts to dynamic load changes, the variable fingerprint length enhances its scalability and memory efficiency. Moreover, we design a set of rules for FFCF to ensure accurate identification of the various fingerprints currently stored.
(3) We give a theoretical proof that the long fingerprint of FFCF can reduce the probability of false positives in low load, and when the false positive rate of shortened fingerprint length increases at high load, changing the fingerprint type can correct errors and reduce the chance of false positive for the same lookup.
(4) We evaluate the performance of FFCF through extensive simulations. The results show that FFCF can maintain high query efficiency, improve memory utilization, and reduce false positives.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 introduces the FFCF and gives a theoretical proof of its reduction of the false positive rate. Section 4 provides a comprehensive performance evaluation of FFCF. Finally, we summarize our work in Sect. 5.
Related work
In this section, we briefly introduce two main types of data structures, the traditional Bloom filter and the recently proposed Cuckoo filter.
Bloom filter (BF) [14] is a probabilistic data structure for membership testing, it is usually stored in RAM to quickly judge whether an item belongs to a given set or not. A positive answer means that the item has a high probability of being in the set, and a negative answer means that it absolutely does not exist. BF is essentially an array of fixed-length bits of size m that takes up little memory, and all bits are initialized to “0” at first.
During the membership testing, the tested item is hashed by k hash functions and the results are mapped to k positions in the array. If all k positions are “1”, BF gives a positive result. If any position is “0”, the result is negative. However, BF suffers from false positive errors. For an item that is not in the set, BF may also give a positive lookup result when all k positions mapped to the item are “1”, even though those “1” are set by other elements. The false positive rate of a Bloom filter consisting of m bits and containing n elements using k independent hash functions is denoted as .
1
To obtain the ideal false positive rate, the optimal number of hash functions is set according to Eq. (2).2
BF has a small space overhead and an O(1) search time complexity, therefore it has become popular for name lookup in distributed network. Quan et al. [23] divide content names into prefixes and suffixes, they use Bloom filter to match prefixes to speed up the search, and conduct an accurate search for suffixes to reduce false positives. The authors in [24] have proposed a combination of tree bitmap and Bloom filter for name lookup, which has good scalability and memory efficiency. Since Bloom filter does not support deletion, Dai et al. [13] use the counting Bloom filter to balance the load in the lookup strategy and achieve dynamic update. Wang et al. [25] have proposed a learning Bloom filter for content name search, after name pre-filtering by recurrent neural networks [26], the Bloom filter is used to eliminate false negative errors and further reduce the false positive rate of lookup.Unlike BF, Cuckoo filter (CF) [18] natively supports deletion, it stores the fingerprint of items instead of the items themselves to conserve space and ensure the false positive rate as low as possible. Structurally, CF is a two-dimensional matrix consisting of an array of size m buckets, each bucket including s slots for storing fingerprints, where each fingerprint is computed by a hash function hf. To avoid conflicts, CF provides the locations of b optional buckets for each item x and stores its fingerprint in one of the buckets. The common configuration of CF is , , and reference [27] gives a compromise between b and s. More specifically, the positions of two buckets h1(x) and h2(x) are calculated as:
3
CF also has a false positive rate, noted as , which is related to the occupancy rate a. The association between them is shown in Eq. (4).4
Recently, several CF variants have been proposed to obtain better performance. The Simplified Cuckoo filter [28] is a simplification of CF to obtain regular graphs and provide performance analysis. The Adaptive Cuckoo Filter [22] can reselect the hash function in case of collisions and remove false positive errors. The Configurable Bucket Cuckoo Filter [21] with a flag bit can determine whether there are 3 or 4 fingerprints in each bucket and optimize the false positive rate, but it limits the fingerprint length, otherwise it may waste storage space. Vacuum Filter [20] stores the locations of two candidate buckets in the same cache line or page, to significantly improve throughput. In addition, the Dynamic Cuckoo Filter [29] supports flexible capacity expansion or reduction by linking multiple CFs, reducing space overhead at the cost of increased lookup time and false positives. Morton Filter [30] uses compressed stored blocks to logically replace the bucket and avoid unnecessary memory accesses to obtain high throughput, but its application scene is limited. Marked Cuckoo Filter [31] stores both the fingerprint and affiliation information of elements in each slot, thus supporting multiset queries and the merging of multiple filters with the same parameter. Overall, these efforts address the problem of theoretical false positives or throughput, with less research on limitations in the actual scenario, such as low space utilization and serious false positive errors, in addition, the application of cuckoo filter to name lookup still requires new attempts.Flexible fingerprint cuckoo filter construction
In this part, we propose the Flexible Fingerprint Cuckoo Filter (FFCF). First, we analyze the shortcomings of the Cuckoo filter in Sect. 3.1. Then the structure and design of FFCF are described in Sect. 3.2. The theoretical proofs are given for improvement in Sect. 3.3. Section 3.4 describes the algorithms for lookup, insertion, and deletion. Finally, Sect. 3.5 compares FFCF with a similar CF variant.
Problem statement
Cuckoo filter has some limitations in the actual scenario, one of them is the high false positive rate caused by space waste. CF maintains similar performance to BF in terms of retrieval speed and spatial efficiency, and also supports deletion, but this is actually a trade-off at the expense of query accuracy. As mentioned by Bin et al. [18], CF dominates across the board only when occupancy is high. Pedro et al. [21] also demonstrated that BF has a lower false alarm rate than CF over a wider range of load variations. In practice, CF needs to know in advance the number of elements that will be inserted in order to obtain theoretically excellent classification power, but this is not possible in most applications. For convenience, applications usually allocates 25–50% additional memory to CF [23], which is not only a waste but also disguises the impact on the false alarm rate as shown in Eq. (4).
Another shortcoming is that CF does not have the ability to correct errors. When CF is faced with multiple duplicate lookups, an error has a huge impact on the statistical factual false positive rate, as all subsequent queries on the element in error are bound to be wrong. A high false positive rate directly leads to filter failure and costly follow-up treatment. Michael et al. [22] also point out this shortcoming in their paper and try to optimize it with a set of hash function loops. Our work attempts to use flexible fingerprints in CF to optimize the false positive rate both theoretically and practically.
Design of FFCF
Fingerprints, as an abstract representation of elements in Cuckoo filter, directly affect the performance of the filter. We focus our improvements on fingerprints. The false positive rate can be improved by flexibly changing the length and type of fingerprints, and reasonable placement of various different fingerprints can also improve space utilization. For the sake of description, our cuckoo filter with flexible fingerprint is denoted as FFCF.
FFCF uses two cuckoo hash tables of m buckets and 4 slots, one is used to store the complete information of the element as the main table, the other stores the fingerprint as the filter. To support flexible fingerprint modification, it is necessary to keep the element and fingerprint in the same slot of the same bucket and perform the same operations on the two cuckoo hash tables during insertion and deletion. Unlike the fingerprint with the length of f used by the standard CF, FFCF calculates a longer fingerprint for the element x, which is recorded as fpLx, and the length is 2f. Accordingly, the positions of the two alternative buckets are modified as follows:
5
The FFCF also contains two types of short fingerprints, fpShx and fpSlx, which are computed by two independent hash functions for x separately. It can be equivalent to dividing fpLx in the middle to obtain two short fingerprints with length f, the short fingerprint with high f bit is fpShx, and the short fingerprint with low f bit is fpSlx. Therefore, when storing the fingerprint of an element x, there are three options: long fingerprint fpLx, short fingerprint fpShx, or fpSlx, which makes the fingerprint of x flexible and variable in length and hash function. As mentioned above, a bucket contains four slots, donated as s1, s2, s3, and s4, respectively. s1 and s2 can store a long fingerprint fpL together, and so can s3 and s4, so that s1 and s3 are fixed to store the high f bits of the long fingerprint. When storing short fingerprints, s1 and s3 are also used to store high short fingerprint fpSh, and s2 and s4 are used to store fpSl. FFCF uses additional flag bits in each bucket to indicate the state and number of fingerprints stored in the bucket. Of course, it brings more space cost compared to the original CF, but we also make a rule to limit the flag to 2 bits.[See PDF for image]
Fig. 1
The structure of FFCF
Specifically, the number of fingerprints stored in a bucket ranges from 0 to 4, with five cases. We can compress all cases into four states, and each state limits the corresponding flag and storage rules as follows:
State 1. Flag is 0, at most one long fingerprint may be stored in the bucket, and the fingerprint must be stored with fixed positions s1 and s2.
State 2. Flag is 1, two long fingerprints are stored in the bucket, occupying all 4 slots.
State 3. Flag is 2, two short fingerprints and one long fingerprint are stored in the bucket. s1 stores a short fingerprint fpSh, s2 stores a short fingerprint fpSl, s3 and s4 store a long fingerprint together.
State 4. Flag is 3, the bucket is full of 4 short fingerprints. s1 and s3 store fpSh, s2 and s4 store fpSl.
In this paper, we denote the two slots for storing a long fingerprint as a slot pair, i.e., s1 and s2 are collectively called slot pair sp1, and s3 and s4 are called sp2. The above rules ensure that the short fingerprints stored in FFCF must appear in pairs and be stored in slot pairs. The structure of FFCF is shown in Fig. 1.
We set up a main table to store the complete information about the item and ensure that it is in the same position as its fingerprint. Because we always use the long fingerprint fpL to determine the location of the hash bucket in FFCF, while the fingerprints stored in the bucket are very flexible. It is effortless to obtain two short fingerprints from a long fingerprint, but it is difficult to obtain a long fingerprint or another short fingerprint from a short fingerprint. Sometimes it is necessary to access the complete information of items and calculate long fingerprints, although in distributed information systems they can be stored in large memory with slow access.
Optimizing for the false positives
Compared with CF, FFCF uses more flexible fingerprints to effectively reduce false positives. In this section, we will give a theoretical proof from two aspects: proving the effect of fingerprint length and changing short fingerprint types.
Effect of fingerprint length
FFCF can dynamically adjust the fingerprint length of the stored elements according to the load, and preferentially fill the storage bucket with long fingerprints, fundamentally changing the error rate. In this paper, FFCF adopts a fixed configuration with the number of candidate buckets and the number of slots per bucket. Ideally, when the total number of fingerprints stored in the two candidate buckets does not exceed 4, we regard them all as long fingerprints; short fingerprints start to appear only when the load factor is above 50. Let us take the query of an element x not stored in FFCF as an example to discuss the relationship between occupancy rate a and false positive rate .
When the load rate , the false positives will occur only when x matches at least one fingerprint in the candidate buckets. The total number of long fingerprints in the candidate buckets is abs, and there is no short fingerprint. Since the probability of two long fingerprints being the same is , the false positive probability in this case can be deduced as follows:
6
When , short fingerprints begin to appear in a pair of candidate buckets, and the average number of short fingerprints could be considered as k. The number of long fingerprints is calculated as , the average total number of fingerprints in the candidate buckets is , which can also be expressed as abs, so .Matching with long or short fingerprints is independent of each other, as long as an element x is found in the candidate bucket, it means an error. Therefore, the false positive rate can be calculated by the probability of matching to long fingerprints fpL, short fingerprints fpSh and fpSl, denoted as , , and , respectively.
7
In summary, under ideal conditions, the relationship between the false positive rate and the load rate is as follows:8
Comparing Eqs. (4) and (8), it is obvious that FFCF has lower false positive rate than CF under almost all load conditions, the advantage of FFCF is even more pronounced when occupancy is low.Effect of changing short fingerprint types
In a practical application of the cuckoo filter, a lookup for an unsaved element x results in a false positive, which means that subsequent queries on this element will be incorrect. However, two types of short fingerprints are allowed in FFCF, which makes it possible to recalculate the short fingerprints representing y and z in the bucket by swapping their positions in the main table when an error occurs in the match between the short fingerprints of x and y, thus reducing the number of subsequent errors. We denote the two slots for storing the short fingerprints of y and z as a slot pair. Since replacing the short fingerprint can be understood as using another set of independent hash functions for FFCF, the probability of false positives when a continuous query is made on the same x is as Eq. (9). Clearly, this is much better than CF.
9
The situation is more complicated in the actual use scenario of FFCF. Although we believe that the element x may be queried again in the future, it is not guaranteed that x will be queried several times continuously. In other words, several lookups for x may be interspersed with a large number of queries for other elements. The difference in the number of false positives caused by whether FFCF enables error correction in this case is discussed next. For convenience, we will notate FFCF without error correction enabled as FFCF_NC.Given a query sequence {, ,..., } consisting of a total of N query requests for E elements not stored in the main table. Since the number of buckets has no effect on false positives, to simplify the analysis, we consider that no insertion and deletion are performed during the query of this sequence, these elements are generated randomly and all match the same slot pair of a given candidate bucket during query. When no short fingerprint is stored in the slot pair, no error correction operation can be performed and all false positives cannot be avoided. Therefore, we only focus on queries that generate false positives due to matching short fingerprints in the slot pair, and classify all false positives in the query sequence into three types:
Element set , which contains queries, the elements of which can only successfully match the initial two short fingerprints in the slot pair, and we denote the two fingerprints as ; Element set includes a total of queries, its members match to two short fingerprints regenerated after one error correction in the slot pair of FFCF and result in false positives, the regenerated fingerprints denoted as ; Element set consists of a total of queries whose members can match to both above types. Among them, false positives from can be ignored, because their probability of occurrence is much lower than the other two false positives, and they have less impact on the number of false positives when comparing FFCF and FFCF_NC.
The simplified random query sequence now consists of queries from and queries from . Without loss of generality, we assume that the short fingerprints stored in FFCF at the start of the query is . In FFCF_NC, this sequence will bring false positives regardless of the query order, but the order has a great influence on the error Err in FFCF.
In a completely random query sequence, the error correction function does not lead to a change in the number of false positives, but as the sequence becomes ordered, error correction will reduce false positives. The proof process is as follows:
Surely, if , ; if , then or 2, when the query to an element (), FFCF has one error and replaces the fingerprints as , if FFCF need to query () later, another error occurs and the fingerprint changes back to again, then no more false positives. When , there are total G gaps in queries to , and queries to must occur in these gaps. Now, these G gaps are labeled as , where represents the position before the first query and the other represents the position after the query . Put randomly into these gaps, the number of placed at is , satisfies Eq. (10), there are total possibilities.
10
Those placed in position will not contribute to false positives, in position will increase the number of false positives by . For the other positions, as long as , the number of false positives will increase by 2, so the final number of false positives Err in FFCF is Equation (11).11
12
13
Combining Eqs. (11) and (13) above, we get Eq. (14). Obviously, the value of Err is related to , Err , and we can distinguish whether it is odd or even. Furthermore, we give the expectation of false positives in a random query sequence based on whether Err is an odd or an even number in Equation (15), where we introduce a temporary variable q, to denote the number of errors Err in the odd and even cases with and 2q, respectively.14
15
After a large number of numerical calculations by the computer, Eq. (15) can be simplified to Eq. (16), and eventually is only related to G and . We define a variable Diff to represent the effect of FFCF with error correction enabled on the number of false positives.16
17
When dealing with a large random query sequence, we believe that and are approximately equal. If the sequence is completely out of order, , because can be arbitrarily placed in queries to . At this point, , indicating that error correction does not bring harm. If the sequence is ordered, , because can only be placed in two gaps, or , result such as {,..., , ,..., }, , that has brought huge improvements.Combining Eqs. (10), (16) and Equation (17), we have for a random query sequence. Diff shows the power of error correction in FFCF, which we will demonstrate more visually with the experimental results in Sect. 4.
Algorithms of FFCF
Lookup
To query whether an item x exists in FFCF, we first calculate the long fingerprint fpLx and the positions h1(x) and h2(x) of two optional buckets, then determine the fingerprint type and the corresponding slot according to the flag in each candidate bucket. For example, an alternative bucket with flag 0 is selected, which indicates that a long fingerprint may be stored in the bucket, then fpLx is used to match the long fingerprint stored in the slot pair sp1; another candidate bucket with flag 3 means that four short fingerprints are stored in this bucket, fpLx is split into two short fingerprints fpShx and fpSlx in the middle. Match the short fingerprints in slots s1 and s3 with fpShx, fingerprints in s2 and s4 with fpSlx. If any match is successful in either candidate buckets, FFCF gives a success; otherwise, a failure is returned.
[See PDF for image]
Algorithm 1
Lookup for an item x
FFCF also has an optional error correction function for lookup due to the ability to retrieve complete elements from the main table. A false positive error occurs when a query matches a short fingerprint of another item, FFCF can swap the positions of two items stored in a slot pair in the main table, and recalculate the short fingerprints in this slot pair in the filter, to replace the short fingerprint type in anticipation of eliminating false positives for future queries. The whole process of lookup is shown in Algorithm 1.
Insertion
[See PDF for image]
Algorithm 2
Insertion of an item x
When inserting x into the FFCF, we also calculate the long fingerprint fpLx, h1(x) and h2(x) first. If there is a candidate bucket whose flag is not 3, then x can be inserted directly into that bucket. Specifically, if the flag is 0 means that there are still empty slots in the bucket, just store the long fingerprint fpLx in the first empty slot pair; if the flag is 1 or 2 means no empty slots, but the slot pair storing the first long fingerprint can be used to store two short fingerprints, the high f bit of the long fingerprint is reserved to represent the original element, and the low f bit is replaced by the short fingerprint fpSlx of x. Finally, the flag should be modified to correctly represent the current state of the bucket.
When the flags of the two candidate buckets are both 3, there is no choice but to execute kick-out, because both buckets are already full of 4 short fingerprints. Randomly select a slot to kick out a short fingerprint, and store the short fingerprint of x in this position. Of course, it is necessary to ensure the correspondence between the fingerprint type and the slot. Victim needs to recalculate the long fingerprint by accessing the original data and assigning it to the location of the alternative bucket. If the flag of the backup bucket is not 3, the fingerprint is stored according to the preceding rules; otherwise, continue kicking until the maximum number of iterations is exceeded or there are no more victims. The insertion operation is shown in Algorithm 2. It should be noted that during kick-out, the original data stored in the main table also needs to be moved to keep the position of the item and fingerprint consistent.
Deletion
[See PDF for image]
Algorithm 3
Deletion of an item x
The process of deleting an item x requires a query first, and once the query fails, there is no need to delete it. If the result is successful, due to possible false positives, we continue to compare the complete element in the same position of the main table, and only when a match is found do we remove the fingerprint. The corresponding relationship between fingerprints and elements ensures reliable deletion. However, the storage rules require us to change the location and length of the remaining fingerprints in the bucket. The specific process is described in Algorithm 3. Similarly, element deletion and position shifting are performed on the main table to maintain one-to-one correspondence between fingerprints and elements.
Comparison with configurable bucket cuckoo filter
Structurally, FFCF sets a flag of 2-bits to indicate the storage state of the current bucket. Similarly, CBCF [21] uses a 1-bit flag to distinguish between two types of buckets with different fingerprint lengths. Overall, they are based on CF and improved for fingerprint length. And both require a main table to store complete elements, so as to realize the switching between fingerprints of different lengths and support reliable deletion. Therefore, it can be understood that FFCF is the further development and evolution of CBCF.
However, FFCF allows a bucket to hold fingerprints of different lengths and types, while CBCF restricts multiple fingerprints in the same bucket must be of the same length. So they have different dominant ranges in terms of false positive rate. CBCF theoretically stores long fingerprints of length in all buckets when the load factor is below 70, at which point the FPR is about . But it restricts f to multiples of 3, otherwise there would be additional wasted space. FFCF has no requirement for f. Its advantage interval over CBCF is below 50 load, the FPR is . The extra 1-bit flag reduces the contribution of this bucket to FPR by a factor of compared to CBCF. When the load rises to around 70, the FPR will be higher than that of CBCF, although the expected number of fingerprints in the FFCF bucket is 3 and the average length is also . Because the fingerprint lengths in FFCF are f, f, and 2f, the variance is larger. This design of FFCF ensures that short fingerprints always appear in pairs, so the error correction mechanism can be effectively supported without more identification bits. This is of great significance in practice, because FFCF can reduce the number of false positives in high load range, while CBCF does not have a similar function.
In addition, CBCF requires a slightly more complex operation for the conversion of storage bucket types when performing insertion in high load intervals above 75. When the long fingerprint bucket is transformed into a short fingerprint bucket, all the fingerprints need to be moved into position. FFCF can divide a fingerprint of length 2f into two short fingerprints, and the new fingerprint can be inserted by covering one of the short fingerprints directly. However, considering that the insertion time is mainly affected by the kick operation, this design advantage may contribute marginally.
Evaluation
In this section, we represent the flexible fingerprint cuckoo filter1 as “FFCF" and the FFCF without error correction as “FFCF_NC". We also implemented the following three filters: Bloom filter BF [14], Cuckoo filter CF [18], and Configurable bucket cuckoo filter CBCF [21] for comparison. Specifically, for FFCF, FFCF_NC, CF, and CBCF, we use the configuration of , , and , which enables a storage capacity of 32768 items. For the fingerprint length, three configurations of are used for each filter. Correspondingly, the optimal number of hash functions used by BF is to ensure that the storage capacity and occupied space are nearly the same as the various filters mentioned above. Our experiments use a machine with two IntelXeonSilverprocessors ([email protected]GHz, 11MBL3 cache) and 64GB of RAM.
Insertion performance
After constructing several filters as described above, we use the same set of candidate elements for insertion until the predetermined occupancy rate is reached. The storage space utilization efficiency of different filters is counted. This metric is used to evaluate the proportion of the effectively utilized storage space in the total filter space. When , the result is shown in Fig. 2.
[See PDF for image]
Fig. 2
Space utilization efficiency at different occupancy
The space utilization efficiency rises linearly with occupancy for both standard CF and BF, the number of inserted elements in CF actually indicates the space utilization, while the curve for BF never exceeds 50 because more redundant “0" bits are set to avoid false positive errors. CBCF uses more buckets storing three elements at low occupancy, similarly FFCF stores elements with long fingerprints at low occupancy, which causes their utilization to rise faster, but they also fail to reach 100 due to the flag bits in the buckets. The difference is that FFCF uses two additional bits per bucket, while CBCF uses only one bit, as a result, FFCF has slightly lower space utilization than CBCF at high occupancy. Their multiple fingerprints are able to occupy almost all allocated space, which is also reflected in their superiority over CF at 95 occupancy.
[See PDF for image]
Fig. 3
Insertion time at different occupancy
Next, we measured the time to fill these filters from construction with different occupancy. Figure 3 shows this result for . BF takes the longest time, because each element insertion requires 12 hashes to turn the corresponding bit into “1". Various CFs only need 2 hashes to get the fingerprint and storage location, which is indeed much faster. CBCF and FFCF also maintain the main table to store the original data, and thus take slightly longer than CF.
Lookup performance
After inserting the items into the filters, in this section, we will test the false positive rate (FPR) of lookup at different occupancy rates. The set used for the query contains 100 million randomly generated elements, and it should be noted that none of them belong to the set of inserted elements. In other words, a successful query means an error. The same experiments are also carried out with several filters with , 12, and 16. The average drawing results of multiple tests are shown in Fig. 4.
[See PDF for image]
Fig. 4
Effect of fingerprint length and occupancy on false positive rate
The comparison between CF and BF shows that when the occupancy rate is lower than 80, the false positive rate of CF is always higher than BF, and the curve of BF decreases more as the occupancy rate decreases. With increasing fingerprint length and occupancy, for example, when and occupancy is about 90, CF begins to outperform BF. This proves once again that it is difficult for CF to fully take advantage of its theoretical advantages in practical use, because CF does not always run at a high occupancy rate of more than 90, which is a stringent requirement.
CBCF always has a lower false positive rate than CF due to the design of two types of storage buckets. When the occupancy rate is below 70, it almost contains only buckets with three fingerprints, and as the occupancy rate increases, the percentage of buckets storing four fingerprints increases rapidly, making the curve start to converge upward to CF, while it may outperform BF in a larger range of high occupancy rates compared to CF.
FFCF uses longer fingerprints at low occupancy, which reduces the false positive rate by several orders of magnitude compared to CF, and becomes more pronounced with increasing fingerprint length; at 30 occupancy, the error rate is only 1/10, 1/1000 and 1/100,000 of CF respectively. It is even lower than BF at low occupancy, which cannot be achieved by CF and CBCF. As the occupancy rate increases, short fingerprints appear in the bucket. According to our previous theoretical analysis, short fingerprints contribute greatly to the error rate, which leads to a sharp increase in the false positive rate curve from 40 to 60 of the occupancy. In addition, since the insertion of data does not ideally guarantee the balance of each bucket, very few buckets store more than two fingerprints even at 40 occupancy, implying that at least two of them are short, and these short fingerprints cause a slight deviation of the curve from the theoretical analysis value. After the occupancy rate exceeds 60, the false positive rate of FFCF is higher than that of CBCF because there are more short fingerprints in FFCF. Of course, it still maintains a certain advantage over CF and remains so until 95 occupancy, which is the benefit of more efficient space utilization. In addition, FFCF and FFCF_NC have almost no difference in this group of random trials with large amount of data, which does not mean that the error correction function of FFCF is meaningless. The inability of error correction to bring significant benefits in a completely random query sequence precisely supports our theoretical analysis.
To test the effect of error correction, we use the same data to conduct the query experiment again. The difference is that the query for each element will be repeated ten times and one hundred times consecutively, and the results are shown in Figs. 5 and 6, respectively.
[See PDF for image]
Fig. 5
Effect of fingerprint length and occupancy on false positive rate (10 consecutive queries)
[See PDF for image]
Fig. 6
Effect of fingerprint length and occupancy on false positive rate (100 consecutive queries)
CF, BF, and CBCF do not have error correction capability, and the results of multiple consecutive queries on the same element do not change, so their false positive rates are not affected, and the same is true for FFCF_NC without error correction enabled.
FFCF shows its uniqueness at this time, most false positives caused by wrong matching to short fingerprints can be eliminated after correction. In the case of 95 high occupancy rate, the false positives rate of FFCF is only about 1/10 and 1/100 of FFCF_NC, which shows that the error correction ability of FFCF is indeed very powerful, and the more repeated queries, the better the effect reflected in the false positive rate. However, it seems that FFCF still does not play a role at a low occupancy of 30, because error correction cannot be performed when there is no short fingerprint stored in FFCF. At this time, all false positive errors come from long fingerprints, although this probability is already quite small.
With the increase of fingerprint length f, the separation degree of FFCF_NC and FFCF curves becomes larger at low occupancy, which is most obvious at 40 occupancy. This is because short fingerprints have appeared in some buckets in FFCF, and although the number is small, the impact on the error rate is much greater than long fingerprints. As previously analyzed, the probability that an item randomly matches a short fingerprint is , while the probability of matching a long fingerprint is . Therefore, the larger f, the larger proportion of false positives caused by the same number of short fingerprints.
In the query of repeated elements, FFCF has fully demonstrated the benefits of error correction and can obtain a false positive rate much lower than other filters. This comes at the cost of two extra flag bits in each bucket and the additional time overhead of error correction operations. In the following, we test the time for negative queries with different occupancy rates. We still select the previous test data set with the elements not stored in the filter, the number of queries is set to 100 million. The result obtained after several tests is shown in Fig. 7.
[See PDF for image]
Fig. 7
Time of 100 million negative queries at different occupancy rates
BF requires the least query time due to the fact that it only needs to find a “0" among several bits representing the element being checked in a negative query to return the result immediately, while it takes longer to find the first “0" bit as the number of bits set to “1" in BF increases with occupancy, so its query time also increases with occupancy.
CF, on the other hand, has a nearly horizontal query time curve because each negative query requires a traversal to access a total of eight slots in two candidate buckets, which is not correlated with how many fingerprints are actually stored in the buckets, and thus CF’s query time is not affected by the load.
Similarly, FFCF and CBCF also need to traverse all slots of two candidate buckets to complete a negative query, but they need to calculate two types of fingerprints, access the flag bit of the bucket to determine the type of fingerprints, and finally match with fingerprints of the same type, which makes the query take longer than CF. However, their curves are not horizontal because the number of fingerprints read when traversing the candidate buckets is not fixed 8 times. CBCF requires access to only three slots in each bucket on average when the occupancy is less than 75.
FFCF is more flexible and can determine the number of stored fingerprints in each bucket according to the flag bit, which makes the average number of fingerprints per query vary approximately linearly based on occupancy. Thus, when the occupancy rate is 30, FFCF is even slightly better than CF.
Finally, FFCF takes longer than FFCF_NC and increases with occupancy, mainly because the higher the occupancy, the more false positives occur and the longer time is consumed to correct the errors. Overall, FFCF’s query delay increases only slightly, which is perfectly acceptable compared to the improvement in the false positive rate it brings.
Name lookup in practice
In this section, we will verify the effectiveness of FFCF on real network packets and evaluate the accuracy of name lookup in practice. The experiment uses WIDE2020 networkdataset.2 The number of selected packets exceeds 10 million, the source IP and destination IP in a packet are regarded as a name and inserted into the filter as an item. After reaching a specific occupancy rate, all packets not inserted into the filter are queried in random order, and the false positive rate is counted. The results under different configurations of fingerprint length are shown in Fig. 8.
[See PDF for image]
Fig. 8
False positive rate in practice
The change trend of CF is the smoothest, while BF has the largest change with the occupancy rate. BF even has no error in the case of low occupancy, while the false positive rate is sometimes higher than other types of CF. This mutation is also completely understandable because the inserted elements are no longer randomly and uniformly distributed in the sample set space. Although the increase in the number of inserted elements leads to the increase in the occupancy rate, it does not significantly increase the number of “1" bits in BF at low occupancy, so it will hardly bring false positives. When false positives occur, the real data set includes a large number of repeated queries, which also makes the actual false positive rate higher than the theoretical value, leading to the drastic change of BF curve.
CBCF sometimes outperforms FFCF_NC at low occupancy when and for a similar reason, because the insertion elements are not equally distributed in each bucket. The advantage of FFCF_NC at low occupancy will be weakened when there are three or four fingerprints stored in some buckets. The drop in FFCF’s false positive rate compared to FFCF_NC started with the occupancy rate of 30, which is enough to show that some short fingerprints are already stored in FFCF at this time, which is also caused by the imbalance when inserting elements. At the same time, FFCF still effectively corrects the errors and eliminates false positives in this case, and obtains the optimal false positive rate in this part of the actual test.
Deletion performance
Finally, we give a brief evaluation of the deletion performance of FFCF and compare it with several other filters. We perform a deletion operation on the filters inserted to 95 occupancy in Sect. 4.1, remove all previously inserted elements in a random order, and count the deletion time. Since BF does not support deletion, we only experimented with CF, CBCF, and FFCF. To ensure the scientific nature of the results, the average value of 100 repeated experiments has been selected, and the comparison results are shown in Table 1.
Table 1. Deletion performance of filters
Filter | Deletion support | Reliability | Deletion time (ms) |
|---|---|---|---|
BF | No | – | – |
CF | Yes | No | 3.427 |
CBCF | Yes | No | 7.260 |
FFCF | Yes | Yes | 6.497 |
With the exception of BF, several cuckoo filters support deletion, but there are differences in the reliability of deletion. CF and CBCF delete fingerprints based on the result of lookup, which cannot guarantee the correct deletion due to potential false positives. Different from them, FFCF deletes the elements in the main table first, and then remove the fingerprints, ensuring reliable deletion with the help of accurate matching of complete element information, this is also an advantage brought by the exact correspondence between the main table elements and the filter fingerprints.
CF takes the least time emptying registered elements, which is inseparable from its fast lookup speed, as shown in Fig. 7 in Sect. 4.2. FFCF, on the other hand, is more complicated due to the storage of complete elements, requiring simultaneous deletion of elements and fingerprints, resulting in approximately twice the time of CF. At the same time, the experimental results also show that FFCF has a higher deletion efficiency than CBCF, which is mainly because the flag of FFCF represents the state of the bucket more accurately and facilitate lookup.
Discussion and future work
Our research focuses on naming service in distributed networks, we propose the probabilistic data structure FFCF for information storage and fast retrieval in distributed networks, mainly to reduce the number of false positives in actual name lookup. In fact, FFCF can be extended to a wider range of web applications to speed up lookups and saving space. For example, for spam blocking, the blacklist is imported into FFCF to generate fingerprints, and when an email is received, it is first looked up by FFCF and considered as spam when the result is positive. Similarly, FFCF is suitable for retrieval scenarios such as duplicate record elimination, spell checking, computer virus scanning, etc. The only requirement is that the complete data record needs to be stored locally.
In practice, FFCF can be implemented in a variety of flexible configurations depending on requirements. For example, turning off the error correction function during query can improve the search speed. Initializing with more buckets allows FFCF to have more capacity without compromising query efficiency. The FFCF in this paper is configured to contain 4 slots per bucket, more or less slots are also feasible, a minimum of two slots is enough to take advantage of FFCF, more slots will face problems in terms of retrieval efficiency and false positives because each search will face more options.
In future work, we intend to further improve FFCF. On the one hand, we will make it support dynamic expansion at runtime without being limited to initial configuration. On the other hand, the advantage of low false positives in FFCF can be extended to multi-set lookup problems, which will provide new solutions for applications such as access rights control, multi-set content splitting, and merging.
Conclusion
In this paper, we have proposed the Flexible fingerprint cuckoo filter (FFCF) for information storage and name lookup in a distributed network, which is mainly optimized to address the problems of traditional probabilistic structures in practical retrieval scenarios. In our proposed scheme, FFCF enables fast lookup and supports reliable deletion, although it requires storing the complete elements in the main table. FFCF gives three optional types of fingerprints in each storage bucket, we design a set of rules to allow these different types of fingerprints to be stored simultaneously in the same bucket and dynamically adjusted with load, which also facilitates better space utilization. Flexible fingerprints can improve retrieval performance, using longer fingerprints can radically reduce false rate, and two short fingerprints can be switched to avoid false positives in subsequent queries, which is significant in practical applications, we also give theoretical analysis demonstrating this. To prove the efficiency of our proposed work, we have conducted a comprehensive test of FFCF, results show that FFCF has more accurate lookup performance than the traditional cuckoo filter and provides a new direction for name lookup.
Acknowledgements
This work was funded by the Strategic Leadership Project of the Chinese Academy of Sciences: SEANET Technology Standardization Research System Development (Project No. XDC02070100) and the National Key R &D Program of China: Polymorphic Intelligent Network Environment (PINE) for Testing and Demonstrations (Project No. 2020YFB1806402).
Author Contributions
All the authors listed contributed to the concept and design of the manuscript. WL and JY performed the experiment, wrote the manuscript and prepared the figures. JW performed the data analyses with constructive discussions and critically revised the manuscript.
Declarations
Conflict of interest
The authors declare no Conflict of interest.
https://github.com/Hawellian/FFCF.
2http://mawi.wide.ad.jp/mawi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Little, MC; Shrivastava, SK; Speirs, NA. Using bloom filters to speed-up name lookup in distributed systems. Comput. J.; 2002; 45,
2. Koponen, T., Chawla, M., Chun, B., Ermolinskiy, A., Kim, K.H., Shenker, S., Stoica, I.: A data-oriented (and beyond) network architecture. In: Proceedings of the ACM SIGCOMM 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 181–192. ACM, Kyoto (2007). https://doi.org/10.1145/1282380.1282402
3. Zhang, L; Afanasyev, A; Burke, J; Jacobson, V; Claffy, KC; Crowley, P; Papadopoulos, C; Wang, L; Zhang, B. Named data networking. Comput. Commun. Rev.; 2014; 44,
4. Jinlin, W; Gang, C; Jiali, Y; Peng, S. Seanet: architecture and technologies of an on-site, elastic, autonomous network. J. Netw. New Media; 2020; 6, pp. 1-8.
5. Venkataramani, A; Kurose, JF; Raychaudhuri, D; Nagaraja, K; Mao, ZM; Banerjee, S. Mobilityfirst: a mobility-centric and trustworthy internet architecture. Comput. Commun. Rev.; 2014; 44,
6. Dharmapurikar, S; Krishnamurthy, P; Taylor, DE. Longest prefix matching using bloom filters. IEEE ACM Trans. Netw.; 2006; 14,
7. Abdulhassan, AA; Ahmadi, M. Many-field packet classification using CR-tree. J. High Speed Netw.; 2020; 26,
8. Kwon, M., Reviriego, P., Pontarelli, S.: A Length-aware cuckoo filter for faster IP lookup. In: IEEE Conference on Computer Communications Workshops, INFOCOM, pp. 1071–1072. IEEE, San Francisco (2016). https://doi.org/10.1109/INFCOMW.2016.7562258
9. Fan, L; Cao, P; Almeida, JM; Broder, AZ. Summary cache: a scalable wide-area web cache sharing protocol. IEEE ACM Trans. Netw.; 2000; 8,
10. Grashöfer, J., Jacob, F., Hartenstein, H.: Towards application of cuckoo filters in network security monitoring. In: 14th International Conference on Network and Service Management. CNSM, pp. 373–377. IEEE Computer Society, Rome (2018)
11. Gao, W; Nguyen, JH; Wu, Y; Hatcher, WG; Yu, W. Routing in large-scale dynamic networks: a bloom filter-based dual-layer scheme. ACM Trans. Internet Technol.; 2020; 20,
12. Broder, AZ; Mitzenmacher, M. Survey: network applications of bloom filters: a survey. Internet Math.; 2003; 1,
13. Dai, H; Lu, J; Wang, Y; Pan, T; Liu, B. BFAST: high-speed and memory-efficient approach for NDN forwarding engine. IEEE ACM Trans. Netw.; 2017; 25,
14. Bloom, BH. Space/time trade-offs in hash coding with allowable errors. Commun. ACM; 1970; 13,
15. Bonomi, F., Mitzenmacher, M., Panigrahy, R., Singh, S., Varghese, G.: An improved construction for counting bloom filters. In: Algorithms—ESA 2006, 14th Annual European Symposium, vol. 4168, pp. 684–695. Springer, Zurich (2006). https://doi.org/10.1007/11841036_61
16. Bender, MA; Farach-Colton, M; Johnson, R; Kraner, R; Kuszmaul, BC; Medjedovic, D; Montes, P; Shetty, P; Spillane, RP; Zadok, E. Don’t thrash: how to cache your hash on flash. Proc. VLDB Endow.; 2012; 5,
17. Rothenberg, CE; Macapuna, CAB; Verdi, FL; Magalhães, MF. The deletable bloom filter: a new member of the bloom family. IEEE Commun. Lett.; 2010; 14,
18. Fan, B., Andersen, D.G., Kaminsky, M., Mitzenmacher, M.: Cuckoo filter: practically better than bloom. In: Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, pp. 75–88. ACM, Sydney (2014). https://doi.org/10.1145/2674005.2674994
19. Pagh, R; Rodler, FF. Cuckoo hashing. J. Algorithms; 2004; 51,
20. Wang, M; Zhou, M; Shi, S; Qian, C. Vacuum filters: more space-efficient and faster replacement for bloom and cuckoo filters. Proc. VLDB Endow.; 2019; 13,
21. Reviriego, P; Martínez, JA; Larrabeiti, D; Pontarelli, S. Cuckoo filters and bloom filters: comparison and application to packet classification. IEEE Trans. Netw. Serv. Manag.; 2020; 17,
22. Mitzenmacher, M; Pontarelli, S; Reviriego, P. Adaptive cuckoo filters. ACM J. Exp. Algorithmics; 2020; 25, pp. 1-20.4102854 [DOI: https://dx.doi.org/10.1145/3339504]
23. Quan, W; Xu, C; Guan, J; Zhang, H; Grieco, LA. Scalable name lookup with adaptive prefix bloom filter for named data networking. IEEE Commun. Lett.; 2014; 18,
24. Quan, W., Xu, C., Vasilakos, A.V., Guan, J., Zhang, H., Grieco, L.A.: TB2F: tree-bitmap and Bloom-filter for a scalable and efficient name lookup in content-centric networking. In: 2014 IFIP Networking Conference, pp. 1–9. IEEE Computer Society, Trondheim (2014). https://doi.org/10.1109/IFIPNetworking.2014.6857122
25. Wang, Q., Wu, Q., Zhang, M., Zheng, R., Zhu, J.: Learned Bloom-filter for an efficient name lookup in information-centric networking. In: 2019 IEEE Wireless Communications and Networking Conference, pp. 1–6. IEEE, Marrakesh (2019). https://doi.org/10.1109/WCNC.2019.8885886
26. Sutskever, I; Vinyals, O; Le, QV. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems; 2014; Montreal, MIT Press: pp. 3104-3112.
27. Erlingsson, U., Manasse, M., McSherry, F.: A cool and practical alternative to traditional hash tables. In: Proc. 7th Workshop on Distributed Data and Structures (WDAS’06), pp. 1–6. ACM, Santa Clara (2006)
28. Eppstein, D.: Cuckoo filter: simplification and analysis. In: 15th Scandinavian Symposium and Workshops on Algorithm Theory, vol. 53, pp. 8–1812. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Reykjavik (2016). https://doi.org/10.4230/LIPIcs.SWAT.2016.8
29. Chen, H., Liao, L., Jin, H., Wu, J.: The dynamic cuckoo filter. In: 25th IEEE International Conference on Network Protocols, ICNP, pp. 1–10. IEEE Computer Society, Toronto (2017). https://doi.org/10.1109/ICNP.2017.8117563
30. Breslow, AD; Jayasena, N. Morton filters: fast, compressed sparse cuckoo filters. VLDB J.; 2020; 29,
31. Luo, L; Guo, D; Zhao, Y; Rottenstreich, O; Ma, RTB; Luo, X. MCFsyn: a multi-party set reconciliation protocol with the marked cuckoo filter. IEEE Trans. Parallel Distrib. Syst.; 2021; 32,
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.