javatools.parsers
Class BloomFilter<E>

java.lang.Object
  extended by javatools.parsers.BloomFilter<E>
Type Parameters:
E - Object type that is to be inserted into the Bloom filter, e.g. String or Integer.
All Implemented Interfaces:
java.io.Serializable

public class BloomFilter<E>
extends java.lang.Object
implements java.io.Serializable

Implementation of a Bloom-filter, as described here: http://en.wikipedia.org/wiki/Bloom_filter Inspired by the SimpleBloomFilter-class written by Ian Clark. This implementation provides a more evenly distributed Hash-function by using a proper digest instead of the Java RNG. Many of the changes were proposed in comments in his blog: http://blog.locut.us/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/

Author:
Magnus Skjegstad
See Also:
Serialized Form

Constructor Summary
BloomFilter(int bitSetSize, int expectedNumberOfFilterElements)
          Constructs an empty Bloom filter.
 
Method Summary
 void add(E element)
          Adds an object to the Bloom filter.
 void addAll(java.util.Collection<? extends E> c)
          Adds all elements from a Collection to the Bloom filter.
 void clear()
          Sets all bits to false in the Bloom filter.
 boolean contains(E element)
          Returns true if the element could have been inserted into the Bloom filter.
 boolean containsAll(java.util.Collection<? extends E> c)
          Returns true if all the elements of a Collection could have been inserted into the Bloom filter.
 int count()
          Returns the number of elements added to the Bloom filter after it was constructed or after clear() was called.
static long createHash(byte[] data)
          Generates a digest based on the contents of an array of bytes.
static long createHash(java.lang.String val)
          Generates a digest based on the contents of a String.
static long createHash(java.lang.String val, java.lang.String charset)
          Generates a digest based on the contents of a String.
 boolean equals(java.lang.Object obj)
          Compares the contents of two instances to see if they are equal.
 double expectedFalsePositiveProbability()
          Calculates the expected probability of false positives based on the number of expected filter elements and the size of the Bloom filter.
 boolean getBit(int bit)
          Read a single bit from the Bloom filter.
 double getFalsePositiveProbability()
          Get the current probability of a false positive.
 double getFalsePositiveProbability(double numberOfElements)
          Calculate the probability of a false positive given the specified number of inserted elements.
 int getK()
          Returns the value chosen for K.

K is the optimal number of hash functions based on the size of the Bloom filter and the expected number of inserted elements.
 int hashCode()
          Calculates a hash code for this class.
 void setBit(int bit, boolean value)
          Set a single bit in the Bloom filter.
 int size()
          Returns the number of bits in the Bloom filter.
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BloomFilter

public BloomFilter(int bitSetSize,
                   int expectedNumberOfFilterElements)
Constructs an empty Bloom filter.

Parameters:
bitSetSize - defines how many bits should be used for the filter.
expectedNumberOfFilterElements - defines the maximum number of elements the filter is expected to contain.
Method Detail

createHash

public static long createHash(java.lang.String val,
                              java.lang.String charset)
                       throws java.io.UnsupportedEncodingException
Generates a digest based on the contents of a String.

Parameters:
val - specifies the input data.
charset - specifies the encoding of the input data.
Returns:
digest as long.
Throws:
java.io.UnsupportedEncodingException - if the charset is unsupported.

createHash

public static long createHash(java.lang.String val)
                       throws java.io.UnsupportedEncodingException
Generates a digest based on the contents of a String.

Parameters:
val - specifies the input data. The encoding is expected to be UTF-8.
Returns:
digest as long.
Throws:
java.io.UnsupportedEncodingException - if UTF-8 is not supported.

createHash

public static long createHash(byte[] data)
Generates a digest based on the contents of an array of bytes.

Parameters:
data - specifies input data.
Returns:
digest as long.

equals

public boolean equals(java.lang.Object obj)
Compares the contents of two instances to see if they are equal.

Overrides:
equals in class java.lang.Object
Parameters:
obj - is the object to compare to.
Returns:
True if the contents of the objects are equal.

hashCode

public int hashCode()
Calculates a hash code for this class.

Overrides:
hashCode in class java.lang.Object
Returns:
hash code representing the contents of an instance of this class.

expectedFalsePositiveProbability

public double expectedFalsePositiveProbability()
Calculates the expected probability of false positives based on the number of expected filter elements and the size of the Bloom filter.

The value returned by this method is the expected rate of false positives, assuming the number of inserted elements equals the number of expected elements. If the number of elements in the Bloom filter is less than the expected value, the true probability of false positives will be lower.

Returns:
expected probability of false positives.

getFalsePositiveProbability

public double getFalsePositiveProbability(double numberOfElements)
Calculate the probability of a false positive given the specified number of inserted elements.

Parameters:
numberOfElements - number of inserted elements.
Returns:
probability of a false positive.

getFalsePositiveProbability

public double getFalsePositiveProbability()
Get the current probability of a false positive. The probability is calculated from the size of the Bloom filter and the current number of elements added to it.

Returns:
probability of false positives.

getK

public int getK()
Returns the value chosen for K.

K is the optimal number of hash functions based on the size of the Bloom filter and the expected number of inserted elements.

Returns:
optimal k.

clear

public void clear()
Sets all bits to false in the Bloom filter.


add

public void add(E element)
         throws java.io.UnsupportedEncodingException
Adds an object to the Bloom filter. The output from the object's toString() method is used as input to the hash functions.

Parameters:
element - is an element to register in the Bloom filter.
Throws:
java.io.UnsupportedEncodingException - if UTF-8 is unsupported.

addAll

public void addAll(java.util.Collection<? extends E> c)
            throws java.io.UnsupportedEncodingException
Adds all elements from a Collection to the Bloom filter.

Parameters:
c - Collection of elements.
Throws:
java.io.UnsupportedEncodingException - if UTF-8 is unsupported.

contains

public boolean contains(E element)
                 throws java.io.UnsupportedEncodingException
Returns true if the element could have been inserted into the Bloom filter. Use getFalsePositiveProbability() to calculate the probability of this being correct.

Parameters:
element - element to check.
Returns:
true if the element could have been inserted into the Bloom filter.
Throws:
java.io.UnsupportedEncodingException - if UTF-8 is unsupported.

containsAll

public boolean containsAll(java.util.Collection<? extends E> c)
                    throws java.io.UnsupportedEncodingException
Returns true if all the elements of a Collection could have been inserted into the Bloom filter. Use getFalsePositiveProbability() to calculate the probability of this being correct.

Parameters:
c - elements to check.
Returns:
true if all the elements in c could have been inserted into the Bloom filter.
Throws:
java.io.UnsupportedEncodingException - if UTF-8 is unsupported.

getBit

public boolean getBit(int bit)
Read a single bit from the Bloom filter.

Parameters:
bit - the bit to read.
Returns:
true if the bit is set, false if it is not.

setBit

public void setBit(int bit,
                   boolean value)
Set a single bit in the Bloom filter.

Parameters:
bit - is the bit to set.
value - If true, the bit is set. If false, the bit is cleared.

size

public int size()
Returns the number of bits in the Bloom filter. Use count() to retrieve the number of inserted elements.

Returns:
the size of the bitset used by the Bloom filter.

count

public int count()
Returns the number of elements added to the Bloom filter after it was constructed or after clear() was called.

Returns:
number of elements added to the Bloom filter.