


Michael Lee Friday, Jan 26, 2018
With your neighbor, discuss and review:
- How do we implement get, put, and remove in a hash table using separate chaining?
- What about in a hash table using open addressing with linear probing?
- Compare and contrast your answers: what do we do the same? What do we do differently?
In both implementations, for all three methods, we start by finding the initial index to consider:

    index = key.hashCode() % array.length
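One detail the one-liner glosses over: in Java, hashCode() may return a negative int, and % keeps the sign of its left operand, so the raw expression can yield a negative index. A minimal sketch of a safe version follows; the class HashIndex and the helper indexFor are hypothetical names used only for illustration.

```java
final class HashIndex {
    // Hypothetical helper: maps any key to a valid slot in [0, capacity).
    static int indexFor(Object key, int capacity) {
        // Math.floorMod returns a non-negative result for a positive modulus,
        // unlike %, which is negative whenever hashCode() is negative
        // and not a multiple of capacity.
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(-7, 10)); // Integer.valueOf(-7).hashCode() is -7; prints 3
        System.out.println(-7 % 10);          // prints -7
    }
}
```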
If we’re using separate chaining, we then search/insert/delete from the bucket:

    IDictionary<K, V> bucket = array[index]
    bucket.get(key) // or .put(...) or .remove(...)

...and resize when λ ≈ 1. (When exactly to resize is a tunable parameter.)
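A rough sketch of what separate chaining might look like end to end. It is not the course's implementation: the buckets here are LinkedLists of entries instead of IDictionary objects, a missing key is reported as null, and the table doubles whenever an insert would push λ above 1. The names ChainedMap, Entry, and bucketFor are made up for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

final class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private List<LinkedList<Entry<K, V>>> buckets = newBuckets(8);
    private int size = 0;

    private static <K, V> List<LinkedList<Entry<K, V>>> newBuckets(int capacity) {
        List<LinkedList<Entry<K, V>>> result = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            result.add(new LinkedList<>());
        }
        return result;
    }

    // All three operations start by locating the bucket for the key.
    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    public V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // assumption: a missing key yields null instead of an exception
    }

    public void put(K key, V value) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        if (size + 1 > buckets.size()) resize(); // keep the load factor at or below 1
        bucketFor(key).add(new Entry<>(key, value));
        size++;
    }

    public boolean remove(K key) {
        Iterator<Entry<K, V>> it = bucketFor(key).iterator();
        while (it.hasNext()) {
            if (it.next().key.equals(key)) { it.remove(); size--; return true; }
        }
        return false;
    }

    // Doubles the number of buckets and re-inserts every entry at its new index.
    private void resize() {
        List<LinkedList<Entry<K, V>>> old = buckets;
        buckets = newBuckets(old.size() * 2);
        for (LinkedList<Entry<K, V>> bucket : old) {
            for (Entry<K, V> e : bucket) bucketFor(e.key).add(e);
        }
    }

    public int size() { return size; }
}
```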
If we’re using linear probing, search until we find an array element whose key is equal to ours or until the array element is null:

    while (array[index] != null
            && (array[index].hashcode != key.hashCode() || !array[index].key.equals(key))) {
        index = (index + 1) % this.array.length
    }
    if (array[index] == null)
        // throw exception if implementing get
        // add new key-value pair if implementing put
    else
        // return or set array[index]

Comparing the stored hash codes first is a cheap filter: equals only runs when the hash codes match. The loop is guaranteed to stop only if the table always has at least one empty slot, i.e. λ < 1.
How do we delete? (complicated, see section 04 handouts) When do we resize?
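To make the get/put logic concrete, here is a minimal sketch of an open-addressing map with linear probing. Several things in it are assumptions rather than the course's design: keys and values live in two parallel arrays, the table doubles before λ would exceed 1/2, and remove is left out entirely because deleting correctly needs extra bookkeeping. The names LinearProbingMap and findSlot are hypothetical.

```java
import java.util.NoSuchElementException;

final class LinearProbingMap<K, V> {
    private Object[] keys = new Object[8];
    private Object[] values = new Object[8];
    private int size = 0;

    // Returns the slot holding `key`, or the first empty slot on its probe path.
    // Terminates because the table is never allowed to fill up.
    private int findSlot(Object key) {
        int index = Math.floorMod(key.hashCode(), keys.length);
        while (keys[index] != null && !keys[index].equals(key)) {
            index = (index + 1) % keys.length;
        }
        return index;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int index = findSlot(key);
        if (keys[index] == null) {
            throw new NoSuchElementException("no such key: " + key);
        }
        return (V) values[index];
    }

    public void put(K key, V value) {
        // Assumed policy: grow before the load factor would exceed 1/2.
        if (2 * (size + 1) > keys.length) resize();
        int index = findSlot(key);
        if (keys[index] == null) { // empty slot: this is a new key
            keys[index] = key;
            size++;
        }
        values[index] = value;     // new key or overwrite
    }

    // Doubles the arrays and re-probes every key, since indices depend on the length.
    private void resize() {
        Object[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new Object[oldKeys.length * 2];
        values = new Object[oldValues.length * 2];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                int index = findSlot(oldKeys[i]);
                keys[index] = oldKeys[i];
                values[index] = oldValues[i];
            }
        }
    }

    public int size() { return size; }
}
```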
Strategy: Linear probing
If we collide, keep checking each next element until we find an open slot.
So, h′(k, i) = (h(k) + i) mod T, where T is the table size.

    i = 0
    while (index in use)
        try (hash(key) + i) % array.length
        i += 1
Assume internal capacity of 10, insert the following keys:
38, 19, 8, 109, 10
0 1 2 3 4 5 6 7 8 9
What’s the problem? Lots of keys close together: a “cluster”. We ended up having to probe many slots!
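The exercise can be checked by machine with a small sketch like the one below. It assumes h(k) = k mod capacity, which the notes do not state but which is the usual convention for these exercises, and it assumes non-negative keys and fewer keys than slots. LinearProbingTrace and insertAll are illustrative names.

```java
import java.util.Arrays;

final class LinearProbingTrace {
    // Inserts each key with linear probing and returns the final table (null = empty).
    static Integer[] insertAll(int capacity, int... keys) {
        Integer[] table = new Integer[capacity];
        for (int key : keys) {
            int i = 0;
            // Assumed hash: h(k) = k mod capacity; probe (h(k) + i) mod capacity.
            while (table[(key % capacity + i) % capacity] != null) {
                i++;
            }
            table[(key % capacity + i) % capacity] = key;
            System.out.println(key + " placed after " + (i + 1) + " probe(s)");
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(insertAll(10, 38, 19, 8, 109, 10)));
        // prints [8, 109, 10, null, null, null, null, null, 38, 19]
    }
}
```

Keys 38 and 19 land on their first probe, while 8, 109, and 10 each need three, and the occupied slots 8, 9, 0, 1, 2 form one cluster that wraps around the end of the array.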
Primary clustering
When using linear probing, we sometimes end up with a long chain of occupied slots. This problem is known as “primary clustering”.
It happens when λ is large, or if we get unlucky. In linear probing, we expect to get clusters of size O(lg(n)).
Questions:
- When is performance good? When is it bad?
  Runtime is bad when the table is nearly full. Runtime is also bad when we hit a “cluster”.
- What is the maximum load factor?
  The load factor is at most λ = 1.0!
- When do we resize?
Punchline: clustering can be potentially bad, but in practice, it tends to be ok as long as λ is small.
Question: when do we resize? Usually when λ ≈ 1/2.
Nifty equations:
- Average number of probes for a successful search: $\frac{1}{2}\left(1 + \frac{1}{1 - \lambda}\right)$
- Average number of probes for an unsuccessful search: $\frac{1}{2}\left(1 + \frac{1}{(1 - \lambda)^2}\right)$
*These equations aren’t important to know
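Plugging a few load factors into the standard linear-probing estimates, ½(1 + 1/(1 − λ)) for a successful search and ½(1 + 1/(1 − λ)²) for an unsuccessful one, shows why λ ≈ 1/2 is a popular resize threshold. This is just a calculator sketch; ProbeEstimates and its method names are invented for the example.

```java
final class ProbeEstimates {
    // Estimated probes for a successful search under linear probing.
    static double successful(double lambda) {
        return 0.5 * (1 + 1 / (1 - lambda));
    }

    // Estimated probes for an unsuccessful search (or an insert of a new key).
    static double unsuccessful(double lambda) {
        return 0.5 * (1 + 1 / ((1 - lambda) * (1 - lambda)));
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {0.25, 0.5, 0.75, 0.9}) {
            System.out.printf("lambda=%.2f  hit=%.2f  miss=%.2f%n",
                    lambda, successful(lambda), unsuccessful(lambda));
        }
    }
}
```

At λ = 0.5 the estimates are 1.5 and 2.5 probes. At λ = 0.9 they climb to roughly 5.5 and 50.5, so the cost of a miss grows much faster than the cost of a hit as the table fills.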
Problem: We can still get unlucky, or somebody can feed us a malicious series of inputs that causes severe slowdown.
Can we pick a different collision strategy that minimizes clustering?
Idea: Rather than probing linearly, probe quadratically!
Exercise: assume internal capacity of 10, insert the following:
89, 18, 49, 58, 79
0 1 2 3 4 5 6 7 8 9
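A sketch for checking the quadratic probing exercise. It assumes the common definition h′(k, i) = (h(k) + i²) mod T with h(k) = k mod T; neither formula is spelled out in these notes, so treat both as assumptions. QuadraticProbingTrace and insertAll are illustrative names.

```java
import java.util.Arrays;

final class QuadraticProbingTrace {
    // Inserts each non-negative key with quadratic probing; null marks an empty slot.
    static Integer[] insertAll(int capacity, int... keys) {
        Integer[] table = new Integer[capacity];
        for (int key : keys) {
            boolean placed = false;
            // Quadratic probing may never reach some slots, so cap the attempts:
            // i * i mod capacity repeats after `capacity` steps.
            for (int i = 0; i < capacity && !placed; i++) {
                int index = (key % capacity + i * i) % capacity;
                if (table[index] == null) {
                    table[index] = key;
                    placed = true;
                }
            }
            if (!placed) {
                throw new IllegalStateException("no free slot found for " + key);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(insertAll(10, 89, 18, 49, 58, 79)));
        // prints [49, null, 58, 79, null, null, null, null, 18, 89]
    }
}
```

Under these assumptions 49, 58, and 79 end up in slots 0, 2, and 3 rather than packing into consecutive slots after 89.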
How many different probe sequences are there?
There are T different starting positions and T − 1 different jump intervals (since we can’t jump by 0), so there are O(T^2) different probe sequences.
Result: in practice, double-hashing is very effective and commonly used “in the wild”.
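To illustrate the "starting position plus jump interval" idea, here is a sketch of double hashing for int keys. The probe formula (h(k) + i · g(k)) mod T is the usual textbook form. The specific choices h(k) = k mod T and g(k) = 1 + (k mod (T − 1)), and a prime table size so that every jump interval visits all slots, are assumptions made for this example only.

```java
final class DoubleHashing {
    // Primary hash (assumed): starting position in [0, T).
    static int h(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    // Secondary hash (assumed): jump interval in [1, T - 1], so it is never 0.
    static int g(int key, int tableSize) {
        return 1 + Math.floorMod(key, tableSize - 1);
    }

    // First `count` slots probed for `key`: (h(k) + i * g(k)) mod T for i = 0, 1, ...
    static int[] probeSequence(int key, int tableSize, int count) {
        int[] slots = new int[count];
        for (int i = 0; i < count; i++) {
            slots[i] = (h(key, tableSize) + i * g(key, tableSize)) % tableSize;
        }
        return slots;
    }

    public static void main(String[] args) {
        // Both keys start at slot 1 in a table of size 11, then diverge.
        System.out.println(java.util.Arrays.toString(probeSequence(12, 11, 4))); // [1, 4, 7, 10]
        System.out.println(java.util.Arrays.toString(probeSequence(23, 11, 4))); // [1, 5, 9, 2]
    }
}
```

Keys 12 and 23 collide on their first probe but jump by 3 and 4 respectively, so they do not keep colliding along the same path the way they would under linear or quadratic probing.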
So, what strategy is best? Separate chaining? Open addressing?
No obvious answer: both implementations are common.
Separate chaining:
- Don’t have to worry about clustering
- Potentially more “compact” (λ can be higher)
Open addressing:
- Managing clustering can be tricky
- Less compact (we typically keep λ < 1/2)
- Array lookups tend to be a constant factor faster than traversing pointers
Can we use hash functions for more than just dictionaries?
Yes! Lots of possible applications, ranging from cryptography to biology.
Important: Depending on the application, we might want our hash function to have different properties.
How would you implement the following using hash functions? For each application, also discuss what properties you want your hash function to have.
- Suppose we’re sending a message over the internet. This message might become mildly corrupted. How can we detect if corruption probably occurred?
- Suppose you have many fragments of DNA and want to see where they appear in a (significantly longer) segment of DNA. How can we do this efficiently?
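One possible answer to the first question is a checksum: the sender transmits a short hash alongside the message and the receiver recomputes it. The sketch below uses the JDK's java.util.zip.CRC32; the class ChecksumDemo and its method names are made up here. The property that matters for this application is that small accidental changes almost always change the hash. Resistance to deliberate tampering is not needed, and CRC32 does not provide it.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class ChecksumDemo {
    // Computes the CRC-32 checksum of the whole byte array.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // The receiver recomputes the checksum and compares it with the one
    // that was sent alongside the message.
    static boolean looksIntact(byte[] received, long sentChecksum) {
        return checksum(received) == sentChecksum;
    }

    public static void main(String[] args) {
        byte[] message = "hello, world".getBytes(StandardCharsets.UTF_8);
        long sent = checksum(message);

        byte[] corrupted = message.clone();
        corrupted[3] ^= 0x01; // flip a single bit "in transit"

        System.out.println(looksIntact(message, sent));   // true
        System.out.println(looksIntact(corrupted, sent)); // false
    }
}
```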
Same question as before:
- Suppose you’re designing a video uploading site and want to detect if somebody is uploading a pirated movie. A naive way to do this is to check if the movie is byte-for-byte identical to some movie. How can we do this more efficiently?
- Suppose you’re designing a website with a user login system. Directly storing your users’ passwords is dangerous: what if they get stolen? How can you store passwords in a safe way so that even if they’re stolen, the passwords aren’t compromised?
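For the password question, one common approach is to store only a salted, deliberately slow hash of each password and to re-hash login attempts for comparison. The sketch below uses the JDK's PBKDF2WithHmacSHA256 key factory. The iteration count, salt length, and output length are illustrative assumptions, not recommendations from these notes, and PasswordStore is a hypothetical class name. The properties wanted here are different from a dictionary hash: the function should be hard to invert and expensive to brute-force.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

final class PasswordStore {
    // A fresh random salt per user makes identical passwords hash differently.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Derives a 256-bit hash from the password and salt.
    static byte[] hash(char[] password, byte[] salt) {
        try {
            // Assumed parameters: 100,000 iterations, 256-bit output.
            PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return factory.generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 unavailable", e);
        }
    }

    // Re-hashes the attempt with the stored salt and compares the two hashes.
    static boolean verify(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }

    public static void main(String[] args) {
        byte[] salt = newSalt();
        byte[] stored = hash("correct horse".toCharArray(), salt); // store salt + hash only
        System.out.println(verify("correct horse".toCharArray(), salt, stored)); // true
        System.out.println(verify("wrong guess".toCharArray(), salt, stored));   // false
    }
}
```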
Same question as before:
- You are trying to build an image sharing site. Users upload many images, and you need to assign each image some unique ID. How might you do this?
- Suppose we have a long series of financial transactions stored on some (potentially untrustworthy) computer. Somebody claims they made a specific transaction several months ago. Can you design a system that lets you audit and determine if they’re lying or not? Assume you have access to just the very latest transaction, obtained from a different trustworthy source.
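One way to approach the auditing question is a hash chain: every transaction record is hashed together with the hash of the record before it, so the latest hash depends on the entire history. The sketch below is one possible design, not necessarily the intended answer. It assumes transactions are plain strings and that the trusted "latest transaction" carries the final chain hash. HashChain, headOf, and matchesTrustedHead are hypothetical names. The hash needs to be collision-resistant here, which is why the sketch uses SHA-256.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class HashChain {
    static byte[] sha256(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Link i = SHA-256(link i-1 followed by transaction i), starting from 32 zero bytes.
    static byte[] headOf(List<String> transactions) {
        byte[] link = new byte[32];
        for (String tx : transactions) {
            byte[] body = tx.getBytes(StandardCharsets.UTF_8);
            byte[] combined = Arrays.copyOf(link, link.length + body.length);
            System.arraycopy(body, 0, combined, link.length, body.length);
            link = sha256(combined);
        }
        return link;
    }

    // Audit: the untrusted log is believable only if it re-hashes to the trusted head.
    static boolean matchesTrustedHead(List<String> untrustedLog, byte[] trustedHead) {
        return MessageDigest.isEqual(headOf(untrustedLog), trustedHead);
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("alice->bob:10", "bob->carol:4", "carol->alice:1");
        byte[] trustedHead = headOf(log);

        List<String> forged = Arrays.asList("alice->bob:10", "bob->carol:400", "carol->alice:1");
        System.out.println(matchesTrustedHead(log, trustedHead));    // true
        System.out.println(matchesTrustedHead(forged, trustedHead)); // false
    }
}
```

If the stored log re-hashes to the trusted head, a claimed transaction can be checked simply by looking for it in that log; any edit to an old record changes every later link and is caught.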