I'm trying to convert a big HashMap<K, V> to Vec<(K, V)>. The usual way of doing it looks like this:
// initialize HashMap
let cap = 50000000;
let mut hm: HashMap<usize, usize> = HashMap::new();
for i in 0..cap {
hm.insert(i, i);
}
// convert HashMap to Vec
let vec = hm.into_iter().collect::<Vec<(usize, usize)>>();
This code does not work well if HashMap is big enough - in the beginning of the call to collect(), the original HashMap will still be in memory and Vec will be allocated with the capacity of lower size hint taken from the Iterator. This causes an out of memory panic for really big HashMaps, even though I should be able to convert between these two types with very little additional memory overhead. So far I have came up with the following solution:
// create small vector
let mut vec: Vec<(usize, usize)> = Vec::with_capacity(100);
for i in hm.into_iter() {
vec.push(i);
// reserve few megabytes
if vec.capacity() - vec.len() < 10 {
vec.reserve_exact(1000000);
}
}
Is there any better (more efficient or more idiomatic) approach to this problem? I am willing to use unsafe code if it were to improve performance.
Edit
As pointed out into_iter does not deallocate during iteration, so the proposed solution does not work as intended. Is there any other way of converting these collections apart from dumping HashMap to the file and then reading that file into Vec?
Allocating the exact amount needed up front is the memory- and time-efficient solution.
Say you want to create a vector with 100 items. If you were to allocate space for 50 items, when you go to add item 51, two possibilities exist:
It's not possible to know which case will happen, so you have to assume the worst.
This is one of the reasons that Iterator has the size_hint method: knowing how many items to allocate for is more efficient.
On the flip side, the HashMap likely stores the data in one big allocation as it's more efficient. This means that it's not possible (or maybe not easy / effective) to move one item out and then decrease the allocation. Even if you could do this, at the beginning of the copy you'd have both the entire HashMap and Vec allocated.
There are two possibilities I can think of that could improve the situation:
HashMap stores the data internally in a Vec, then potentially a method could be added to HashMap that returns that Vec after some last-minute sanitization.HashMap and/or Vec at all. For example, if you need to iterate over the data, you don't need to collect to a Vec first; just iterate over it.It seems that you're not satisfied with Vec's implementation of FromIterator trait. I don't know if it is reasonable to change it in std. However, you can introduce a wrapper for Vec and implement FromIterator as you wish:
#[derive(Debug)]
struct OptimizedVec<T>(Vec<T>);
impl<T> std::iter::FromIterator<T> for OptimizedVec<T> {
#[inline]
fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> OptimizedVec<T> {
let mut vec = Vec::with_capacity(100);
for i in iter {
vec.push(i);
// reserve few megabytes
if vec.capacity() - vec.len() < 10 {
vec.reserve_exact(1000000);
}
}
OptimizedVec(vec)
}
}
//...
let vec: OptimizedVec<_> = hm.into_iter().collect();
The Vec value will be accessible as vec.0.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With