Re: [Haskell-cafe] Data.Map - visiting tree nodes withi a given key range ?

You can roll your own indexKeyRange using the Data.Map.Internal module which exposes the (currently used) Map implementation. Also note that if the list of values in range is to be consumed immediately, you might want to go for a fold-based function: foldlWithRange :: (a -> k -> b -> a) -> (a,a) -> b -> Map k a -> b Olaf

Thanks Olaf, Can you point me to the specific function for key range traversal? I went over the module doc for Data.Map.Internal and Data.Map.Strict.Internal twice, yet still don't get which one supposed to work for me, or I should ignore doc and look at the code instead ? And the values to be scanned in specific key range are going to be consumed by some CPS mutual iterator, so fold can't be used as I see it. Best regards, Compl On 2020/3/15 下午9:37, Olaf Klinke wrote:
You can roll your own indexKeyRange using the Data.Map.Internal module which exposes the (currently used) Map implementation. Also note that if the list of values in range is to be consumed immediately, you might want to go for a fold-based function:
foldlWithRange :: (a -> k -> b -> a) -> (a,a) -> b -> Map k a -> b
Olaf

Dear Compl, there is no such function in the Data.Map.Internal module. You have to decompose the map structure yourself. import Data.Map.Internal -- name clash with Control.Monad when :: (Monoid b) => Bool -> b -> b when t b = if t then b else mempty contains :: Ord k => (k,k) -> k -> Bool contains (lbound,ubound) k = lbound <= k && k <= ubound foldRange :: (Monoid b, Ord k) => (a -> b) -> (k,k) -> Map k a -> b foldRange f range@(lbound,ubound) m = case m of Tip -> mempty (Bin _ k a left right) -> foldLeft <> this <> foldRight where foldLeft = when (lbound < k) (foldRange f range left) this = when (range `contains` k) (f a) foldRight = when (k < ubound) (foldRange f range right) -- verify that only the range is processed
let m = fromList $ zip [1..] [undefined,"bar","baz",undefined] foldRange (\a -> [a]) (2,3) m ["bar","baz"]
Am 15.03.2020 um 18:30 schrieb Compl Yue
: Thanks Olaf,
Can you point me to the specific function for key range traversal? I went over the module doc for Data.Map.Internal and Data.Map.Strict.Internal twice, yet still don't get which one supposed to work for me, or I should ignore doc and look at the code instead ?
And the values to be scanned in specific key range are going to be consumed by some CPS mutual iterator, so fold can't be used as I see it.
Best regards,
Compl
On 2020/3/15 下午9:37, Olaf Klinke wrote:
You can roll your own indexKeyRange using the Data.Map.Internal module which exposes the (currently used) Map implementation. Also note that if the list of values in range is to be consumed immediately, you might want to go for a fold-based function:
foldlWithRange :: (a -> k -> b -> a) -> (a,a) -> b -> Map k a -> b
Olaf

Thanks so much Olaf! I think `foldRange` should already work for my case without concerns. And more importantly this example builds the intuition to get me started in touching Map internals, I was hesitating with fear before :p That said, I have another improvement postponed, which I assumed much tougher. That is fast re-indexing at the time some of a business object's key attributes change. I currently use another HashMap to associate the old IndexKey of an Object with itself, on re-indexing , I just lookup the old IndexKey from the HashMap, and delete the resulted key from the tree Map, before putting the Object with new IndexKey into tree Map. This works but not reasonably efficient. I have an idea in my mind that insertion operation into the tree Map could return a fast-entry-remover function with sufficient internal structure captured, so instead of going the O(log n) deletion by old key (also re-balancing effort to be added), this remover function can be reverse-lookup'ed by Object, then applied to have its rival entry removed from a later version of the tree Map. Yet better to upgrade the remover into a replacer, that inserts an entry with the new IndexKey of the Object in a single pass. Changing a later field on a multi-field index should produce the new IndexKey sufficiently nearer to the old key in tree Map's respect, so I expect amortized complexity to be greatly reduced, especially when the number of indexed objects is very large. A bit context for clarity: an Object value is identified by an embedded Data.Unique field, and has a mutable Control.Concurrent.STM.*TVar* http://localhost:8080/file/home/cyue/.stack/programs/x86_64-linux/ghc-8.6.5/... pointer to its attributes. I'm not aware of existing codebase doing that, maybe I'll be extending Data.Map to do that, but it'll be excellent to hear about your insights about tackling this improvement. I still need to finish PoC of the dababase first, no hurry for performance improvement atm, as long as it's working correctly. Best regards, Compl On 2020/3/16 上午2:42, Olaf Klinke wrote:
Dear Compl,
there is no such function in the Data.Map.Internal module. You have to decompose the map structure yourself.
import Data.Map.Internal
-- name clash with Control.Monad when :: (Monoid b) => Bool -> b -> b when t b = if t then b else mempty
contains :: Ord k => (k,k) -> k -> Bool contains (lbound,ubound) k = lbound <= k && k <= ubound
foldRange :: (Monoid b, Ord k) => (a -> b) -> (k,k) -> Map k a -> b foldRange f range@(lbound,ubound) m = case m of Tip -> mempty (Bin _ k a left right) -> foldLeft <> this <> foldRight where foldLeft = when (lbound < k) (foldRange f range left) this = when (range `contains` k) (f a) foldRight = when (k < ubound) (foldRange f range right)
-- verify that only the range is processed
let m = fromList $ zip [1..] [undefined,"bar","baz",undefined] foldRange (\a -> [a]) (2,3) m ["bar","baz"]
Am 15.03.2020 um 18:30 schrieb Compl Yue
: Thanks Olaf,
Can you point me to the specific function for key range traversal? I went over the module doc for Data.Map.Internal and Data.Map.Strict.Internal twice, yet still don't get which one supposed to work for me, or I should ignore doc and look at the code instead ?
And the values to be scanned in specific key range are going to be consumed by some CPS mutual iterator, so fold can't be used as I see it.
Best regards,
Compl
On 2020/3/15 下午9:37, Olaf Klinke wrote:
You can roll your own indexKeyRange using the Data.Map.Internal module which exposes the (currently used) Map implementation. Also note that if the list of values in range is to be consumed immediately, you might want to go for a fold-based function:
foldlWithRange :: (a -> k -> b -> a) -> (a,a) -> b -> Map k a -> b
Olaf

By the way, there are tools to retrieve a certain range from compressed data, which IMHO is a very cool feature of gzip. https://www.htslib.org/doc/bgzip.html https://www.htslib.org/doc/tabix.html Bioinformaticians use it (among other things) for fast retrieval of genomic annotation from data sets with on the order of 10^9 keys (in case of human genome). Would be nice if someone wrote a Haskell binding. Olaf
Hello,
I have a question regarding 'Data.Map' api, filed an issue https://github.com/haskell/containers/issues/708
And may be I can ask here at the same time?
I'm not sure why|Data.Map|doesn't have a key range based visiting API, I figured out I can do it this way:
|indexKeyRange :: IndexKey -> IndexKey -> Map IndexKey Object -> [(IndexKey, Object)] indexKeyRange !minKey !maxKey = toList . takeWhileAntitone (<= maxKey) . dropWhileAntitone (< minKey) |
But wouldn't it save the computation needed to re-balance the intermediate tree generated ? Or that re-balancing can be optimized out actually ?
I am creating an in-memory graph database, using|Data.Map.Strict.Map|as business object indices with specified object attributes. The typical scenario will be querying a small number of entries by key range, out of possibly all business objects of a certain class globally, so the implementation above would work, but not reasonable by far as it seems.
I think a lazy list returned by mere node visiting (i.e. no new node creation) would satisfy my needs, or I missed something ?
Thanks,
Compl

Maybe off topic, my work environment deals with datasets sized 20~200 Giga Bytes, consisting of small time series arrays mostly, I offload the compression (and dedup) work to ZFS (by deploying a SmartOS storage server managing a dozen of spinning disks with several TB capacity. Many computing nodes run a local FUSE mount viewing those data files over network, as if being part of a virtual large data file in local filesystem, and access (mostly reads, small fraction of writes) the data via mmap. This way, parallel processes run on multi CPU cores of a single computing node share the OS' kernel page for cache of the dataset, a program just assumes random access to the whole dataset as available at somewhere within its virtual address space. Giving just enough physical RAM (in order to prevent thrashing) to the storage server and computing nodes (my env currently have a typical size of 128GB per node), this achieves both simplicity of programming and efficient use of processor/memory/storage resources. This architecture should scale well to datasets of a few TBs. On 2020/3/16 上午3:46, Olaf Klinke wrote:
By the way, there are tools to retrieve a certain range from compressed data, which IMHO is a very cool feature of gzip.
https://www.htslib.org/doc/bgzip.html https://www.htslib.org/doc/tabix.html
Bioinformaticians use it (among other things) for fast retrieval of genomic annotation from data sets with on the order of 10^9 keys (in case of human genome). Would be nice if someone wrote a Haskell binding.
Olaf
Hello,
I have a question regarding 'Data.Map' api, filed an issue https://github.com/haskell/containers/issues/708
And may be I can ask here at the same time?
I'm not sure why|Data.Map|doesn't have a key range based visiting API, I figured out I can do it this way:
|indexKeyRange :: IndexKey -> IndexKey -> Map IndexKey Object -> [(IndexKey, Object)] indexKeyRange !minKey !maxKey = toList . takeWhileAntitone (<= maxKey) . dropWhileAntitone (< minKey) |
But wouldn't it save the computation needed to re-balance the intermediate tree generated ? Or that re-balancing can be optimized out actually ?
I am creating an in-memory graph database, using|Data.Map.Strict.Map|as business object indices with specified object attributes. The typical scenario will be querying a small number of entries by key range, out of possibly all business objects of a certain class globally, so the implementation above would work, but not reasonable by far as it seems.
I think a lazy list returned by mere node visiting (i.e. no new node creation) would satisfy my needs, or I missed something ?
Thanks,
Compl
participants (2)
-
Compl Yue
-
Olaf Klinke