Is it safe to index a little bit out of bounds?

Let's say I have a GC-managed byte array of length 19. GHC promises that byte arrays are machine-word-aligned on the front end. That is, on a 64-bit machine, this array starts at a memory address that is evenly divisible by 8. However, the back end will certainly be unaligned. So, these two calls will be fine:
- indexWordArray# myArr# 0#
- indexWordArray# myArr# 1#
But this one is non-deterministic:
- indexWordArray# myArr# 2#
Some of the bytes in the word will have garbage in them. However, this could always be masked out with a bit mask (you have to know the platform endianness for this to work right). Is this safe? I don't think this could ever cause a segfault, but I wanted to check.
-- -Andrew Thaddeus Martin
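The index arithmetic behind the question can be spelled out in a few lines. This is only an illustrative sketch: the helper names (wordBytes, fullWords, tailBytes, readIsFullyInside) are made up for this example and do not come from GHC or any library.

```haskell
import Data.Bits (finiteBitSize)

-- Bytes per machine word on the platform running this code (8 on 64-bit).
wordBytes :: Int
wordBytes = finiteBitSize (0 :: Word) `div` 8

-- How many word indices lie entirely inside an array of n bytes.
fullWords :: Int -> Int
fullWords n = n `div` wordBytes

-- How many bytes of the array fall into the final, partial word (0 if none).
tailBytes :: Int -> Int
tailBytes n = n `mod` wordBytes

-- True when reading word index i touches only bytes that belong to the array.
readIsFullyInside :: Int -> Int -> Bool
readIsFullyInside n i = i >= 0 && i < fullWords n
```

On a 64-bit machine, fullWords 19 is 2 and tailBytes 19 is 3, so word indices 0 and 1 are fully inside the array, while index 2 covers 3 real bytes followed by 5 bytes past the stated length.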

2018-03-08 15:19 GMT+01:00 Andrew Martin:
[...] Some of the bytes in the word will have garbage in them. However, this could always be masked out with a bit mask (you have to know the platform endianness for this to work right). Is this safe? I don't think this could ever cause a segfault, but I wanted to check.
Before doing such things, please make sure that e.g. valgrind or similar tools are happy with such Kung-Fu. I don't know off the top of my head how fine-grained their checks are, but there is various similar code out there in the wild which is a PITA to debug. You might force people to add suppressions or, even worse, make some valuable tools totally useless. This is not something which should be done lightly...

Hi,
On 2018-03-08 at 09:19:29 -0500, Andrew Martin wrote:
Some of the bytes in the word will have garbage in them. However, this could always be masked out with a bit mask (you have to know the platform endianness for this to work right).
Is this safe? I don't think this could ever cause a segfault, but I wanted to check.
Due to historical reasons, this is indeed safe. The underlying `StgArrBytes` structure must be word-aligned in size, otherwise bad things are likely to happen.
I've seen some code in the wild which relies on that, and as a data point, I myself exploit that property in some operations (including the masking and endianness-aware handling you refer to) of 'text-short'[1], which is optimised for UTF-8-based strings (<shameless-plug>and which, besides being a practically useful library having its place in the text/bytearray landscape[2], also serves as an incubation area for optimisation ideas and code, some of which may end up in one way or another in the text-utf8 project[3]</shameless-plug>).
[1]: https://hackage.haskell.org/package/text-short
[2]: https://markkarpov.com/post/short-bs-and-text.html
[3]: https://hackage.haskell.org/text-utf8
-- hvr
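The technique discussed above, reading the final partial word and masking off the bytes past the stated length, might look roughly like the following. This is a hedged sketch and not the code used in text-short: the wrapper type Bytes and the helpers bytesFromList, indexWord, keepFirstBytes and lastWordMasked are hypothetical names, and the read past the stated length relies on the padding property described in this thread.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import Data.Bits (complement, finiteBitSize, shiftL, shiftR, (.&.))
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import GHC.Exts
import GHC.Word (Word8 (W8#))

-- Boxed wrapper so the unlifted ByteArray# can be passed around normally.
data Bytes = Bytes ByteArray#

-- Bytes per machine word on the platform running this code.
wordBytes :: Int
wordBytes = finiteBitSize (0 :: Word) `div` 8

-- Copy a list of bytes into a fresh immutable byte array.
bytesFromList :: [Word8] -> Bytes
bytesFromList bytes = case runRW# (alloc (length bytes)) of
  (# _, arr #) -> Bytes arr
  where
    alloc :: Int -> State# RealWorld -> (# State# RealWorld, ByteArray# #)
    alloc (I# len) s0 = case newByteArray# len s0 of
      (# s1, marr #) -> unsafeFreezeByteArray# marr (fill marr 0# bytes s1)

    fill :: MutableByteArray# s -> Int# -> [Word8] -> State# s -> State# s
    fill _ _ [] s = s
    fill marr i (W8# b : rest) s =
      fill marr (i +# 1#) rest (writeWord8Array# marr i b s)

-- Read the i-th machine word. No bounds check: the caller picks the index.
indexWord :: Bytes -> Int -> Word
indexWord (Bytes arr) (I# i) = W# (indexWordArray# arr i)

-- Mask keeping the first k bytes (in memory order) of a word.
keepFirstBytes :: Int -> Word
keepFirstBytes k = case targetByteOrder of
  LittleEndian -> (1 `shiftL` (8 * k)) - 1
  BigEndian -> complement (maxBound `shiftR` (8 * k))

-- Read the final, partial word of an array with the given byte length and
-- zero out the bytes past the end. Returns 0 when there is no partial word,
-- so no read beyond the padded payload is attempted in that case.
lastWordMasked :: Bytes -> Int -> Word
lastWordMasked arr len
  | k == 0 = 0
  | otherwise = indexWord arr (len `div` wordBytes) .&. keepFirstBytes k
  where
    k = len `mod` wordBytes
```

For the bytes [1 .. 19] on a little-endian 64-bit machine, lastWordMasked (bytesFromList [1 .. 19]) 19 evaluates to 0x131211, whatever the 5 bytes past the stated length happen to contain.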

Thanks Herbert! This is exactly the kind of data point I was looking for.
Good to know.
On Thu, Mar 8, 2018 at 12:42 PM, Herbert Valerio Riedel wrote:
[...]
-- -Andrew Thaddeus Martin

What do you gain from this?
On Mar 8, 2018 9:19 AM, "Andrew Martin" wrote:
[...]
_______________________________________________ Libraries mailing list Libraries@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

If you are looking for ASCII (or non-ASCII) characters in a byte array, you
build a word-sized mask like 0b1000000010000000... However, on the last
word, if you cannot go past the end, you have to go one byte at a time.
But, if you can go past the end, you can mask out the irrelevant bits and
use the same mask as before.
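As a rough illustration of the use case described above, the check on full words and on the final, partial word could be written as follows. This is a sketch under assumptions: the names highBits, wordIsAscii, keepFirstBytes and lastWordIsAscii are invented for the example, and the words are passed in as plain Word values rather than read from a byte array.

```haskell
import Data.Bits (complement, shiftL, shiftR, (.&.))
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

-- 0x80 repeated in every byte of a word: the high bit of each byte.
highBits :: Word
highBits = (maxBound `div` 0xFF) * 0x80

-- A full word holds only ASCII bytes when no byte has its high bit set.
wordIsAscii :: Word -> Bool
wordIsAscii w = w .&. highBits == 0

-- Mask keeping the first k bytes (in memory order) of a word.
keepFirstBytes :: Int -> Word
keepFirstBytes k = case targetByteOrder of
  LittleEndian -> (1 `shiftL` (8 * k)) - 1
  BigEndian -> complement (maxBound `shiftR` (8 * k))

-- Final word: zero the bytes past the end of the array, then reuse the
-- same whole-word check instead of testing one byte at a time.
lastWordIsAscii :: Int -> Word -> Bool
lastWordIsAscii k w = wordIsAscii (w .&. keepFirstBytes k)
```

The point of the design is that the final word goes through the same whole-word test as every other word; only the extra mask differs, and it depends on the platform byte order.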
On Thu, Mar 8, 2018 at 1:35 PM, David Feuer wrote:
What do you gain from this?
On Mar 8, 2018 9:19 AM, "Andrew Martin" wrote:
[...]
-- -Andrew Thaddeus Martin
participants (4)
- Andrew Martin
- David Feuer
- Herbert Valerio Riedel
- Sven Panne