ANN: unicode-properties 3.2.0.0, unicode-names 3.2.0.0

unicode-properties 3.2.0.0, unicode-names 3.2.0.0
These two packages are representations in Haskell of various data in the Unicode 3.2.0 Character Database. Unicode 3.2.0 was the latest version of the Unicode standard at the time I wrote most of the code; later I may move the packages to the latest version (currently 5.1.0).
The unicode-properties package contains functions to determine general category, case, and a wide range of other properties, as well as to do decomposition and case-folding.
The unicode-names package contains just one function, getCharacterName, for getting the name of a character. It's separated out because it's a sufficiently large proportion of the total data.
Both packages use the type "Char" to represent Unicode characters (more pedantically, codepoints). In GHC Char has the range ['\x0'..'\x10FFFF'], matching the Unicode standard. The packages won't work with compilers that restrict Char to a smaller range.
Hackage:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/unicode-propertie...
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/unicode-names
Source for both packages:
http://code.haskell.org/unicode-properties/
Most of the data is auto-generated at build time from files downloadable from the Unicode web-site.
I expect Don will have them both in Arch Linux within the hour.
-- Ashley Yakeley

ashley:
unicode-properties 3.2.0.0, unicode-names 3.2.0.0
These two packages are representations in Haskell of various data in the Unicode 3.2.0 Character Database. Unicode 3.2.0 was the latest version of the Unicode standard at the time I wrote most of the code; later I may move the packages to the latest version (currently 5.1.0).
Arch Linux native packages available,
http://aur.archlinux.org/packages.php?ID=19528
http://aur.archlinux.org/packages.php?ID=19527
Come on Debian, we need you!
-- Don

On 2008 Sep 2, at 0:54, Ashley Yakeley wrote:
Both packages use the type "Char" to represent Unicode characters (more pedantically, codepoints). In GHC Char has the range ['\x0'..'\x10FFFF'], matching the Unicode standard. The packages won't work with compilers that restrict Char to a smaller range.
Hm. Are there any? I thought Unicode was required by H98.
-- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
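A quick way to check this on a particular compiler is to inspect the upper bound of Char. The sketch below uses only the Prelude; the value it compares against is the GHC range quoted above, and a compiler with a narrower Char would be expected to print False on the second line.

```haskell
-- Minimal check of the Char range, using only the Prelude.
main :: IO ()
main = do
  -- Largest code point the compiler's Char can hold, as an Int.
  print (fromEnum (maxBound :: Char))
  -- True when Char covers the full Unicode range ['\x0'..'\x10FFFF'].
  print ((maxBound :: Char) == '\x10FFFF')
```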

On Mon, Sep 01, 2008 at 09:54:38PM -0700, Ashley Yakeley wrote:
These two packages are representations in Haskell of various data in the Unicode 3.2.0 Character Database. Unicode 3.2.0 was the latest version of the Unicode standard at the time I wrote most of the code; later I may move the packages to the latest version (currently 5.1.0).
The unicode-properties package contains functions to determine general category, case, and a wide range of other properties, as well as to do decomposition and case-folding.
The unicode-names package contains just one function, getCharacterName, for getting the name of a character. It's separated out because it's a sufficiently large proportion of the total data.
On a minor point, it would probably be better to avoid prefixing names of constants (e.g. DCVertical). Also, the prefix "get" is usually reserved for functions that have a monadic effect, so names like decomposition :: Char -> Decomposition would be more usual than getDecomposition.
Note that Data.Char already has functions generalCategory, toUpper, toLower and toTitle, which should work on the full range. It should probably have majorClass as well.
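As an illustration of both points, here is a rough sketch of what a majorClass function could look like on top of Data.Char.generalCategory. The MajorClass type and its constructor names are hypothetical, invented for this example; they are not taken from Data.Char or from unicode-properties. The function is pure, so it carries no "get" prefix, and the constructors carry no type-name prefix.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical type for illustration only; not part of Data.Char
-- or unicode-properties.
data MajorClass
  = Letter | Mark | Number | Punctuation | Symbol | Separator | Other
  deriving (Eq, Show)

-- Collapse the general category to its major class (the first letter
-- of the two-letter Unicode category abbreviation).
majorClass :: Char -> MajorClass
majorClass c = case generalCategory c of
  UppercaseLetter      -> Letter
  LowercaseLetter      -> Letter
  TitlecaseLetter      -> Letter
  ModifierLetter       -> Letter
  OtherLetter          -> Letter
  NonSpacingMark       -> Mark
  SpacingCombiningMark -> Mark
  EnclosingMark        -> Mark
  DecimalNumber        -> Number
  LetterNumber         -> Number
  OtherNumber          -> Number
  ConnectorPunctuation -> Punctuation
  DashPunctuation      -> Punctuation
  OpenPunctuation      -> Punctuation
  ClosePunctuation     -> Punctuation
  InitialQuote         -> Punctuation
  FinalQuote           -> Punctuation
  OtherPunctuation     -> Punctuation
  MathSymbol           -> Symbol
  CurrencySymbol       -> Symbol
  ModifierSymbol       -> Symbol
  OtherSymbol          -> Symbol
  Space                -> Separator
  LineSeparator        -> Separator
  ParagraphSeparator   -> Separator
  _                    -> Other  -- Control, Format, Surrogate, PrivateUse, NotAssigned

main :: IO ()
main = mapM_ (print . majorClass) "aA1 ,+"
```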

Ross Paterson wrote:
Note that Data.Char already has functions generalCategory, toUpper, toLower and toTitle, which should work on the full range.
It depends on whether libunicode is installed at the time GHC is built. You _might_ get some version of Unicode, or you might get ISO Latin-1 case folding instead.
-- Ashley Yakeley

On Tue, Sep 02, 2008 at 02:51:45AM -0700, Ashley Yakeley wrote:
Ross Paterson wrote:
Note that Data.Char already has functions generalCategory, toUpper, toLower and toTitle, which should work on the full range.
It depends on whether libunicode is installed at the time GHC is built. You _might_ get some version of Unicode, or you might get ISO Latin-1 case folding instead.
No, it's OS-independent and doesn't use libunicode. The implementation of these four functions in cbits/WCsubst.c is generated from UnicodeData.txt.
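Anyone wanting to confirm this on their own installation could try a few case mappings whose results lie outside ISO Latin-1. This is only a sketch: the name beyondLatin1 is made up for the example, and the four sample characters are an arbitrary selection, not an exhaustive test.

```haskell
import Data.Char (toLower, toTitle, toUpper)

-- Hypothetical helper: True when Data.Char's case mappings handle
-- characters whose mapped form lies outside ISO Latin-1.
beyondLatin1 :: Bool
beyondLatin1 = and
  [ toUpper '\x00FF' == '\x0178'  -- y with diaeresis: uppercase is outside Latin-1
  , toUpper '\x03B1' == '\x0391'  -- Greek alpha
  , toLower '\x0414' == '\x0434'  -- Cyrillic de
  , toTitle '\x01C6' == '\x01C5'  -- dz-with-caron digraph: distinct titlecase form
  ]

main :: IO ()
main = print beyondLatin1
```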

participants (4)
- Ashley Yakeley
- Brandon S. Allbery KF8NH
- Don Stewart
- Ross Paterson