
Sure, Huffman was actually my first tought. But I couldn't think of a pratical display for the result of Huffman encoding that could be easily followed by a human looking at the screen. Since it's an optimal code, letters would not be grouped in alphabetical order.
There is a compromise. There is such a thing as an ORDERED Huffman code. Consider a set of strings. If they call begin with the same first letter, assume that letter and consider the suffixes instead. Otherwise, choose a letter L such that as close as possible to half of the strings begin with a letter preceding L in the alphabet as close as possible to half of the strings begin with the letter L or a later letter.
I believe that's what I've done. I use this file to give weight to letters: http://bitbucket.org/mauricio/bitspeak/src/tip/src/Corpora.hs After each letter is selected I keep only sufixes after that letter, and then I use 'splitData' to find the point where I'll separate both halves: http://bitbucket.org/mauricio/bitspeak/src/tip/src/Main.hs Best, Maurício