If your data originates in a DB, read the DB schema and use code-gen or TH to generate your record structures. Before committing to this, please confirm that your Haskell data pipeline can actually handle records with 100+ fields; I have a strange feeling that some library or other is going to break at the 64-field mark.
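Something along these lines, as a rough sketch: genRecord and the example field list are made up, and the part that actually reads the schema out of the DB is elided. The TH API shown is the one in recent template-haskell versions.

    {-# LANGUAGE TemplateHaskell #-}
    module GenRecord where

    import Language.Haskell.TH

    -- Build a record declaration from (fieldName, fieldType) pairs,
    -- which you'd fetch from the DB's information_schema at build
    -- time (that part is elided here).
    genRecord :: String -> [(String, Name)] -> Q [Dec]
    genRecord tyName fields = do
      let mkField (fname, ftype) =
            (mkName fname, Bang NoSourceUnpackedness NoSourceStrictness, ConT ftype)
          con = RecC (mkName tyName) (map mkField fields)
      pure [DataD [] (mkName tyName) [] Nothing [con]
              [DerivClause Nothing [ConT ''Show]]]

    -- In another module:
    --   $(genRecord "Trade" [("tradeId", ''Int), ("price", ''Double)])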

If you don't have access to the underlying DB, read the CSV header and code-gen your data structures. This will still involve a fair amount of boilerplate, because your code-gen script will need to maintain a col-name <-> data-type mapping. See if you can peek at the first row of the data and take an educated guess at each column's data-type from its value. This won't be 100% accurate, but you can get good results by manually specifying only a few data-types instead of all 100+.
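A rough sketch of such a code-gen script (the file name is made up, the CSV splitting is naive so a real script would use cassava, and sanitising header names into valid Haskell identifiers is left out):

    import Data.Char (isDigit)
    import Data.List (intercalate)

    -- Naive splitter; does not handle quoted commas.
    splitOn :: Char -> String -> [String]
    splitOn c s = case break (== c) s of
      (chunk, [])       -> [chunk]
      (chunk, _ : rest) -> chunk : splitOn c rest

    -- Guess a Haskell type from one sample value.
    guessType :: String -> String
    guessType v
      | null v                        = "Maybe String"  -- empty cell, assume nullable
      | all isDigit v                 = "Int"
      | all (`elem` "0123456789.-") v = "Double"
      | otherwise                     = "String"

    main :: IO ()
    main = do
      (header : firstRow : _) <- lines <$> readFile "data.csv"
      let fields = [ name ++ " :: " ++ guessType val
                   | (name, val) <- zip (splitOn ',' header) (splitOn ',' firstRow) ]
      putStrLn $ "data Row = Row\n  { "
              ++ intercalate "\n  , " fields
              ++ "\n  } deriving (Show)"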

-- Saurabh.

On Sun, Oct 1, 2017 at 4:38 PM, Leandro Ostera <leandro@ostera.io> wrote:
Two things come to mind.

The first one is *Crazy idea, bad pitch*: generate the record code from the data.

The second is to make the records dynamically typed:

Would it be simpler to define a Column type you can parameterize with a string for its name (GADTs?) so you automatically get a type for that specific column?

That way as you read the CSV files you could define the type of the columns based on the actual column name.

Rows would then become sets of pairings of defined columns and values; perhaps a Maybe could encode that a value for a particular column is missing. You could encode these pairings as a list too.

At least that way you get type guarantees that you're joining fields of the same column type. I think.
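Something like this rough sketch, with invented names, using DataKinds to put the column name in the type:

    {-# LANGUAGE DataKinds, GADTs, KindSignatures, ScopedTypeVariables #-}
    import Data.Kind (Type)
    import Data.Proxy (Proxy (..))
    import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

    -- A column's name lives at the type level; Maybe encodes a
    -- missing value for that column in a given row.
    data Column (name :: Symbol) (a :: Type) where
      Column :: Maybe a -> Column name a

    -- Recover the column name at runtime from the type.
    columnName :: forall name a. KnownSymbol name => Column name a -> String
    columnName _ = symbolVal (Proxy :: Proxy name)

    -- Comparing/joining only type-checks when both sides have the
    -- same column name *and* the same value type.
    matchOn :: Eq a => Column name a -> Column name a -> Bool
    matchOn (Column x) (Column y) = x == y

    -- "price" is an invented example column.
    price1, price2 :: Column "price" Double
    price1 = Column (Just 10.5)
    price2 = Column Nothing

    -- matchOn price1 (Column (Just 3) :: Column "qty" Int)  -- rejected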

Either way, my 2 cents and keep it up!


sön 1 okt. 2017 kl. 03:34 skrev Guru Devanla <gurudev.devanla@gmail.com>:
Hello All,

I am in the process of replicating some code in Python in Haskell.

In Python, I load a couple of CSV files, each having more than 100 columns, into Pandas data frames. A Pandas data frame, in short, is a tabular structure that lets me perform a bunch of joins and filter data. I generate different shapes of reports using these operations. Of course, I would love some type checking to help me with these merge/join operations as I create different reports.
 
I am not looking to replicate the Pandas data-frame functionality in Haskell. The first thing I want to do is reach for the 'record' data structure. Here are some ideas I have:

1.  I need to declare all these 100+ columns across multiple record structures.
2.  Some of the columns can have NULL/NaN values, so some of the attributes of the record structure would be 'Maybe' values. I could also drop some columns during load and cut down the number of attributes I create per record structure.
3.  Create a dictionary of each record structure which will help me index into them (sketched below).
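For concreteness, a tiny cut of what I have in mind; the field names are invented and only a handful of the 100+ columns are shown:

    import qualified Data.Map.Strict as Map

    data Trade = Trade
      { tradeId  :: Int
      , symbol   :: String
      , price    :: Maybe Double  -- this column can be NULL/NaN in the source
      , quantity :: Maybe Int     -- likewise nullable
      } deriving (Show)

    -- Point 3: a dictionary keyed on the record's id for lookups/joins.
    type TradesById = Map.Map Int Trade

    indexTrades :: [Trade] -> TradesById
    indexTrades ts = Map.fromList [ (tradeId t, t) | t <- ts ]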

I would like some feedback on the first 2 points. It seems like there is a lot of boilerplate code I have to write to create 100s of record attributes. Is this the only sane way to do this? What other patterns should I consider while solving such a problem?

Also, I do not want to add too many dependencies into the project, but open to suggestions.

Any input/advice on this would be very helpful.

Thank you for the time!
Guru
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

--
http://www.saurabhnanda.com