Changing data types


#1

I spend a lot of my time dealing with data in one format that I need to convert into another (e.g. converting a list of maps into a map of maps) so that I can do something with it. A topic on changing between different data formats would be really useful.
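To make the list-of-maps to map-of-maps case concrete, here is a minimal sketch. The `people` data and the `index-by` helper are made up for illustration (`index-by` is not a core function), and it assumes each map has a unique value under the chosen key.

```clojure
;; Hypothetical sample data: a list of maps, each with a unique :id.
(def people
  [{:id 1 :name "Ann"}
   {:id 2 :name "Bob"}])

(defn index-by
  "Turns a collection of maps into a map of maps keyed by (key-fn m).
  If two maps share a key, the later one wins."
  [key-fn maps]
  (into {} (map (juxt key-fn identity)) maps))

(index-by :id people)
;; => {1 {:id 1, :name "Ann"}, 2 {:id 2, :name "Bob"}}
```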

Please vote. Comments are encouraged!

  • :thumbsup: Yes, please teach this!
  • :thumbsdown: No, I’m not interested.



#2

Great topic! Data transformation is huge! So important and so many things to know.

I have plenty to say but I want to make sure I’m hitting the things you’re having trouble with. It would help me if you had a couple of example problems you’re dealing with. Otherwise, I’ll just talk about what I think is important.

So, if you (@draven72) or anyone else has some specific things you want out of this topic, please reply!


#3

Hi Eric,

Here are some examples of data structures I’ve been dealing with recently from the Google Analytics and Facebook APIs:

Google Analytics

{:total-results 5,
 :columns [{"columnType" "DIMENSION", "dataType" "STRING", "name" "ga:SourceMedium"}
           {"columnType" "METRIC", "dataType" "INTEGER", "name" "ga:sessions"}],
 :sampled? false,
 :records (({:name "ga:SourceMedium", :column-type "DIMENSION", :value "(direct) / (none)"}
            {:name "ga:sessions", :column-type "METRIC", :value 2})
           ({:name "ga:SourceMedium", :column-type "DIMENSION", :value "ask / organic"}
            {:name "ga:sessions", :column-type "METRIC", :value 1})
           ({:name "ga:SourceMedium", :column-type "DIMENSION", :value "bing / organic"}
            {:name "ga:sessions", :column-type "METRIC", :value 1})
           ({:name "ga:SourceMedium", :column-type "DIMENSION", :value "google / organic"}
            {:name "ga:sessions", :column-type "METRIC", :value 19})
           ({:name "ga:SourceMedium", :column-type "DIMENSION", :value "yahoo / organic"}
            {:name "ga:sessions", :column-type "METRIC", :value 2}))}

Facebook

({:name "post_story_adds_unique", :values [{:value 3}], :id "12345_67890/insights/post_story_adds_unique/lifetime"}
 {:name "post_story_adds", :values [{:value 3}], :id "12345_67890/insights/post_story_adds/lifetime"}
 {:name "post_impressions_by_paid_non_paid", :values [{:value {:total 2624, :unpaid 2624, :paid 0}}], :id "12345_67890/insights/post_impressions_by_paid_non_paid/lifetime"}
 {:name "post_video_length", :values [{:value {}}], :id "12345_67890/insights/post_video_length/lifetime"}
 {:name "post_video_avg_time_watched", :values [{:value 0}], :id "12345_67890/insights/post_video_avg_time_watched/lifetime"}
 {:name "post_consumptions_unique", :values [{:value 29}], :id "12345_67890/insights/post_consumptions_unique/lifetime"}
 {:name "post_consumptions_by_type", :values [{:value {:other clicks 6, :link clicks 27}}], :id "12345_67890/insights/post_consumptions_by_type/lifetime"}
 {:name "post_negative_feedback_unique", :values [{:value 2}], :id "12345_67890/insights/post_negative_feedback_unique/lifetime"})

The Facebook response is really unpleasant to work with as they include spaces in some of their keys, and if a key hasn’t had any clicks it’s returned as an empty map or zero!

What I’ve been working on is pulling the data out into a flat format so that it can be exported as a CSV and then imported into a database. Usually I try to get the data into a map of maps, and then map juxt over the data. For large amounts of data I use Spark, so I then split the data and save it as a text file. For smaller data sets I’m using clojure.data.csv to save the file as a CSV.
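As a rough sketch of that flattening step, using two records in the shape of the Google Analytics response above: each record is collapsed into a map of column name to value, and juxt then pulls the values out in a fixed column order. The names `records`, `columns`, `record->map` and `records->rows` are made up for this example, and the file name in the comment block is hypothetical.

```clojure
;; Two records copied from the Google Analytics :records shown above.
(def records
  [[{:name "ga:SourceMedium" :column-type "DIMENSION" :value "(direct) / (none)"}
    {:name "ga:sessions" :column-type "METRIC" :value 2}]
   [{:name "ga:SourceMedium" :column-type "DIMENSION" :value "google / organic"}
    {:name "ga:sessions" :column-type "METRIC" :value 19}]])

;; Column order for the flat output.
(def columns ["ga:SourceMedium" "ga:sessions"])

(defn record->map
  "Collapses one record (a seq of {:name .. :value ..} cells) into {name value}."
  [record]
  (into {} (map (juxt :name :value)) record))

(defn records->rows
  "Returns a header row followed by one vector per record, in `columns` order."
  [columns records]
  (let [row-fn (apply juxt (map (fn [c] #(get % c)) columns))]
    (cons columns (map (comp row-fn record->map) records))))

(records->rows columns records)
;; => (["ga:SourceMedium" "ga:sessions"]
;;     ["(direct) / (none)" 2]
;;     ["google / organic" 19])

(comment
  ;; Writing the rows out needs org.clojure/data.csv on the classpath;
  ;; "sessions.csv" is a made-up file name.
  (require '[clojure.data.csv :as csv] '[clojure.java.io :as io])
  (with-open [w (io/writer "sessions.csv")]
    (csv/write-csv w (records->rows columns records))))
```

Keeping the column order in one place (`columns`) means the header and every data row stay aligned even if a source returns its cells in a different order.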

Regards

Ben


#4

Hi @draven72,

This is a really good question and I haven’t had a chance to answer it to my satisfaction yet. I thought I would but I haven’t.

One pattern I use when I’m cleaning up other people’s data is to make an explicit cleanup function that just normalizes everything. I just keep adding to it until everything is clean.

(defn clean-row [row]
  (-> row
      remove-spaces-in-keys
      convert-0-to-empty-map
      ...))

Then you can map it over all of your rows and only deal with the cleaned up rows.
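Here is one way the pattern could be filled in for the Facebook rows. This is only a sketch: the bodies of `remove-spaces-in-keys` and `convert-0-to-empty-map` are guesses at what those steps might do (spaces in keyword keys become hyphens, and a `:value` of 0 inside `:values` becomes an empty map), and it assumes every row has a `:values` vector. The sample rows are shortened versions of the ones posted above.

```clojure
(require '[clojure.string :as str]
         '[clojure.walk :as walk])

(defn remove-spaces-in-keys
  "Assumed behaviour: replaces spaces with hyphens in every keyword key,
  at any nesting depth. Namespaces on keywords are not preserved."
  [row]
  (walk/postwalk
    (fn [x]
      (if (map? x)
        (into {}
              (map (fn [[k v]]
                     [(if (keyword? k)
                        (keyword (str/replace (name k) " " "-"))
                        k)
                      v]))
              x)
        x))
    row))

(defn convert-0-to-empty-map
  "Assumed behaviour: inside :values, a :value of 0 becomes {} so that
  'no data' always has the same shape. Assumes the row has :values."
  [row]
  (update row :values
          (fn [vs]
            (mapv #(if (= 0 (:value %)) (assoc % :value {}) %) vs))))

(defn clean-row [row]
  (-> row
      remove-spaces-in-keys
      convert-0-to-empty-map))

;; Shortened sample rows; (keyword "other clicks") builds a key with a space.
(def raw-rows
  [{:name "post_consumptions_by_type"
    :values [{:value {(keyword "other clicks") 6
                      (keyword "link clicks") 27}}]}
   {:name "post_video_avg_time_watched"
    :values [{:value 0}]}])

(mapv clean-row raw-rows)
;; => [{:name "post_consumptions_by_type",
;;      :values [{:value {:other-clicks 6, :link-clicks 27}}]}
;;     {:name "post_video_avg_time_watched",
;;      :values [{:value {}}]}]
```

Whether 0 should become an empty map for every metric depends on the data; a metric where 0 is a real measurement would need its own rule, which is easy to add as another step in `clean-row`.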


#5

Thanks Eric, that’s a good idea. I’m working on building a data pipeline for various sources, so this will come in useful. I’ll also be revisiting the Protocol videos, as I’ve already seen that each source appears to use a different date format!