“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.” – @BigDataBorat
On July 1, Treasure Data hosted the latest round of the Silicon Valley Data Engineering meetup, “Pandas: Lessons Learned for Open-Source Data Tools with Chang She”.[youtube https://www.youtube.com/watch?v=uGoThzE6EwA&w=560&h=315]
The talk was given by Chang She, at Cloudera, one of the best known data engineers around and a core contributor to Pandas.
Starting on the premise that data can be “..too small to be big data, but too big to be stupid about it,” we are taken through a Pandas crash course. The talk touched on several areas, including:
- Pandas API
- Data Structures
Initially built for financial quants, the product has expanded to embrace software engineers, data engineers and, generally, a broader base of users. Since the Pandas workflow encompasses core analytics, data munging, ingestion, and publishing/plotting, it’s not hard to imagine that the history of this toolset has had many interesting twists and turns!
One of the interesting problems of the tool is the problem of convenience vs. parameter gravity-well; that is, there appear to be opposing pulls in the community between functions being convenient swiss-army knives vs. keeping them concise and readable.
During the Q & A, we learned many interesting lessons (and heard many interesting stories) about Pandas evolution.
Join us at Silicon Valley Data Engineering!
There are more events coming in the Silicon Valley Data Engineering group! Next up: Fluentd, Docker and All That: Logging Infrastructure 2015 Edition on Thursday, July 23rd!