Wrangling Huge .CSVs Made Simple (with a little help from SQL)

Wrangling Huge .CSVs Made Simple (with a little help from SQL)

Let’s say you have a very large .csv (containing more than 80k rows) that you want to analyze and visualize.  Usually, given only Excel and Tableau (or Looker, Chartio, etc.), you would have one of two options:

  1. Put it into Excel and run pivot tables, charts, and whatnot, OR
  2. Ingest the data directly into Tableau and try to manipulate it there.

While both tools are excellent choices for their intended use cases, a large enough load would cause Excel to fall down pretty quickly and would similarly become unmanageable to consume directly into a visualization tool like Tableau.

It isn’t scalable to try to ingest this volume of data into a relational database as an intermediary step, either – especially if your data schema isn’t fixed.   That’s why you need a system like Treasure Data that scales automatically for you and provides SQL schema-onread access with schema flexibility and direct connectivity to Tableau.

Data-Munging

Let’s look at how to do this.

Note:  These instructions assume that you’ve already set up td toolbelt and have a Treasure Data account.

Preview Your .CSV File

First, you should check the contents of your .csv file. By way of example, I’ll use a fairly large .csv (with >88k rows) containing the NASDAQ index for the last 24 hours. (We can handle much, much bigger files as well!)

Get Treasure Data blogs, news, use cases, and platform capabilities.

Thank you for subscribing to our blog!

$ head -n 5 nasdaq.csv

head_nasdaq_csv

Ingest Your .CSV into Treasure Data (Using TD Toolbelt)

You’ll need somewhere to ingest the data. Create a database and a table on Treasure Data.

$ td db:create stocks
$ td table:create stocks nasdaq

create_db_create_table

Then, execute the following commands to upload the .csv file. (NOTE: this will take some time!)

$ td import:auto 
  --format csv --column-header 
  --time-column time 
  --time-format "%Y-%m-%d %H:%M:%S" 
  --auto-create stocks.nasdaq 
  ./nasdaq.csv

upload_csv

Query Your Data into the Format You Want to Visualize

For our visualization, we’ll run a count: using Treasure Data Console, we’ll tally the volume of trades we get by symbol; group the summed volumes by symbol, and sort the symbols in descending order by volume:

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC

Click the GIF to view the animation.

visualize_query_1

LIMIT Your Query

That’s a lot of records to try to visualize (2832 to be exact!) so let’s get the top 20 NASDAQ symbols by volume from our dataset.

SELECT 
  SUM(volume) AS COUNT,
  symbol
FROM
  nasdaq
GROUP BY
  symbol
ORDER BY
  COUNT DESC LIMIT 20

Note that the query result is now limited to the top 20 symbols by volume. (Apple is only #3? Really?)

Export Your Data to Tableau

Now that we have that bit working, let’s export our result to a datasource in Tableau. To do this, we’ll fill in the Result Export dialog for Tableau Server.

NOTE: you’ll need a Tableau Server Online account for this step.

For my purposes, I’m using SSL, and Requiring a valid SSL certificate. ‘Datasource’ in this case refers to the one Tableau gets the data from. I’m calling mine ‘nasdaq_demo_john’; for demos, I nearly always use ‘replace’ mode.

Click the GIF to view the animation.

result_export_tableau

Visualizing Your Data in Tableau

After we log into Tableau Server, we’ll go to home -> data sources and filter for our ‘nasdaq_demo_john’ data source, which we’ll select.

From there, we’ll create a new workbook and drag the following items to build our visualization:
columns: symbol
rows: SUM(count)

Click the GIF to view the animation.

viz_data_tableau

And so…

Want to get the best out of your data analytics infrastructure with the minimum of expense, hassle, and configuration? Then try out Treasure Data.

Our universal data connectors can route almost any data source to almost any data destination. Want an integration we don’t support yet? Ask us!

Sign up for our 14-day trial today or request a demo! You can also reach out to us at sales@treasuredata.com!

John Hammink
John Hammink
John Hammink is Chief Evangelist for Treasure Data. An 18-year veteran of the technology and startup scene, he enjoys travel to unusual places, as well as creating digital art and world music.