Understanding the Book-Crossing Dataset: Setup

I'm a data scientist at Treasure Data. In a series of blog entries, I want to introduce how to use our platform by interacting with a concrete dataset. I chose the publicly available Book-Crossing Dataset as our base data.

What’s in the Dataset?

The Book-Crossing Dataset consists of three logs: users, books and ratings. Let’s see what kind of data is inside.

When I study new logs (By the way, I love interacting with logs and data in general), the first question I ask is, are they "status" or "action"?

“Status” logs describe a state of an entity. In the current case, users and books are status logs. For example, users describe the age and location of each user while books record the title, author, year of publication and publisher of each book.

“Action” logs describe an action between (or among) entities. In the current case, ratings is an action log because in each line, a user (entity 1) give a rating (action) to a book (entity 2).

image

Setup!

Before we start analyzing the data, we should put it on Treasure Data Platform.

Get Treasure Data blogs, news, use cases, and platform capabilities.

Thank you for subscribing to our blog!

  • First, you should sign up and get an account.
  • You also need to download the command line client on the <a href=”http://toolbelt.treasure-data.com” target=”_blank”>tool belt page.
  • Finally, you need to download the dataset. I've put together a MessagePack formatted dataset for you.

Now that you have everything, let's create tables to which we can upload the data.

td help is your friend

You might say, “Wow, what are those commands?” Don’t fret, you can list all the commands with td help:all and td help shows all the options for the particular command. For example, here is all the options for td query, the command we will be using to query our uploaded data.

Writing some basic queries

Now, let's look at some data. One handy command to take a peek at data is td table:tail. Here is an example.

But that's just tailing your table. The command you want to use is the aforementioned td query. Here are a couple of examples.

That just looked up five users. Now, let's count how many users there are.

Wrap-up

That's it for today. In the next entry, we will analyze data using td query (with a bunch of pretty graphs!) Stay tuned!

Taka Inoue
Taka Inoue
Taka Inoue is Chief Data Scientist at Treasure Data. He began his career as one half data engineer (he founded the MongoDB User Group Japan) and one half statistician at a gaming studio. Taka's deep expertise of "real world" data science is a boon to our customers.
Related Posts