StrataConf by the Numbers: As Hadoop Stoops, Machines Learn and Spark
With my third Strata + HadoopWorld conference around the corner, I decided to do a bit of data analysis to search for any overarching trends for this massive convention by/for/of data nerds.
As a former data scientist (okay, “quantitative analyst” in finance counts, right?), I collected data from the previous eight Strata programs (four Strata and four Strata NY). Here are my findings, some more surprising than others. If you are interested in how I got the data, skip to the “methodology” section.
Hadoop in Decline?
My dataset consists of 1,448 Strata and Strata NY talks culled from 2011-2014 with their titles.
The first thing I looked at is the total number of talks per year.
As you can see, the number of talks is increasing every year. In fact, there will be 248 sessions at Strata 2015, up from 222 a year ago. The biggest jump, however, came between 2011 and 2012.
So, what are these 1,448 talks about? I didn’t have the time to read the abstracts of all 1,448, so I decided to skim their titles. And by “skimming their titles,” I mean writing regular expressions to search for phrases. The first pair I searched for was “Big Data” and “Hadoop,” arguably the two biggest buzzwords that define today’s data profession. How often do “Big Data” and “Hadoop” appear in the title of the talks?
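As a rough sketch of what “skimming titles” with regular expressions can look like in Ruby: the sample titles and the `count_matches` helper below are made up for illustration and are not the script behind the graphs.

```ruby
# Hypothetical sample titles; the real dataset holds 1,448 of them.
TITLES = [
  "Big Data for Small Teams",
  "Hadoop in Production",
  "Taming big data with Hadoop",
  "Visualizing Networks"
].freeze

# Case-insensitive patterns for the two phrases (an assumption about
# how the matching was done).
PATTERNS = {
  "Big Data" => /big data/i,
  "Hadoop"   => /hadoop/i
}.freeze

# Count how many titles match each pattern.
def count_matches(titles, patterns)
  patterns.transform_values do |re|
    titles.count { |title| title.match?(re) }
  end
end

counts = count_matches(TITLES, PATTERNS)
counts.each { |phrase, n| puts "#{phrase}: #{n}" }
```

A title that mentions both phrases is counted once under each, which is what you want when asking how often each phrase appears.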
At first glance, it looks like “Hadoop” peaked in 2013 and is now in decline while “Big Data” is still going strong. However, the reader with a healthy dose of skepticism might point out that because the total number of talks differs from year to year, you can’t compare one year to another just by looking at the number of talks. Instead, what we should look at is the percentage. In other words, what fraction of the talks contains “Big Data” or “Hadoop” each year? Here is the revised graph:
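The normalization itself is simple: divide each year's matches by that year's total. Here is a minimal Ruby sketch, assuming the data is a list of `[year, title]` pairs; the rows and the `share_by_year` helper are hypothetical.

```ruby
# Hypothetical [year, title] rows; not the real dataset.
TALKS = [
  [2013, "Hadoop at Scale"],
  [2013, "Data Journalism"],
  [2014, "Hadoop Security"],
  [2014, "Intro to Spark"],
  [2014, "Designing Dashboards"],
  [2014, "Streaming Analytics"]
].freeze

# Return { year => percentage of that year's titles matching pattern }.
def share_by_year(talks, pattern)
  talks.group_by { |year, _title| year }.transform_values do |rows|
    hits = rows.count { |_year, title| title.match?(pattern) }
    (100.0 * hits / rows.size).round(1)
  end
end

shares = share_by_year(TALKS, /hadoop/i)
shares.each { |year, pct| puts "#{year}: #{pct}%" }
```

In this toy sample both years have exactly one “Hadoop” talk, yet the share halves because the second year has twice as many talks overall. That is the effect the raw counts hide.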
Oh, no! It looks like “Big Data” has been on the decline since the beginning. But is that really the case?
There are many explanations why a smaller and smaller fraction of talks contains phrases like “Big Data” and “Hadoop.” One extreme explanation is that we are at the end of a hype cycle, and soon enough all Hadoop vendors will go out of business as people wake up from their Big Data dreams (and Hadoop nightmares).
But a more plausible hypothesis is the maturity of Big Data and related technologies. With so many talks, books, blogs and seminars on Big Data and Hadoop, most Strata participants should be familiar with the basics of Big Data and Hadoop by now; consequently, there are fewer talks covering “Big Data” or “Hadoop” explicitly because they would be redundant for the audience.
There is one interesting observation here: Big Data came a good two years before Hadoop as represented by Strata session titles. Today, it almost seems that Hadoop and Big Data are synonymous, but that certainly was not the case in 2011.
I should also note that Small Data seems to be the next big thing after Big Data; 2014 was the first time “Small Data” showed up in talk titles, albeit in just a few:
- Small Data in Sports: Little Differences that Mean Big Outcomes
- The Sidekick Pattern: Using Small Data to Increase the Value of Big Data
- Small Data Problems
If you actually read the linked abstracts, you realize that “Small Data Problems” and “Small Data in Sports” discuss Small Data in counterpoint to Big Data, whereas “The Sidekick Pattern” is actually about processing Big Data more effectively by looking at smaller data first. Had I not read the abstracts, I would have grouped all three as “cautionary tales against the Big Data fad.”
There is an important lesson to learn here: Always look at raw data as closely as possible. When you summarize data, some information is munged and invariably lost.
Machines Learn and Spark!
If “Big Data” and “Hadoop” comprise a smaller fraction of Strata headlines, what’s replacing them? I haven’t done the full n-gram analysis, but “machine learning” and “Spark” are on the rise:
The numbers are still small at ~13 talks (~5.8 percent), but the upward trend is clear. It’s expected that the two graphs are rising in lockstep; Spark’s flagship use case is machine learning calculations.
Streaming in Real Time?
Another pair of highly correlated keywords rising in popularity is “stream” and “real time.” There is no surprise here either, since most “real time” data come in as streams.
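Keywords like these come in several spellings, so the pattern matters. The Ruby sketch below shows one way to make the match tolerant; the two expressions, the `tags_for` helper and the sample titles are illustrative assumptions, since the exact patterns behind the graph are not shown in this post.

```ruby
# Illustrative patterns (assumptions, not the ones used for the graph).
STREAM_RE    = /\bstream/i            # stream, streams, streaming
REAL_TIME_RE = /\breal[\s-]?time\b/i  # real time, real-time, realtime

# Return the keywords that a title matches.
def tags_for(title)
  tags = []
  tags << "stream" if title.match?(STREAM_RE)
  tags << "real time" if title.match?(REAL_TIME_RE)
  tags
end

# Hypothetical titles.
samples = [
  "Real-Time Analytics with Storm",
  "Streaming Data at Scale",
  "Realtime Stream Processing",
  "The Mainstream View of Data"
]

samples.each { |title| puts "#{title}: #{tags_for(title).join(', ')}" }
```

The leading `\b` keeps “Mainstream” from counting as a “stream” talk, a small false positive that a bare substring search would let through.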
The Longest and Shortest Titles
Naming a talk is really hard. You want enough keywords to catch skimmers’ attention, but not so many that the title comes across as fake and faddish.
The longest title was “Office Hour with Joshua Patterson (Accenture Technology Labs), Nathan Shetterley (Accenture), and Michael Wendt (Accenture Technology Labs)” at 139 characters, still fitting in a tweet (but it could have been “Office Hour with Joshua Patterson, Nathan Shetterley and Michael Wendt (Accenture)”). The shortest belongs to “R Day,” a tutorial by the RStudio team, which I hope to check out again this year.
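Finding the extremes takes only a couple of lines. Here is a small Ruby sketch using three titles quoted in this post as a stand-in for the full list; the `title_extremes` helper is hypothetical.

```ruby
# A tiny stand-in for the full list of titles.
titles = [
  "R Day",
  "Small Data Problems",
  "The Sidekick Pattern: Using Small Data to Increase the Value of Big Data"
]

# Return [shortest, longest] by character count.
def title_extremes(titles)
  titles.minmax_by(&:length)
end

shortest, longest = title_extremes(titles)
puts "Shortest (#{shortest.length} chars): #{shortest}"
puts "Longest (#{longest.length} chars): #{longest}"
```

Note that `length` counts characters, not bytes, so titles with non-ASCII characters are measured the way a reader would count them.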
So, that’s it for the analysis, and now I need to prove I didn’t generate random graphs.
Thankfully, StrataConf’s website has a consistent URL scheme for all past events: the “all speakers” page has a URL of the form /EVENT_NAME/public/schedule/speakers, and the session page URLs look like /EVENT_NAME/public/schedule/detail/ID. Here is the script that I used to extract all 1,448 talk titles and URLs from the eight “all speakers” pages:
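require 'nokogiri'
require 'open-uri'

# Matches session links such as /EVENT_NAME/public/schedule/detail/ID
STRATA_RE = Regexp.new('/(?<event>[a-z0-9-]*)/public/schedule/detail/')
# ARGV[0] is the URL of one "all speakers" page
data = Nokogiri::HTML(URI.open(ARGV[0]).read)
data.css("a").each do |link|
  begin
    href = link.attr('href')
    m = STRATA_RE.match(href)
    next unless m
    event = m['event']
    title = link.text
    # Emit one tab-separated row per session link
    puts [event, title, href].join("\t")
  rescue => e
    # The original handler is not shown; logging and moving on is an assumption
    warn "Skipping link: #{e.message}"
  end
end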
Since I am an R user, data manipulation and visualization were done using Hadley Wickham’s excellent dplyr and ggplot2. For example, here is the R code snippet that produced the “Strata: Up to the Right” graph (the “strata” variable holds the data):
library(dplyr)
library(ggplot2)
library(scales)  # provides date_format()

# Count talks per year, split into Strata NY ("ny" in the event name) and Strata
d <- strata %>% group_by(year) %>% summarise(ny = sum(grepl("ny", event)), sj = sum(!grepl("ny", event)))
ggplot(d, aes(year)) +
  geom_line(aes(y=sj, color="Strata"), size=1.1) +
  geom_line(aes(y=ny, color="Strata NY"), size=1.1) +
  ylab("# of talks") +
  scale_x_date(labels=date_format("%Y"), breaks="1 year") +
  scale_color_manual("", values=c("Strata"="red", "Strata NY"="#444444")) +
  ggtitle("Strata: Up to the Right") + theme(plot.title=element_text(size=24, vjust=2))
Finally, the data is available here.
I’m curious: do you believe the trends apparent in the Strata + Hadoop World session titles are representative of the industry as a whole? Contact me on Twitter @KiyotoTamura or comment below.
And please visit us at Strata, booth 1324. We look forward to talking with you!