Meetup: State of the art R Programming from the Bay Area R Users Group

Back on May 10, Treasure Data hosted the Bay Area useR Group Meetup at our Mountain View facility, featuring Joseph Rickert, program manager for the Microsoft R Open team, blogger, and R evangelist.

Joseph, who headed up technical marketing at Revolution Analytics prior to Microsoft’s acquisition of the company, is widely known as a major authority on R programming. (In fact, I’ve enjoyed the opportunity to learn about R and machine learning from Joe more than once!)


After opening remarks, Joseph began with news from the R Consortium, including some recent announcements and general information. The R Consortium is a trade organization whose mission is to support the R infrastructure for the entire community. We also learned a bit about R’s architecture before hearing about efforts toward R Implementation, Optimization and Tooling.

Incidentally, did you know that only around 25% of R packages have vignettes (associated long-form documentation), largely because of the English-language barrier faced by R’s many international users? We hope this leads to more localization (L10N) efforts; it would complement the work around training instructors and running working groups that Joseph also talked about.
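
If you’re curious how that figure looks on your own machine, here’s a minimal sketch using base R’s tools package. Note that it only counts the packages installed in your local library, not CRAN as a whole:

    # Share of locally installed packages that ship at least one vignette
    vigs <- tools::getVignetteInfo()                   # matrix: one row per installed vignette
    with_vigs <- unique(vigs[, "Package"])
    all_pkgs  <- rownames(installed.packages())
    round(100 * length(with_vigs) / length(all_pkgs))  # percent with vignettes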

The R Consortium funds efforts to improve the R infrastructure in various ways through grants, and R developers can take advantage of this: a call for proposals for community projects is currently open (until July 10).

You can view the video of Joseph’s talk here.


Next up, Stanford University fellow Pete Mohanty spoke about optimizing non-parametric regression in R. Regularization allows us to maximize inference while minimizing the number of assumptions we need to make. But while regularization (roughly, a weighted averaging that smooths over the messiness of our data) helps mitigate overfitting, some regularization techniques require computing N x N weight matrices, which can be very costly and unsuitable for large datasets.
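
As a rough illustration of the kernel-based approach Pete described, here’s a minimal sketch using the KRLS package from CRAN on toy data of our own (not data from the talk). The pairwise kernel matrix it builds is N x N, which is exactly where the cost comes from:

    library(KRLS)   # Kernel Regularized Least Squares

    set.seed(42)
    n <- 500                                   # keep N small: the kernel matrix is n x n
    X <- matrix(rnorm(n * 2), ncol = 2)
    y <- sin(X[, 1]) + 0.5 * X[, 2] + rnorm(n, sd = 0.2)

    fit <- krls(X = X, y = y)                  # fit plus pointwise marginal effects
    summary(fit)                               # average marginal effects and fit statistics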

In addition to explaining KRLS (Kernel Regularized Least Squares), Pete’s talk also discussed:

  • An example of Treatment Effect Heterogeneity in a ‘Get Out The Vote’ Field Experiment;
  • Introducing bigKRLS (computational constraints in R and strategies to avoid them; see the sketch after this list).
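
For larger problems, Pete’s bigKRLS package pairs the same estimator with big-memory matrices and parallel C++ to push back the point where the N x N computation becomes infeasible. A minimal sketch, assuming the CRAN bigKRLS interface:

    library(bigKRLS)   # scalable KRLS built on bigmemory and parallel C++

    set.seed(42)
    N <- 2000                                  # a few thousand rows; plain KRLS slows down here
    X <- matrix(rnorm(N * 3), ncol = 3)
    y <- X[, 1] - 0.5 * X[, 2]^2 + rnorm(N)

    out <- bigKRLS(y = y, X = X)               # same flavor of output: fit plus marginal effects
    summary(out)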

Pete’s talk is on video here.

Finally, Tom Miller talked to us about Benchmarking R. While R was originally intended only for statistical analysis, the language has morphed, out of necessity, into a multi-purpose programming language that can handle everything from text processing to databases and websites. With all this use comes the notion of the “memory cliff” (which Pete touched on earlier): the point where the dataset you are working with no longer fits in memory. What do you do?
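
One thing you can do, short of buying a bigger machine, is estimate the footprint before you load the data and, if it won’t fit, keep it out of RAM. A minimal sketch of that idea; the bigmemory package here is our own choice of illustration, not something from Tom’s talk:

    # Back-of-the-envelope footprint: 50 million rows x 20 numeric columns
    rows <- 5e7; cols <- 20
    rows * cols * 8 / 1024^3            # ~7.5 GB of doubles -- compare to available RAM

    # One way off the cliff: a file-backed matrix that lives on disk, not in RAM
    library(bigmemory)
    bm <- filebacked.big.matrix(nrow = 1e6, ncol = 20, type = "double",
                                backingfile = "big.bin", descriptorfile = "big.desc")
    bm[1, ] <- rnorm(20)                # reads and writes go through the file backing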

Tom, from TidalScale, discussed running a benchmark by combining several physical servers into one large virtual machine, on which he ran several models against a very large dataset while avoiding the memory cliff. Learn more about both the benchmark and the virtualization tools he used by watching Tom’s video here.
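
If you want a feel for what benchmarking a model fit looks like at small scale, base R’s system.time() is enough; the pattern is the same whether the data lives on one laptop or on a cluster presented as a single machine:

    # Time a simple model fit; elapsed time is the number to compare across setups
    set.seed(1)
    n <- 1e6
    df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    df$y <- 2 * df$x1 - df$x2 + rnorm(n)
    system.time(fit <- lm(y ~ x1 + x2, data = df))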

What do YOU think?

Do you use R for analytics in production? At Treasure Data, we combine large-scale data analytics with the power of R. You can read about our integration with the R programming language here.

Sign up for our 14-day trial today or request a demo! You can also reach out to us at sales@treasuredata.com!

John Hammink
John Hammink is Chief Evangelist for Treasure Data. An 18-year veteran of the technology and startup scene, he enjoys travel to unusual places, as well as creating digital art and world music.