My journey into Amazon Web Services (AWS) started the way I’m sure it does for many people – set up a free account, create an EC2 instance along with an S3 bucket… then wonder what to do next. This blog post covers a couple of AWS, Spark, and RStudio training resources that I used to advance my AWS experience. The purpose is to help a reader get AWS, Spark, and RStudio set up and running.
After I first created my AWS account, I had to deal with the fear of a massive credit card charge from some simple mistake. Fortunately, I haven’t experienced that yet. AWS offers a “first year free” promotion, but I think that only covers one EC2 instance and one S3 bucket. That is enough to get started, and additional instances or buckets don’t cost much to add and use. That being said, it is a good habit to verify that unused instances and buckets are shut down and/or terminated when not in use.
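That verification habit can even be scripted. Here is a minimal sketch using the AWS CLI (assumed to be installed and configured; the instance ID is a placeholder). The `run` helper only prints each command, so nothing gets stopped or terminated by accident:

```shell
# Safe-to-run sketch: run() only prints the commands; swap echo for eval
# (or copy the commands out) to actually execute them.
run() { echo "+ $*"; }

# List any instances still in the "running" state
run aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text

# Stop is resumable (the disk is kept); terminate is permanent
run aws ec2 stop-instances --instance-ids i-0123456789abcdef0
run aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```

Running the describe-instances check before walking away from a session is a cheap way to avoid surprise charges.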
AWS – If you haven’t already, go ahead and take the leap
If you haven’t set up your AWS account yet, you can do so here:
Training Resource One: O’Reilly Training Video
Using R for Big Data with Spark – Training Video Information
Title: Using R for Big Data with Spark
By: Manuel Amunategui
Final Release Date: October 2016
Publisher: O’Reilly Media
AWS and SparkR Installation and Configuration
The first part of the training video provides a very good introduction to AWS and Spark. Although the total run time is just over two hours, the early lessons do a great job of walking through the steps to create an EC2 instance, build a cluster to use with SparkR, and install RStudio on the new cluster. There is also a separate, couple-minute lesson on how to terminate instances when finished. Keeping that lesson separate makes it easy to refer back to whenever you need to terminate instances – like when taking a break between lessons. 😉
Tip: Keep a separate text file with the step-by-step procedure and Linux commands used to set up the AWS EC2 instance, cluster, and RStudio installation. This will help you quickly set up a cluster and install RStudio in future training sessions.
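As a starting point for that notes file, here is a rough sketch of the RStudio Server install steps for the master node. The package version and download URL are from the 2016 era of the video and will almost certainly need updating, and the notes filename and login user are just examples:

```shell
# Write a reusable notes file with (2016-era) RStudio Server install steps.
# The rpm version/URL and the user name are assumptions to adapt.
cat > aws-sparkr-setup-notes.sh <<'EOF'
# --- run on the cluster master node over SSH ---
sudo yum update -y
sudo yum install -y R
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.136-x86_64.rpm
sudo yum install -y rstudio-server-rhel-1.0.136-x86_64.rpm
# RStudio Server serves a web UI on port 8787; add a login user for it
sudo useradd rstudio-user
sudo passwd rstudio-user
EOF
echo "wrote setup notes to aws-sparkr-setup-notes.sh"
```

Check the RStudio Server download page for the current URL before reusing the `wget` line.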
Data Modelling and Data Sources
The Using R for Big Data with Spark training video also includes some very good sections on Data Modelling and Data Sources. The Data Modelling lessons cover the theory along with some good examples of RStudio running on an AWS cluster. The Data Sources lessons go into the details of AWS S3, with examples of storing both the source data and the results produced in RStudio.
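The same S3 flow can also be driven from the command line. This is a hedged sketch with a made-up bucket and file names; as before, the `run` helper prints the commands rather than executing them:

```shell
# Print-only sketch of moving data between a workstation and S3.
# Bucket and file names are hypothetical; swap echo for eval to execute.
run() { echo "+ $*"; }

BUCKET="s3://my-sparkr-bucket"

run aws s3 cp source_data.csv "$BUCKET/input/source_data.csv"  # upload source data
run aws s3 ls "$BUCKET/input/"                                 # confirm it landed
run aws s3 cp "$BUCKET/output/results.csv" results.csv         # fetch results
```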
Training Resource Two: Nerdery SparkR Talk
RStudio Server on Amazon EMR – Presentation at Big Data Tech 2016 Conference
Presenters: Chad Dvoracek and Brandon Veber
Company: Nerdery (https://www.nerdery.com/)
Date: June 2016
Slides: BigDataConference2016.pdf (file located in the GitHub repository)
Nerdery was one of the sponsors of the Big Data Tech 2016 conference, and Chad and Brandon gave a presentation there about AWS, Spark, and RStudio. The first dozen slides cover reasons to use AWS for Data Science work. The remaining slides walk through an example with a quick EMR (Elastic MapReduce) setup and some data analysis from a project on Kaggle (https://www.kaggle.com/).
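For readers who prefer the command line, a console-based EMR setup like the one in the slides maps roughly onto a single AWS CLI call. This is a sketch, not the presenters’ exact configuration: the cluster name, key pair, instance type, and count are placeholders, and `emr-5.0.0` simply matches the 2016 timeframe of the talk. The `run` helper only prints the command so it is safe to review first:

```shell
# Print-only sketch of launching an EMR cluster with Spark and Hive.
# All names and sizes are placeholders; swap echo for eval to launch.
run() { echo "+ $*"; }

run aws emr create-cluster \
  --name "sparkr-demo" \
  --release-label emr-5.0.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-ec2-keypair
```

Remember that a cluster launched this way keeps billing until it is terminated, which loops back to the cleanup habit mentioned earlier.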
Differences between the two resources
- How the AWS clusters are created: the O’Reilly training video creates the cluster with a Linux command at the EC2 instance prompt, while the Nerdery presentation creates the cluster through the AWS console GUI.
- Data set size: O’Reilly used smaller data sets to walk through examples, while the Nerdery presentation used a large data set to show the benefit of distributed processing.
- Hive: the Nerdery example also pulled the data into Hive.
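On that last point, one common way to pull data into Hive on EMR is to define an external table over files already sitting in S3. The table name, columns, and bucket below are hypothetical, just to show the shape of the statement:

```shell
# Hypothetical Hive DDL: expose CSV files in S3 as a queryable table.
# Run it on the cluster with: hive -e "$HQL"
HQL="
CREATE EXTERNAL TABLE IF NOT EXISTS trips (
  trip_id      STRING,
  duration_sec INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-sparkr-bucket/input/';
"
echo "$HQL"
```

Because the table is EXTERNAL, dropping it in Hive leaves the underlying S3 files untouched.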
Either training resource will help a new Data Scientist get an RStudio environment set up on AWS. The examples help build confidence while working in a cloud environment. Go ahead – give it a try and see what happens. Good Luck!!!