Big Data Tech Day 2018

Background

Big Data Tech Day is a great event put on by MinneAnalytics every year. MinneAnalytics is a Minnesota non-profit organization dedicated to serving Minnesota’s Data Science and Analytics community. Big Data Tech Day brings in top speakers and sponsors to discuss topics in Data Science.

For more information about MinneAnalytics, check out their web site at http://minneanalytics.org or find them on Twitter (@MinneAnalytics). The MinneAnalytics Big Data Tech Day page has the speakers’ backgrounds and sponsor information (http://minneanalytics.org/bigdatatech/).

Thanks to MinneAnalytics and to the volunteers, sponsors and presenters who put together another great event this year!

Quick Summary of Big Data Tech Day 2018 Sessions I Attended

General Disclaimer: The following summaries are from my notes on the presentations that I attended. If the slides are posted, I’ll update this post with the URL.

Modern Big Data in the Era of Cloud, Docker and Kubernetes

Slim Baltagi

Theme:  Microservices based on Containers
– Overview of Docker
– Overview of Kubernetes
– Overview of Apache tools used in Big Data (Spark, Kafka, Flink, Cassandra and ZooKeeper)
– A few links to free training for Machine Learning on Kubernetes; main training site: http://www.katacoda.com

Best of Cloud, On-Prem, and People Growing Analytics to ML

Bryan Whitmore

Theme:  Four Common Needs For Customers’ Machine Learning
1.  Data Discovery
2.  Unlocking “Dark” Data
3.  Performance at Scale
4.  Overcoming Silos

The Future of ETL (Isn’t What It Used To Be)

Gwen Shapira (@gwenshap)

Presenter’s blog post on this topic:
https://www.confluent.io/blog/the-future-of-etl-isnt-what-it-used-to-be/

Ms. Shapira is one of the authors of the book “Kafka: The Definitive Guide” (ISBN-13: 978-1491936160).

Interesting presentation on ETL past, present and future.
Past – Classic Data Warehousing Modelling
Present – Big Data:  one big blob destination – the data is turned over, so now it is someone else’s problem.  😉
Future – Data Streams using products like Apache Kafka in Cloud, Microservices and DevOps

ETL has evolved with new terminology, technology and processes.  But the desired outcomes of new Data Streams are still the same as traditional ETL.
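
To make that last point concrete for myself, here is a small plain-Python sketch. It is my own illustration, not something from the talk: the record fields, the function names and the sample data are all made up, and an ordinary Python iterator stands in for a stream such as a Kafka topic. The idea is that the same transformation applies whether the data arrives as one batch or one event at a time, and the result that lands in the destination is the same.

```python
from typing import Dict, Iterable, Iterator, List


def transform(record: Dict) -> Dict:
    """Hypothetical cleanup step: tidy the customer name, convert cents to dollars."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": record["amount_cents"] / 100,
    }


def batch_etl(source: List[Dict]) -> List[Dict]:
    """Classic ETL: extract the whole batch, transform it, load it in one go."""
    return [transform(record) for record in source]


def stream_etl(source: Iterable[Dict]) -> Iterator[Dict]:
    """Streaming style: transform each event as it arrives and pass it along."""
    for event in source:
        yield transform(event)


# Made-up sample data, used for both styles.
orders = [
    {"customer": "  ada lovelace ", "amount_cents": 1250},
    {"customer": "grace hopper", "amount_cents": 300},
]

warehouse = batch_etl(orders)                 # nightly batch load
stream_sink = list(stream_etl(iter(orders)))  # iterator stands in for a stream
```

Both `warehouse` and `stream_sink` end up holding the same cleaned records; what changes between the two styles is when each record gets processed, not what the outcome looks like.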

Predicting Heart Failure with 125 Dimensions and 1.7 Million Patients

Chris Manrodt

The presentation was an interesting look at a company’s journey through a Data Analytics project in the biomedical field.
Data issue – the data set was based on customer care, which was unstructured and not designed to be transferred between systems.
Regulatory challenge – there was no direct “how-to” for obtaining regulatory approval.
Assessment of Analytics Options – the most interesting part was how the team arrived at their solution.

Calculating Genetic Diversity for a Registry of More Than 8M Bone Marrow Donor Volunteers

Eric Williams, Pradeep Bashyal and Debra Turner

I liked the flow and structure of this presentation.  (I needed the background information to understand the project being discussed.)
– Background on what information needed to be analyzed
– Problem Domain / Product Evaluation / Tips for Big Data Exploration
– Results / Outcomes
– Lessons Learned

Agile Data Engineering:  Eliminate the Complexity of Big Data Through Automation

Ramesh Menon

Theme:  Automation
Excellent business case and high-level path for automating data processes.
Data Engineering is the obstacle on the path to Agility.

Manage your data processes quickly and easily
– Large number of use cases
– Large amounts of data
– Large number of users
– Handle rapid changes

Agile Data Engineering Platform
– Data Ingestion and Synchronization
– Data Transformation
– High-Performance Models
– Production Operations

Note:  Make it easy for people who know the data to write queries.

Containerization: Containing the Data Beast

Ashley Nelson and Reshu Yadav

Author’s note:  Well, I like the “pets versus livestock” analogy since I don’t get attached to food I’m planning to eat someday.  😉

Seriously, this was a great presentation on how to organize a Data Analytics team and project using Containers.  Nothing adds cost to a project faster than rework – both in terms of time and money.  Many Data Science issues quickly surface if teams are not organized.  Ms. Nelson and Ms. Yadav did a very good job describing the team’s initial problems, evaluating solutions and implementing a Container platform change to improve workflow.

Presentation Topics:
– Containerization Technology Overview
– General Mills Data Science Journey with Containers
– General Mills Future Plans
– Advice on Containers
