Hadoop for Statisticians



This workshop is now sold out. To register your interest for this workshop to be held again, please click here.

A workshop with Hon Hwang, Jan Luts and Louise Ryan – Sydney

About the course

These days, everybody seems to be talking about big data! As statisticians, we are very familiar with analyzing big, complex data sets. However, we are less sure how to proceed when datasets are so big that they won’t fit onto a single, standard computer. There are a number of possible strategies, including the use of more powerful machines and supercomputers, or the use of GPUs (graphics processing units).

In the machine learning world, however, there have been remarkable developments in recent years related to the analysis of massive datasets that are stored in distributed data systems. Google’s MapReduce paradigm set the stage for this work, the concept being to divide the data into manageable subsets, analyze each subset separately, then combine the results. Open source implementations of MapReduce such as Hadoop, and similar systems such as Spark, have appeared in recent years. At present, only relatively simple statistical analysis tools and methods are available in these distributed data systems. Hence there are major opportunities for statisticians to contribute by developing new algorithms that work in the distributed data setting.

However, there is quite a steep learning curve in getting started with systems such as Hadoop, and a relatively high level of computing sophistication is needed. This workshop aims to help bridge this gap by providing a hands-on introduction to Hadoop. The workshop will explain the broad concepts of map/reduce, then get participants running simple Hadoop jobs on a cluster supported through Amazon Web Services. We will go over some of the interesting new methods being developed for analysis on distributed data systems, such as the Bag of Little Bootstraps and logistic regression for massive data sets.
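As a rough illustration of the divide-then-combine idea behind map/reduce, the sketch below computes a mean and sample variance from per-subset summaries. It is plain Python running on one machine, not Hadoop code and not workshop material; the function names (`split`, `map_chunk`, `combine`, `mean_and_variance`) are hypothetical and chosen only for this example.

```python
from functools import reduce
from typing import List, Tuple

# Per-subset summary: (count, sum, sum of squares).
Stats = Tuple[int, float, float]


def split(data: List[float], n_chunks: int) -> List[List[float]]:
    """Divide the data into subsets, standing in for blocks on different machines."""
    return [data[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk: List[float]) -> Stats:
    """Map step: reduce one subset to its sufficient statistics."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a: Stats, b: Stats) -> Stats:
    """Reduce step: merge two summaries by adding them component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(data: List[float], n_chunks: int = 4) -> Tuple[float, float]:
    """Compute the mean and sample variance from per-subset summaries only."""
    partials = map(map_chunk, split(data, n_chunks))
    n, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    mean = total / n
    # Sum-of-squares shortcut: simple, but can lose precision on large values.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance
```

Calling `mean_and_variance(values, n_chunks=3)` on a list of floats gives the same answer as a single pass over all the data, because the summaries add up exactly. Statistics without such additive summaries, such as medians or iteratively fitted models, are harder to distribute, which is where new algorithms are needed.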


Familiarity with R will be assumed. Ideally, participants should be working from a Mac or running Linux. Windows users can participate, though the interface is slightly more complicated and involves some extra steps. Participants will receive instructions shortly before the course regarding materials that should be preloaded onto their machines. Some optional pre-reading materials may also be distributed ahead of the course.

About the Instructors

Hon Hwang received a Bachelor of Engineering, Computer Systems (First Class Honours) from the University of Technology Sydney (UTS) in 2006. After this, he joined CSIRO as a Research Engineer, working on a variety of different problems, including computer and information security, remote telecommunication, Atlas of Living Australia, and personal profile matching. In 2012, he became an independent contractor, working in data integration and social media monitoring projects. After discovering that data analysis was the part of his work that he most enjoyed, he enrolled to study for his PhD in statistics, under the supervision of Professor Louise Ryan, at University of Technology Sydney. Hon plans to do his PhD research on the topic of statistical analysis on distributed data systems.

Jan Luts received a Master of Information Sciences, with a specialty in Multimedia, from Universiteit Hasselt, Belgium, in 2003. He also received Master’s degrees in Bioinformatics and Statistics from Katholieke Universiteit Leuven, Belgium, in 2004 and 2005, respectively. After obtaining his PhD at the Department of Electrical Engineering (ESAT) of Katholieke Universiteit Leuven in 2010, he continued there for two years as a postdoctoral fellow. In 2012 he moved to Australia, where he worked as a research fellow in Statistics at the School of Mathematical Sciences at the University of Technology Sydney, under the guidance of Professor Matt Wand. Last year he moved into the private sector as a Data Scientist at The Search Party – an online marketplace that connects employers, job seekers and recruiters to help people find jobs better, faster and easier.

Louise Ryan received her BA in Statistics and Mathematics from Macquarie University in 1978 and her PhD in Statistics from Harvard University in 1983. After spending the next 25 years as a faculty member at the Harvard School of Public Health, she returned to Australia in 2009 to take up the role of Chief of the Division of Mathematics, Informatics and Statistics at CSIRO. She returned to academia in late 2012, joining the University of Technology Sydney as a Distinguished Professor of Statistics. Louise is well known for the development and application of statistical methods in health and environmental sciences. She presently serves as President of the NSW Region of the Statistical Society of Australia.

Course fees:

Student members of SSAI: $50

Student non-members of SSAI: $75

Non-student members of SSAI: $150

Non-members of SSAI: $300

There are only 25 places available for this workshop, so please book early to avoid disappointment.


Registrations close on 12 August 2015 or earlier if the course is booked out.


Travel Expenses

Occasionally workshops have to be cancelled because too few people register. Registering early helps to prevent this. Please contact the SSAI Office before making any travel arrangements to confirm that the workshop will go ahead, because the SSAI will not be held responsible for any travel or accommodation expenses incurred due to a workshop cancellation.

Cancellation Policy

Cancellations received prior to 11 August 2015 will be refunded in full. Confirmation of the refund having been processed will be emailed. Should additional documentation pertaining to the refund be required, a $20 administration fee will be charged.

After 11 August 2015 no part of the registration fee will be refunded. However, registrations are transferable within the same organisation. Please advise any changes to [email protected].




Hadoop for Statisticians
When: 18/08/2015
Time: 9:00 am - 5:00 pm
Cost: from $50.00
Location: University of Technology Sydney,
638 Jones Street,
