
IMS Bulletin: Global Attention focused on China: The 7th China R Conference in Beijing

It is not surprising that this big conference in China attracted so much attention. The meeting report in the August 2014 issue of the IMS Bulletin described it as ‘a folk festival’ and praised its great influence.

The China R Conference has become a great platform for advancing data science in both industry and academia, not merely an occasion to promote data science with R. The IMS Bulletin compared it with the 1st China R Conference, which was organized by only a small group of students and the student-run website Capital of Statistics (COS). Six years later, many more people from various fields and occupations came together, bringing all kinds of programming languages. The IMS Bulletin also pointed out that participants are getting younger, which suggests a rosy future.

As one of the sponsors, SupStat was very glad to attend the whole conference and enjoyed this great chance to see people sharing wonderful ideas.

The IMS Bulletin is one of the publications of the IMS (Institute of Mathematical Statistics), a global community of scholars and professionals in statistics and probability. As a featured publication, it carries articles and news about local and international meetings of interest to IMS members.

Meeting Report in IMS Bulletin

NYC R Programming Classes – starting on July 13th, 2014 (this coming Sunday)

We are happy to announce our 6th R class in NYC in the past 10 months. The R community is booming. Learn Data Science at the heart of NYC with us!

You can sign up for our Sunday Intensive beginner level R classes at
NYC Data Science Academy (click to sign up) or email vivian.zhang@supstat.com for more info.

Brief: The course (which will meet five Sundays) will start from the basics,
introducing the building blocks used for programming in R and building
intuition for writing clean and robust code. We will move on to cover
data analysis, applications of statistical techniques, and graphing. This course will introduce you to the wonderful world of R and give you an excellent understanding of the language, leaving you with a firm foundation to build upon.

Date: July 13th, 20th, 27th, and Aug 3rd and 10th (five Sundays)

Time: 10:00 am to 5:00 pm

Venue: 500 7th Ave, 17th Fl., New York, NY (close to Times Square)

Instructors:
Scott Kostyshak (Data Scientist @ SupStat Inc, 5th-year PhD student at Princeton Univ.)
Charlie Redmon (Project Manager @ SupStat Inc, Master's degree)

Cost:

* $1500 for all 5 classes: highly recommended for absolute beginners or those with little experience who want to master the content and feel comfortable working with R from now on.

* $350 per class: for those who already have good R programming skills and want to strengthen specific topics.

* For group (5+ persons) and enterprise pricing, please email vivian.zhang@supstat.com

Course Outline:
(Content may be adjusted depending on how the class progresses)

Why R is important
R is a free, full, and dynamic programming language that, since its release in 1996, is on course to eclipse traditional statistical packages as the dominant interface in computational statistics, visualization, and data science. As an open-source platform, R has grown to become an incredibly flexible tool that can be applied to nearly every graphical and statistical problem, at virtually no cost to the user. The community of R users is continuing to build new functionality.

Project Demo Day and Certificates

After progressing from the rudimentary building blocks of programming to data manipulation and advanced drawing packages, the course ends with a demonstration of a project of your choice on Project Demo Day. On Demo Day you will access and analyze real data, using the tools and skills taught throughout the course. After successfully completing the course, you will qualify for one of three certificates: Extraordinary Standing, Honorable Graduation, or Active Participation. Certificates are awarded according to your understanding, skill, and participation.

Refund Policy:

Students who decide to drop after completing the first class day may receive a full tuition refund.

Syllabus

1. Basics: 12 hours
Abstract: This unit explains basic operations. Students will learn the characteristics of R, how to find resources, and the fundamentals of programming.
Case Study and Exercises: Use the R language to solve selected Project Euler problems.

  • How to learn R
  • How to get help
  • R language resources and books
  • RStudio
  • Packages
  • Workspace
  • Custom Startup Items
  • Batch Mode
  • Data Objects
  • Custom Functions
  • Control Statements
  • Vectorized Operations

2. Getting Data: 6 hours
Abstract: Covers the various ways R reads data: the basics of web crawling, querying a database with SQL statements, and reading local files such as Excel spreadsheets.
Case and Exercises: Crawl watercress data on the site and write a custom function.

  • Web data capture
  • API data source
  • Connect to the database
  • Local files
  • Other data sources
  • Data Export

3. Data Manipulation: 6 hours
Abstract: How to manipulate data and use R for all kinds of data conversion, especially string processing.
Case Study and Exercise: Find the QQ (the most used instant messenger tool) group, then discuss research options with text features.

  • Sorting data
  • Merging data
  • Summarizing data
  • Reshaping data
  • Taking a subset of data
  • String manipulation
  • Date operations

4. Data Visualization: 6 hours
Abstract: Cover two advanced drawing packages (Lattice and ggplot2) and understand the various methods of visualization.
Case and Exercises: Using graphics, text and other data.

  • Histogram
  • Point
  • Column
  • Line
  • Pie
  • Box Plot
  • Scatter
  • Matrix related
  • Map

What does SupStat offer?
Our services include consulting on statistical methods, software training in statistical computing and data analysis (mainly R), statistical graphics and data visualization, and statistical reports. We have offices in Beijing, Shanghai, and New York. Our team includes Kagglers ranked in the top 0.1% (www.kaggle.com hosts excellent data mining competitions and gathers more than 100K data scientists). For business inquiries, please email vivian.zhang@supstat.com.

Jun 26-27, 2014 – Introduction to Data Science with R in NYC

You can register either through Eventbrite or on our school site, NYC Data Science Academy.

Date: Thursday/Friday, June 26th and 27th, 2014

Time:  9:00am to 5:00pm

Location: 500 7th Ave, 17th Floor, glass door classroom, New York, NY 10018

NYC Data Science Academy, the training sub-brand of SupStat (an official training partner of RStudio Inc), is hosting a two-day Introduction to Data Science with R course in New York City this June. The workshop is designed to provide a comprehensive introduction to R. We’ll get you programming and analyzing data with R in no time. All participants will receive a copy of all slides, exercises, data sets, and R scripts used in the course.

We will emphasize how you can get work done easily with the RStudio IDE.

Course Instructor:

Vivian S. Zhang (CTO of SupStat, Organizer of NYC Open Data Meetup, Founder of NYC Data Science Academy)

Information about our school:

Check out our past students’ excellent testimonials, along with the projects and blog posts they completed after finishing our classes.

Our Data School has trained 96 students in the past 8 months, on topics including R, Python, Hadoop, D3.js, Tableau, etc.

Discount:

Discount pricing is available for academics (33% off) and students (66% off). Space is limited; please email us to confirm your eligibility.

Who should take this course?

This class will be a good fit for you if you are just starting with R or have dabbled in R, but wish to improve your skills. No prior experience with R or data science is required.

What will you learn?

Practical skills for visualizing, transforming, and modeling data in R. During this two-day course, you will learn how to explore and understand data as well as how to do basic programming in R. A full list of topics for each day is below.

What should you bring?

Be ready to learn. You need your laptop and the latest version of R. We also recommend downloading the RStudio IDE, as it provides a great learning environment for beginners as well as tools for when you become an advanced user.

Overview

Are you interested in better understanding your data, and not so interested in mastering a programming language? Have you tried learning R from a book or website, but have been discouraged? If so, this is the course for you. We assume that you’ve never programmed before (although some experience doesn’t hurt), and we teach you the best tools to help analyze your data.

You won’t be a master programmer by the end of this two-day course, but through immersion you will have learned the basics of R’s syntax and grammar, and you’ll have started building an effective R vocabulary for visualizing, transforming, and modeling data. You will learn how to load, save, and transform data as well as how to write functions, generate beautiful graphs, and fit basic statistical models to your data. We’ll give you a theoretical framework to help you understand the process of data analysis, but our focus is on practical tools that you can use as soon as you get back from the course.

All techniques are motivated by real problems, and you’ll be exposed to a number of real datasets throughout the course. We alternate brief lectures with hands-on practice: you’ll get plenty of experience actually using R (not just hearing about it!), and there’s plenty of help available if you get stuck. The course concludes with a 90-minute data analysis project. You can use this as an opportunity to start using R with your data, or work on answering some of our questions about a dataset.

This tried and true course has been taken by over 200 students, from biologists to humanists, many of whom had never programmed before. This course teaches the basic skills needed by anyone seriously interested in data.


Day 1 – Getting started and working with data

Thursday, June 26th, 2014 9:00am to 5:00pm

An Introduction to R and data analysis - R is more than just a programming language. R is a statistical software application in its own right, an environment for interactive data analysis, and a community of passionate users. This orientation to the R language will help you get up and running.

  • How to download and update R and RStudio
  • How to find resources and help for R
  • Stages of data analysis
  • Best practices of data analysis

Visualizing data - R is well known for its beautiful graphics. R packages, like `ggplot2`, provide an expressive and logical language for building clear and effective data visualizations.

  • Visualize the distribution of a variable
  • Exploring and plotting relationships between variables
  • Display very large data sets through graphs without over-plotting
  • Use best practices for Exploratory Data Analysis in R code

Working with data - R is a programming language with a purpose: to analyze data. Learning how R stores and handles data will help you apply R to any data source.

  • Loading different data formats into R
  • Working with factors in R
  • How to clean poorly formatted data
  • Saving your data

Manipulating data - R’s methods for data manipulation make it easy and fast to extract information from data sets and to prepare raw data for analysis.

  • Subset, transform, summarize, and reorder data sets
  • Perform targeted, groupwise operations on data
  • Join multiple data sets together

Day 2 – Programming and modeling in R

Friday, Jun 27th, 2014 9:00am to 5:00pm

Programming in R - Many people use R as an application, a sort of statistical calculator, but R is also a programming language. Once you learn to program in R, you will be a more versatile and capable data analyst. You’ll learn to write code that provides the precise solutions you are looking for.

  • Create an if else statement
  • Write and optimize for and while loops in R
  • Use best practices for programming in R

R functions - Functions allow you to save your code for later or to share it with other R users. Knowing how to write a function will also streamline your workflow. Functions give code a more efficient structure that avoids duplication and aids debugging.

  • Organize a problem into a series of functions
  • Write a function in R
  • Apply best practices for writing functions in R

Simulation in R - Simulating data provides a way to test hypotheses and discover the uncertainty in your estimates.

  • Generate random numbers in R
  • Visualize uncertainty with bootstrapping in R
  • Construct a confidence interval with bootstrapping in R
  • Test a hypothesis with a permutation test in R

Modeling in R - R excels at statistical analysis and modeling, but its methods for modeling may seem unintuitive at first.

  • Write a formula in R
  • Fit a model to data in R
  • Compare models
  • Explore data sets with models

Disclaimer:

In certain cases, we may need to cancel this workshop due to circumstances beyond our control or otherwise. If this happens, SupStat will refund all registration fees for those who signed up. SupStat is not responsible for any related expenses incurred by registered attendees (including but not limited to travel and hotel expenses).

Refund policy:

Until Jun 15th, 2014 – Full refund, less 10% of registration fees
Jun 15th, 2014 to June 21st, 2014 – 50% refund of registration fees
Jun 22nd, 2014 and after – No refund available

Money-back guarantee:

All public workshops hosted by SupStat come with a no-questions-asked money-back guarantee.

The 7th China R Conference in Beijing

The 7th China R Conference in Beijing was held on May 24th and 25th at Renmin University of China. SupStat was really happy and honored to sponsor and attend this meeting.


This was the largest R conference ever held in China, with 1814 online registrations and some 50 further attendance requests made through personal connections and friendships after the online application system had closed! In the end, the free and limited seats had to be allocated on a first-come, first-served basis. As the photo below shows, people sat on the floor, leaned against the walls, and waited at the door for this free and open R party!

[Photo: the crowd at the conference]

There were many interesting talks! The first day’s meeting was held in one of the largest and most luxurious halls (the conference was free to attend, thanks to the sponsorship of SupStat, Revolution Analytics, RStudio, and others). Hadley Wickham’s talk on developing R packages and David Smith’s talk on how the growth of R helps data-driven organizations succeed opened the program. Dr. Kai Yu, Head of the Institute of Deep Learning at Baidu, Dr. Ming Zhou, Principal Researcher of Natural Language Processing at Microsoft Research Asia, and Dr. Hansheng Wang, head of the statistics department at Peking University, also talked about their recent work and research projects.

[Photos: Hadley Wickham and David Smith]

In the afternoon, all attendees listened enthusiastically to the company lightning talks and the discussion forums on Big Data and data science education.

[Photo: the discussion forum]

The second day’s meeting was divided into 3 parallel sessions: A on visualization, B on Big Data, and C on R integration and other topics:

  • Session A (visualization) covered ggvis by Hadley Wickham, recharts by its author Dr. Zhou, and more.
  • Session B (Big Data) covered RHadoop, R-Web, rARPACK for SVD on ultra-large matrices, and the big data industry in China.
  • Session C (R integration and others) covered R with Python, R with Office, R in advertising, R in data mining, and R in biology, psychology, and pharmacy research.

Though based in Beijing, the 7th China R Conference attracted people from all over mainland China, Hong Kong, Taiwan, and even outside China. The registration information shows that R users in China mostly work in networks, IT, biology, finance, and education. They are mostly interested in data mining and machine learning, data visualization, and data solutions.

[Chart: R users by industry]

The number of R users attending the China R Conference in Beijing is growing amazingly fast! What about next year?

[Chart: R user attendance]

Install RStudio Server on CentOS 6.5

My system is 64-bit CentOS 6.5. The 64-bit preview version of RStudio Server at the time was 0.98.766, and the following error appeared when installing it:

[root@supstat download]# rpm -ivh rstudio-server-0.98.766-x86_64.rpm
error: Failed dependencies:
    libcrypto.so.6()(64bit) is needed by rstudio-server-0.98.766-1.x86_64
    libgfortran.so.1()(64bit) is needed by rstudio-server-0.98.766-1.x86_64
    libssl.so.6()(64bit) is needed by rstudio-server-0.98.766-1.x86_64

Following an answer on Stack Overflow, install the missing dependencies:

yum install libcrypto.so.6 -y
yum install libgfortran.so.1 -y
yum install libssl.so.6 -y
yum install openssl098e-0.9.8e -y
yum install gcc41-libgfortran-4.1.2 -y
yum install pango-1.28.1 -y
 
wget ftp://rpmfind.net/linux/centos/6.5/os/x86_64/Packages/compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
rpm -Uvh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
rpm -Uvh --nodeps rstudio-server-0.98.766-x86_64.rpm

Completion of the above steps is still not enough. Running rstudio-server verify-installation still reports errors.

The problem is missing library files, even though a file like those named in the error report does exist under /usr/lib. So let’s check the library files under /usr/lib64:

[root@supstat lib64]# ll libcrypto*
lrwxrwxrwx 1 root root      19 Apr  9 12:15 libcrypto.so -> libcrypto.so.1.0.1e
lrwxrwxrwx 1 root root      19 Apr  9 12:15 libcrypto.so.10 -> libcrypto.so.1.0.1e
-rwxr-xr-x 1 root root 1950976 Apr  8 10:42 libcrypto.so.1.0.1e
[root@supstat lib64]# ll libssl*
-rwxr-xr-x. 1 root root 250168 Feb 11 21:01 libssl3.so
lrwxrwxrwx  1 root root     16 Apr  9 12:15 libssl.so -> libssl.so.1.0.1e
lrwxrwxrwx  1 root root     16 Apr  9 12:15 libssl.so.10 -> libssl.so.1.0.1e
-rwxr-xr-x  1 root root 441112 Apr  8 10:42 libssl.so.1.0.1e

We can see that neither libcrypto.so.6 nor libssl.so.6 exists.

So we try creating symbolic links for the library files:

cd /usr/lib64
ln -s libssl.so.10 libssl.so.6
ln -s libcrypto.so.10 libcrypto.so.6

Now rstudio-server verify-installation passes.
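
Before creating links by hand, it can help to confirm exactly which library names are absent. The following is only a minimal sketch: `missing_libs` is a hypothetical helper written for this illustration (it is not part of RStudio Server or CentOS), and the directory and file names in the usage line are simply the ones from the steps above.

```shell
#!/bin/sh
# Sketch: print each of the given library names that is absent from a directory.
# missing_libs is a hypothetical helper, not an RStudio Server or CentOS tool.
missing_libs() {
    dir=$1
    shift
    for lib in "$@"; do
        # -e follows symlinks, so a dangling link is also reported as missing
        [ -e "$dir/$lib" ] || echo "$lib"
    done
}

# Usage (path and names assumed from the steps above):
# missing_libs /usr/lib64 libssl.so.6 libcrypto.so.6
```

An empty result after the links are created suggests that both names now resolve to real files.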

Firewall settings

If you enter http://<server IP>:8787 in the browser, you will find that the page cannot be reached. This is because the built-in firewall policy in CentOS does not open port 8787. Next, modify the firewall configuration file:

vi /etc/sysconfig/iptables

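Add the second rule below (port 8787) right after the existing rule for port 22:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8787 -j ACCEPT

Then restart the firewall with service iptables restart, and you can access it!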

[Screenshot: RStudio Server login page]
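
The manual firewall edit above can also be scripted. This is a rough sketch under stated assumptions: `add_port_rule` is a hypothetical helper, it assumes the rules file already contains the stock port 22 rule, and it prints the result to standard output instead of editing the file in place, so the output can be reviewed first.

```shell
#!/bin/sh
# Sketch: emit a copy of an iptables rules file with an ACCEPT rule for PORT
# placed right after the port 22 rule. add_port_rule is a hypothetical helper.
# If the file has no port 22 rule, the output is the file unchanged.
add_port_rule() {
    file=$1
    port=$2
    rule="-A INPUT -m state --state NEW -m tcp -p tcp --dport $port -j ACCEPT"
    if grep -qxF -e "$rule" "$file"; then
        cat "$file"    # rule already present: output the file as-is
    else
        awk -v rule="$rule" '
            { print }
            / --dport 22 / && !added { print rule; added = 1 }
        ' "$file"
    fi
}

# Usage (as root; review the output before replacing the real file):
# add_port_rule /etc/sysconfig/iptables 8787 > /tmp/iptables.new
```

Writing to a separate file first keeps a mistake from locking you out of the server.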

User Settings

Sometimes we want to restrict which users can access RStudio Server. To do so, edit the configuration file /etc/rstudio/rserver.conf and add the following line:

auth-required-user-group=rstudio_users
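
If this setup is repeated on several machines, the edit can be made safe to re-run. The sketch below is only an illustration: `ensure_setting` is a hypothetical helper, and it assumes the config file uses plain key=value lines, as in the line above.

```shell
#!/bin/sh
# Sketch: append key=value to a config file unless the key is already set.
# ensure_setting is a hypothetical helper written for this illustration.
ensure_setting() {
    file=$1
    key=$2
    value=$3
    if grep -q "^$key=" "$file" 2>/dev/null; then
        return 0    # key already configured: leave the file untouched
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Usage (as root; path and group name taken from the text above):
# ensure_setting /etc/rstudio/rserver.conf auth-required-user-group rstudio_users
```

The server most likely needs a restart (rstudio-server restart) before the new setting is picked up.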

Then create the group rstudio_users and the user supstat from the command line:

groupadd rstudio_users
useradd supstat 
usermod -a -G rstudio_users supstat 
 
# user password setting
passwd supstat
 
# to grant sudo permission to some users, see http://www.getroad.cn/blog/?action=show&id=801
 
chown -R supstat:rstudio_users /home/supstat

Then we can log in to RStudio Server with the supstat account:

[Screenshot: RStudio Server]

View the process:

[root@supstat R]#ps aux|grep rstudio-server
498      19292  0.1  0.3 212788  3980 ?        Ssl  07:30   0:00 /usr/lib/rstudio-server/bin/rserver
supstat  19307  0.5  4.3 568932 43956 ?        Sl   07:31   0:03 /usr/lib/rstudio-server/bin/rsession -u supstat
root     19414  0.0  0.0 103248   864 pts/0    R+   07:43   0:00 grep rstudio-server

We can see two rstudio-server-related processes: one is the system-wide rserver, and the other is the rsession belonging to the user supstat.

System configuration and system management

 

For installation on Ubuntu, you can refer to Zhang Dan’s blog.

 

This article is a SupStat original article and shall not be reproduced without permission. For reprint please email contact#supstat.com(# replaced by @)

Node.js Express 3.0 Development Basics

Introduction
With Node.js, JavaScript has become an ideal tool for server applications, and with jQuery it is an excellent tool for browser development. We recently learned Express 3.0 for Node.js. The book we were following covered Express 2.x, so we ran into some difficulties with Express 3.0. We would love to share how we overcame them.

 

<<Node.js toolkit>> ©2014, Conan Zhang and Vivian S. Zhang. All rights reserved. The corresponding post in Chinese can be found at Conan Zhang’s blog. Please contact vivian.zhang@supstat.com if you are interested in publishing it in English.

<<Node.js toolkit>> introduces you to using JavaScript as a server-side scripting language and the Node.js framework for developing websites. Node.js is built on V8, one of the fastest JavaScript engines. The Chrome browser is also based on V8, and it stays smooth even with 20 to 30 pages open at once. Express, the standard web development framework for Node.js, helps us build websites quickly. Developing a website with Node.js is more efficient than with PHP and has a gentler learning curve. It is ideal for building small and personalized websites. We want to introduce a lot of handy tools that reduce your workload and make it easy to build elegant, beautiful sites.

 

Content

We will focus on the Express 3.0 framework and related tools such as Mongoose, EJS, and Bootstrap.

  1. Build a project
  2. Directory Structure
  3. Express3.0 Configuration
  4. Ejs template
  5. Bootstrap Framework
  6. Routing Function
  7. Use of Session
  8. Page Notification
  9. Page Visit Control


Series of our new book — Node.js Toolkit


 

Book Content:

Chapter one: get started

SVD and Image Compression (by Shiny! Yes!)

We have a brilliant Shiny application demonstrating image compression with Singular Value Decomposition. You can quickly get a sense of SVD from it: https://yihui.shinyapps.io/imgsvd/.

ImgSVD was made by Nan Xiao, a Data Analyst Intern at SupStat, and Yihui Xie, a Software Engineer at RStudio and an advisory Data Scientist at SupStat Inc.

The interface looks like the following:

https://yihui.shinyapps.io/imgsvd/

Singular Value Decomposition (SVD) is a complex technique. It is a matrix factorization that is widely used in dimensionality reduction. Its result can be used to approximate a matrix, and it is flexible because the parameter k determines the degree of approximation. Image compression is likewise a matter of approximation, so if we regard an image as a matrix, we can use SVD to compress it. The following demonstrates the result of our algorithm.
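
In symbols, the idea can be sketched as follows. This is the standard textbook formulation, not something taken from the app's source code; the image is assumed to be a single-channel m × n matrix A of rank r.

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \quad k \le r,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .
```

Storing A_k takes k(m + n + 1) numbers instead of mn, so a small k means strong compression. By the Eckart-Young theorem, A_k is the best rank-k approximation of A in the Frobenius norm.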

Original image: [image: svd11]

When we set k=1: [image: svd12]


Upcoming NYC R Programming Classes

It is our pleasure to offer the intensive beginner-level R course for the third time! Beginning this Sunday, the 35-hour course will walk you through the basic operations and characteristics of R, all the way to a firm understanding of data manipulation and visualization. Also launching this weekend are two brand-new courses, Data Visualization for D3 and Data Science for Python, both at the beginner level.

R users will rule the world :) Make sure to sign up today

Taught by preeminent data scientists in New York City, these beginner NYC Data Science Academy courses are the best introduction to the exciting world of R, open data, and statistical science.

If interested, please read the course descriptions below and RSVP today!
