
Conference Introduction

 

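The 10th China-R Conference (Taiyuan) and Shanxi Province Big Data Industry Innovation and Development Forum: Introduction to the Main-Session Talks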

 
 

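R is a programming language and development environment widely used in statistics and data science. Because it is free, open source, and flexible, it has been attracting growing attention. The China-R Conference grew out of discussions about the R language. In 2008, the Capital of Statistics (COS, 统计之都) held the first China-R Conference at Renmin University of China. The conference has grown steadily since then and is now held in multiple cities across the country, with more than 20,000 cumulative participants and more than 3,000 participating organizations. Its program covers applications of data science across many sectors, including healthcare, biology, finance, industry, automation, and the internet. It has had a far-reaching influence and has promoted the adoption and development of R, and of data science as a whole, in China.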

 

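In 2017 the China-R Conference marks its tenth year. In this memorable year, the conference also comes to Taiyuan, the "Dragon City". The College of Mathematics and the College of Big Data at Taiyuan University of Technology, together with the Capital of Statistics, will jointly host the 10th China-R Conference (Taiyuan) on June 24 and 25, 2017. We sincerely invite academic experts, industry leaders, and technical specialists from China and abroad to share the stage, tell your data science stories, and join us in marking ten years of the China-R Conference.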

 

The main-session schedule of this year's China-R Conference (Taiyuan) is as follows:

 

Date: June 24, 2017

Venue: Lecture Hall, 9th floor, College of Big Data

 

 

Below we introduce the main-session talks of this year's China-R Conference (Taiyuan).

 

 

Yuehua Cui (崔跃华)

Michigan State University, USA

Gene selection with nonlinear instrumental variable regression incorporating network structures

 

 

Genetical genomics data provide promising opportunities for integrative analysis of gene expression and genotype data. Lin et al. (2015) recently proposed an instrumental variables (IV) regression framework to select important genes with high-dimensional genetical genomics data. IV regression addresses the endogeneity caused by potential correlation between gene expressions and the error terms, and hence improves the performance of gene selection. As genes function in networks to fulfill their joint task, incorporating network or graph structures in a regression model can further improve gene selection performance. Furthermore, gene expressions can be nonlinearly regulated or modified by environmental variables. In this work, we propose a graph-constrained penalized nonlinear IV regression framework to address the endogeneity issue and to improve selection performance by taking gene network structures into account. We propose a two-step estimation procedure that adopts a network-constrained regularization method to obtain better variable selection and estimation, and we further establish selection consistency. Simulation and real data analysis are conducted to show the utility of the method.

This is joint work with Bin Gao and Xu Liu.
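To make the endogeneity problem mentioned in the abstract concrete, the following is a minimal sketch of the classical single-instrument linear IV estimator, which is far simpler than the penalized nonlinear framework of the talk. The variable names and toy data are hypothetical: `v` is an unobserved confounder that enters both the regressor `x` and the response `y`, and `z` is an instrument constructed to be uncorrelated with `v`.

```python
from statistics import mean


def covariance(a, b):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Ordinary least squares slope of y on x: cov(x, y) / var(x)."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV slope: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


# Hypothetical toy data: the true effect of x on y is 2.
z = [1.0, -1.0, 1.0, -1.0]   # instrument, uncorrelated with v by construction
v = [1.0, 1.0, -1.0, -1.0]   # unobserved confounder (source of endogeneity)
x = [zi + vi for zi, vi in zip(z, v)]
y = [2.0 * xi + 3.0 * vi for xi, vi in zip(x, v)]

print("OLS slope:", ols_slope(x, y))    # biased upward by the confounder
print("IV slope: ", iv_slope(z, x, y))  # recovers the true effect of 2
```

In this toy example the OLS slope is about 3.5 because `x` is correlated with the error term `3 * v`, while the IV slope equals the true coefficient 2.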

 

Junhui Wang (王军辉)

City University of Hong Kong

A smooth collaborative recommender system

 

 

In recent years, there has been a growing demand for efficient recommender systems that track users' preferences and recommend potential items of interest. In this talk, I will present a smooth collaborative recommender system that uses dependency information among users and items sharing similar characteristics under the singular value decomposition framework. The proposed method incorporates the neighborhood structure among user-item pairs by exploiting covariates to improve prediction performance. One key advantage of the proposed method is that it leads to more effective recommendations for "cold-start" users and items, whose preference information is completely missing from the training set. As this type of data involves large-scale customer records, an efficient scheme will be proposed to achieve scalable computing. The advantage is confirmed in a variety of simulated experiments as well as in one large-scale real example on Last.fm music listening counts. If time permits, the asymptotic properties will also be discussed.

 

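Dennis K. J. Lin (林共进)

The Pennsylvania State University, USA

Big Data: It Is Not About the Size of the Data or the Data Itself; What Matters Is Statistical Thinking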

 

 

 

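To cook a good dish you need three things: fine ingredients, good cookware, and a good cooking method. Research on big data works the same way. To do big data work well, your ingredients are the data, your cookware is the computer, and the most important element, the cooking method, is statistical thinking. It is worth noting that ingredients and cookware can be bought by anyone willing to pay; only the cooking method is truly your own, and you have to learn it yourself. Through several simple case studies, this talk will teach you to cook a few good big data dishes and will offer some personal views on the future of big data, especially the future development of the Internet of Things.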

 

Weixing Song (宋卫星)

Kansas State University

Tweedie-Type Formulae and Regression Calibration

 

 

Regression calibration is one of the most commonly used bias-reduction techniques in measurement error modelling. However, Tweedie's formula, originally discovered for normal measurement errors, has never been used for regression calibration; instead, many approximate algorithms have been developed for the same purpose. In this talk, we shall introduce a set of Tweedie-type formulae not only for multivariate normal measurement error, but also for multivariate Laplace measurement error, a typical ordinary smooth case. Potential applications of these Tweedie-type formulae in parametric and semiparametric regression models, and in neural networks with measurement errors, will also be discussed.
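For readers unfamiliar with it, the classical univariate Tweedie formula for normal measurement error states that if W = X + U with U ~ N(0, sigma2) independent of X, then E[X | W = w] = w + sigma2 * d/dw log f_W(w), where f_W is the marginal density of W. The sketch below checks this numerically in an assumed toy setting where X is also normal, so the conditional mean is known in closed form. The parameter values and function names are illustrative only and are not taken from the talk.

```python
def tweedie_calibrate(w, sigma2, score):
    """Tweedie's formula: E[X | W = w] = w + sigma2 * d/dw log f_W(w)."""
    return w + sigma2 * score(w)


# Assumed toy setting: X ~ N(mu, tau2), U ~ N(0, sigma2), W = X + U.
mu, tau2, sigma2 = 1.0, 4.0, 1.0


def normal_score(w):
    """Score of the marginal W ~ N(mu, tau2 + sigma2)."""
    return -(w - mu) / (tau2 + sigma2)


def normal_conditional_mean(w):
    """Textbook conditional mean E[X | W = w] for the normal-normal model."""
    return mu + tau2 / (tau2 + sigma2) * (w - mu)


w = 3.0
print(tweedie_calibrate(w, sigma2, normal_score))  # both values are about 2.6
print(normal_conditional_mean(w))
```

The appeal of the formula for regression calibration is that it needs only the marginal density of the observed W, which can be estimated from data, rather than the distribution of the unobserved X.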

 

Qing Lu (吕庆)

Michigan State University, USA

A generalized association test based on U statistics

 

 

Sequencing-based studies are emerging as a major tool for genetic association studies of complex diseases. These studies pose great challenges to traditional statistical methods because of the high dimensionality of the data and the low frequency of genetic variants. Moreover, there is great interest in biology and epidemiology in identifying genetic risk factors contributing to multiple disease phenotypes. The multiple phenotypes can often follow different distributions, which brings an additional challenge to the current statistical framework. In this talk, I will introduce a generalized similarity U test, referred to as GSU. GSU is a similarity-based test that can handle high-dimensional genotypes and phenotypes. We studied the properties of GSU and provided an efficient p-value calculation for the association test. Through simulation, we found that GSU had advantages over existing methods in terms of power and robustness to phenotype distributions.
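As background on the building block named in the title, a U statistic averages a symmetric kernel over all pairs of observations. The sketch below is not the GSU test itself. It is a generic illustration with a hypothetical sample and the kernel h(a, b) = (a - b)^2 / 2, for which the U statistic coincides with the unbiased sample variance.

```python
from itertools import combinations
from statistics import variance


def u_statistic(sample, kernel):
    """Average a symmetric two-argument kernel over all unordered pairs."""
    pairs = list(combinations(sample, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def half_squared_diff(a, b):
    """Kernel whose U statistic is the unbiased sample variance."""
    return 0.5 * (a - b) ** 2


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical sample
print(u_statistic(data, half_squared_diff))  # about 3.3
print(variance(data))                        # about 3.3, the same quantity
```

Similarity-based tests follow the same pattern, replacing the kernel with a product of pairwise similarity measures; the exact similarity measures used by GSU are described in the talk and are not reproduced here.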

 

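Hao Wang (汪浩)

Jiangxi Normal University

Statistical Principles of BCPNN and GPS Signal Detection for Adverse Drug Reactions and Their Implementation in R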

 

 

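The Bayesian Confidence Propagation Neural Network (BCPNN) and the Gamma Poisson Shrinker (GPS) are the adverse drug reaction signal detection algorithms adopted by the World Health Organization (WHO) and the U.S. Food and Drug Administration (FDA), respectively. The R package PhViD provides a partial implementation of both algorithms, but it does not report the indicators that signal detection practitioners commonly use, such as IC, EBGM, and EBGM05. After studying the literature on these two algorithms, we used R and Mathematica to implement both algorithms in full, together with all of their indicators. We then computed adverse drug reaction signals from 2004 to 2016 for the Jiangxi Provincial Center for Adverse Drug Reaction Monitoring. This talk will discuss the statistical principles of the two algorithms and the details of their implementation.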
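As a rough illustration of the IC indicator, the information component compares the observed number of reports for a drug-event pair with the number expected if drug and event were reported independently, on a log2 scale. The sketch below computes only this raw observed-to-expected form from hypothetical report counts. The BCPNN method proper estimates IC through a Bayesian posterior that shrinks estimates based on small counts, and that shrinkage is not reproduced here.

```python
from math import log2


def information_component(n11, n1_, n_1, n):
    """Raw (unshrunk) IC: log2 of observed over expected report count.

    n11: reports mentioning both the drug and the event
    n1_: reports mentioning the drug
    n_1: reports mentioning the event
    n:   all reports in the database
    """
    expected = n1_ * n_1 / n  # expected count under independence
    return log2(n11 / expected)


# Hypothetical counts, for illustration only.
ic = information_component(n11=20, n1_=100, n_1=200, n=10000)
print(ic)  # positive: the pair is reported more often than expected
```

An IC near zero indicates that the pair is reported about as often as independence would predict, while a clearly positive IC flags a potential signal for further review.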


Time: 06-24 09:15 - 16:40
Venue: Lecture Hall, 9th floor, College of Big Data, Taiyuan University of Technology
