Research

Research areas of interest include
IT Infrastructures
IT Architecture & Design
Scientific Computing
e-Science applications
Computational simulations
Cross-disciplinary scientific applications
Interoperability applications of HPC, HTC, and big data
High productivity processing around large-scale datasets
Big data analytics, analysis, methods, and tools
Big Data Analytics - Outlier Detection with Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Morris Riedel, Markus Goetz, Christian Bodenstein
Parallel & Scalable MPI/OpenMP Implementation
More Information
[ Freely Available HPDBSCAN Software @ Juelich Supercomputing Centre ]
Abstract:
The open source DBSCAN implementation is freely available for download, and support can be requested from the authors. It is based on parallelization techniques using MPI/OpenMP so that it scales to 'big data' on massively parallel machines.
-
[Related Work]
Priyamvada Paliwal, Meghna Sharma
Enhanced DBSCAN Outlier Detection
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 3, March 2013, ISSN: 2277 128X
[ Paper ~0.41 MB (pdf) ]
Abstract:
Most real-world databases include a certain amount of exceptional values, generally termed "outliers". The DBSCAN algorithm can identify clusters in large spatial data sets by looking at the local density of database elements, using only one input parameter. This paper presents a comprehensive study of outlier detection and the DBSCAN algorithm. The salient point of this paper is to present an enhanced DBSCAN algorithm together with its implementation and complexity. Additional features of this algorithm for finding outliers are also described.
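To make the density idea concrete, the following is a minimal single-process sketch of plain DBSCAN in Python. It is neither the enhanced algorithm from the paper nor the parallel HPDBSCAN code. The names `dbscan`, `eps` (neighbourhood radius) and `min_pts` (density threshold, fixed here at a default of 4) are illustrative assumptions, and the region query is a brute-force scan.

```python
from math import dist

NOISE = -1  # label used for outliers


def dbscan(points, eps, min_pts=4):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force region query; the result includes point i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Two dense groups and one isolated point (made-up data).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
    (5.0, 30.0),
]
labels = dbscan(points, eps=1.5)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
```

Points that end up with the `NOISE` label belong to no dense region and are the candidates reported as outliers. In this toy data set that is only the isolated point.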
-
Big Data Analytics - Overview Slides
Morris Riedel
Lecture 1 - Support Vector Machines
Online Material
[ Slides ~0.82 MB (pdf) ]
Abstract:
Support Vector Machines (SVMs) are one concrete technique to perform big data analytics. They not only provide a high average rate of success, but also have a relatively fast training time compared to other classification techniques (e.g. Bayes Nets, Classification Trees). This online material summarizes key elements of SVMs and provides insights into their various application areas.
-
Morris Riedel
Lecture 2 - Data Mining & Analysis Process Models
Online Material
[ Slides ~0.61 MB (pdf) ]
Abstract:
A data analytics project should be supported by a well-defined underlying process. A wide variety of data mining and analysis process models are in use. This online material summarizes key elements of the so-called Cross-Industry Standard Process for Data Mining (CRISP-DM) and its six well-defined phases (business understanding, data understanding, data preparation, modeling, evaluation, deployment).
-
Morris Riedel
Lecture 3 - Data Analytics for Personalized Medicine
Online Material
[ Slides ~1.18 MB (pdf) ]
Abstract:
Traditional clinical diagnosis offers a reactive approach to treating illnesses, so that medication usually starts after signs/symptoms appear. Different types of data on the individual patient's clinical signs/symptoms are taken into account, including medical/family history, data from laboratories, or imaging evaluations. In contrast, this online material provides information about data analytics in personalized medicine, a newer medical model built on the idea of customizing healthcare, meaning that medical decisions, practices, and products are 'tailored to individual patients'. It describes the use of genetic information, which plays a key role in several aspects of personalized medicine.
-
Morris Riedel
Lecture 4 - Research Challenges
Online Material
[ Slides ~0.62 MB (pdf) ]
Abstract:
The emergence of 'Big Data' and the application of 'analytics techniques' in this context raise new research challenges. This collection of research challenges provides a rough overview of identified problems as well as approaches to solutions. It points to various analytics techniques in context, including predictive modeling, classifiers, and the application of machine learning algorithms.
-
Morris Riedel
Lecture 5 - Tools and Techniques
Online Material
[ Slides ~0.47 MB (pdf) ]
Abstract:
Big data analytics requires tools for pre-processing the data, applying algorithms to the data, or supporting the post-processing of the data. This lecture gives an overview of relevant data analysis tools (e.g. statistical computing with R), packages (e.g. RMPI) and techniques (statistics, data mining, machine learning, evolutionary optimization, etc.). Selected application use cases are given in context with further references.
-
Morris Riedel
Lecture 6 - Outlier Detection
Online Material
[ Slides ~0.61 MB (pdf) ]
Abstract:
Several methods in big data analytics are used to perform outlier detection, where an 'outlier' is an observation that appears to deviate markedly from other observations. This lecture surveys a wide variety of approaches (e.g. distance-based, density-based) and the corresponding outlier detection algorithms and techniques. Concrete examples using statistical computing with R are given in context.
-
Morris Riedel
Big Data Analytics Tools - FactSheets
Online Material
[ LIBSVM Library - Support Vector Machine (SVM) Library Tool ~0.14 MB (pdf) ]
[ RMPI Package - Parallel Processing within the R Statistical Computing Tool ~0.17 MB (pdf) ]
[ WEKA - Java Machine Learning/Data Mining Algorithms & Tools ~0.18 MB (pdf) ]
Abstract:
Big data analytics requires tools for pre-processing the data, applying algorithms to the data, or supporting the post-processing of the data. This collection of factsheets provides quick overviews meant as guidance for users, and each factsheet refers to further sources of information. The tools have been created by many researchers world-wide and are used in many data analytics use case applications.
-
Morris Riedel
Big Data Analytics KnowHow - TensorFlow - Top 3 Facts
  • TensorFlow is a programming system designed to help researchers build deep neural networks.
  • It enables one to easily 'script' a dataflow computation where the basic units of computing are very large multi-dimensional arrays (aka Tensors).
  • The computations you build with tensors are compiled into graphs that are executed according to the 'dataflow' paradigm.
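The 'build a graph first, execute it later' idea behind the facts above can be illustrated without TensorFlow itself. The following is a toy sketch in plain Python and deliberately does not use the TensorFlow API. The names `Node`, `constant`, `add`, `mul` and `run` are hypothetical, and flat Python lists stand in for tensors.

```python
class Node:
    """One vertex of a dataflow graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(values):
    return Node("const", value=list(values))


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, cache=None):
    """Evaluate a node only after its inputs, so data flows along the edges."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        left, right = (run(n, cache) for n in node.inputs)
        if node.op == "add":
            result = [x + y for x, y in zip(left, right)]
        elif node.op == "mul":
            result = [x * y for x, y in zip(left, right)]
        else:
            raise ValueError(f"unknown op: {node.op}")
    cache[id(node)] = result
    return result


x = constant([1.0, 2.0, 3.0])
w = constant([0.5, 0.5, 0.5])
b = constant([1.0, 1.0, 1.0])
y = add(mul(x, w), b)  # only builds the graph; nothing is computed yet
result = run(y)        # executes the graph: [1.5, 2.0, 2.5]
```

Separating graph construction from execution is what allows a real system to optimise the graph and distribute it across devices before any data flows through it.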

-

Big Data Analytics - Generic Frameworks - Related Work
-
A. Mamatha, Polepalli Krishna Reddy, Mittapally Kumara Swamy, G. Sreenivas, D. Raji Reddy
A Framework to Improve Reuse in Weather-Based Decision Support Systems
in Proceedings of the Third International Conference on Big Data Analytics (BDA) 2014, New Delhi, India, December 20-23, 2014, LNCS 8883, Springer
[ Proceedings Chapter ~2.4 MB (pdf) ]
Abstract:
Systems for weather observation and forecasting are operated to help mankind deal with adverse weather in general. Weather-based decision support systems (DSSs) are being built to improve the efficiency of production systems in the domains of health, agriculture, livestock, transport, business, planning, governance and so on. The weather-based DSS provides appropriate suggestions based on the weather condition of the given period for the selected domain. In the literature, the notion of reuse is being employed in improving the efficiency of DSSs. In this paper, we have proposed a framework to identify similar weather conditions, which could help in improving the performance of weather-based DSSs with better reuse. In the proposed framework, the range of each weather variable is divided into categories based on its influence on that domain. We form a weather condition for a period, which is the combination of category values of weather variables. By comparing the daily/weekly weather conditions of a given year to weather conditions of subsequent years, the proposed framework identifies the extent of reuse. We have conducted the experiment by applying the proposed framework on 30 years of weather data of Rajendranagar, Hyderabad and using the categories employed by the India Meteorological Department in the meteorology domain. The results show that there is a significant degree of similarity among daily and weekly weather conditions over the years. The results provide an opportunity to improve the efficiency of weather-based DSSs by improving the degree of reuse of the developed suggestions/knowledge for the corresponding weather conditions.
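The categorisation step described in the abstract can be sketched as follows. This is an illustration only. The category boundaries in `BOUNDS`, the variable names, the sample readings and the same-calendar-day matching rule in `reuse_fraction` are all assumptions made for this example. They are not the India Meteorological Department categories or the exact measure used in the paper.

```python
from bisect import bisect_right

# Hypothetical category boundaries per weather variable.
BOUNDS = {
    "rain_mm": [0.1, 10.0, 50.0],  # dry / light / moderate / heavy
    "tmax_c": [25.0, 35.0],        # cool / warm / hot
}


def condition(day):
    """Map one day's raw readings to a tuple of category indices."""
    return tuple(bisect_right(BOUNDS[var], day[var]) for var in sorted(BOUNDS))


def reuse_fraction(base_year, later_year):
    """Share of days whose weather condition matches the same day of base_year."""
    matches = sum(
        condition(a) == condition(b) for a, b in zip(base_year, later_year)
    )
    return matches / len(later_year)


# Made-up readings for four days in two different years.
year_a = [
    {"rain_mm": 0.0, "tmax_c": 30.0},
    {"rain_mm": 12.0, "tmax_c": 28.0},
    {"rain_mm": 0.0, "tmax_c": 36.0},
    {"rain_mm": 60.0, "tmax_c": 24.0},
]
year_b = [
    {"rain_mm": 0.0, "tmax_c": 33.0},
    {"rain_mm": 5.0, "tmax_c": 27.0},
    {"rain_mm": 0.0, "tmax_c": 38.0},
    {"rain_mm": 55.0, "tmax_c": 22.0},
]
reuse = reuse_fraction(year_a, year_b)  # 3 of 4 days match: 0.75
```

Because raw readings are reduced to a small set of category tuples, two days with different measurements can still share a weather condition, which is what makes suggestions prepared for one day reusable on another.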
-
Tags: decision support systems; simple weather and climate; data analysis
-
Big Data Analytics - Map-Reduce Paradigm - Related Work
-
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun
Map-Reduce for Machine Learning on Multicore
in Advances in Neural Information Processing Systems (19), 2007
[ Paper ~0.46 MB (pdf) ]
Abstract:
We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
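The 'summation form' mentioned in the abstract can be illustrated with ordinary least squares on a single feature, where the fit depends on the data only through a few sums. This is a minimal sketch, not the authors' implementation. The names `mapper`, `reducer` and `fit_line` are hypothetical, and Python's built-in `map` stands in for mappers that would each run on a separate core.

```python
from functools import reduce


def mapper(shard):
    """Partial sums over one shard: (n, sum x, sum y, sum x*x, sum x*y)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in shard:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reducer(a, b):
    """Combine two partial results by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(shards):
    """Least-squares fit of y = slope * x + intercept from the combined sums."""
    n, sx, sy, sxx, sxy = reduce(reducer, map(mapper, shards))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1, split into two shards.
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
shards = [data[:4], data[4:]]
slope, intercept = fit_line(shards)
```

Since addition is associative and commutative, the result does not depend on how the data is split into shards, which is why the map step can be spread across cores without changing the answer.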
-
Tags: map-reduce; machine learning; algorithms; multi-core
-
Big Data Analytics - Support Vector Machines - Related Work
-
Yoonkyung Lee and Cheol-Koo Lee
Classification of multiple cancer types by multicategory support vector machines using gene expression data
in Bioinformatics 19(9), pp. 1132-1139, 2003
[ Paper ~0.16 MB (pdf) ]
Abstract:
High-density DNA microarray measures the activities of several thousand genes simultaneously and the gene expression profiles have been used for the cancer classification recently. This new approach promises to give better therapeutic measurements to cancer patients by diagnosing cancer types with improved accuracy. The Support Vector Machine (SVM) is one of the classification methods successfully applied to the cancer diagnosis problems. However, its optimal extension to more than two classes was not obvious, which might impose limitations in its application to multiple tumor types. We briefly introduce the Multicategory SVM, which is a recently proposed extension of the binary SVM, and apply it to multiclass cancer diagnosis problems.
-
Tags: SVM; kernel methods; multi-class; classification; cancer application
-