Selected Talks

2016
Morris Riedel
TUTORIAL: Einführung in Maschinelles Lernen zur Datenanalyse
Invited Tutorial (German language)
Smart Data Innovation Conference, Karlsruhe Institute of Technology (KIT), Germany
2016-10-13
[ Lecture 1 - Grundlagen und Überblick - Slides ~1.5 MB (pdf) ]
[ Lecture 2 - Klassifikation von Daten in Anwendungen - Slides ~1.7 MB (pdf) ]
[ Event ]
Abstract:
The course teaches the fundamentals of data analysis and is aimed at participants without prior knowledge in this area. The content covers basic techniques for putting data analysis methods such as clustering, classification, or regression into context. This includes an understanding of test data, training data, and validation data. Using simple examples, problems such as overfitting are discussed, together with validation and regularization as approaches to solving them. After the course, participants will understand how to approach data analysis problems in principle. Participants will also be given an overview of which techniques and methods are available on which SDIL platforms.
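For illustration, the train/validation split and the role of regularization described above can be sketched in a few lines of Python with scikit-learn; this is only an illustrative sketch under assumed tooling, not the tutorial's hands-on material for the SDIL platforms.

    # Illustrative sketch (assumed tooling: Python + scikit-learn):
    # hold out validation data to detect overfitting, and vary the
    # regularization strength to see its effect.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic two-class data standing in for an application dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # Training data for fitting, validation data for honest assessment
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Smaller C means stronger regularization in scikit-learn
    for C in (0.01, 1.0, 100.0):
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        print(C, model.score(X_train, y_train), model.score(X_val, y_val))

A large gap between the training score and the validation score is the overfitting symptom the abstract refers to; validation and regularization are the countermeasures.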
-
Morris Riedel
Machine Learning Tutorial for Supervised Classification using Support Vector Machines
Invited Tutorial
University of Barcelona
2016-07-06 - 2016-07-08
[ Lecture 1 - Machine Learning Fundamentals - Slides ~1.6 MB (pdf) ]
[ Lecture 2 - Supervised Classification - Slides ~1.6 MB (pdf) ]
[ Lecture 3 - Support Vector Machines - Slides ~0.9 MB (pdf) ]
[ Lecture 4 - Applications and Serial Computing Limits - Slides ~1.4 MB (pdf) ]
[ Lecture 5 - Kernel Methods - Slides ~1.6 MB (pdf) ]
[ Lecture 6 - Applications and Parallel Computing Benefits - Slides ~1.3 MB (pdf) ]
[ Event ]
Abstract:
The goal of this tutorial is to introduce participants to one concrete and widely used machine learning technique for analyzing complex datasets in scientific and engineering applications. After a brief introduction to the general approach of using machine learning, data mining, and statistics in data analysis, we start with the ‘supervised classification’ approach, in which groups of datasets already exist and new data is checked in order to understand to which existing group it belongs. As one of the best out-of-the-box methods for classification, the tutorial then focuses on the Support Vector Machine (SVM) algorithm, including selected kernel methods. The serial LibSVM implementation will be used to illustrate the limits of serial computing (e.g. memory limits, slow computing time, etc.), and a parallel and scalable SVM implementation, based on MPI, will be used during the hands-on session with a couple of challenging datasets. Learning outcomes are: (a) learn the fundamentals of machine learning, (b) obtain a basic understanding of supervised classification, and (c) learn support vector machines and apply them to datasets.
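For illustration, scikit-learn's SVC class, which wraps the serial LibSVM mentioned above, allows a minimal kernel-SVM classification to be sketched as follows; this sketch is an assumption for demonstration, not the parallel MPI-based code used in the hands-on session.

    # Illustrative sketch: supervised classification with a kernel SVM.
    # sklearn.svm.SVC wraps serial LibSVM and therefore shows the same
    # serial limits (memory, runtime) on large datasets.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # RBF kernel; C and gamma are the usual model-selection parameters
    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))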
-
2015
Morris Riedel
Scientific Big Data Analytics - Pointers to Collaboration Opportunities
4th Joint Laboratory for Extreme Scale Computing (JLESC) Workshop
Bonn, Germany, 2015
2015-12-03
[ Slides ~1.5 MB (pdf) ] [ Event ] [ Juelich ]
Abstract:
The goal of this talk is to inform participants about the concept of scientific big data analytics driven by HPC. Two concrete and widely used data analytics techniques that are suitable to analyse ‘big data’ for scientific and engineering applications will be introduced. From the broad class of available clustering methods we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the hierarchical data format (HDF), will be discussed in the context of interesting scientific datasets. As one of the best out-of-the-box methods for classification, the support vector machine (SVM) algorithm, including kernel methods, will be a focus. A parallel and scalable SVM implementation, based on MPI, will be described in detail by using a couple of challenging scientific datasets and smart feature extraction methods.
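For illustration, the DBSCAN idea of density-based clusters with explicitly labelled noise can be sketched serially with scikit-learn; this is a demonstration stand-in (assumed tooling), not the parallel MPI/OpenMP implementation discussed in the talk.

    # Illustrative sketch: density-based clustering with DBSCAN.
    # Points labelled -1 are noise, i.e. the outliers/anomalies the
    # abstract mentions; real runs would read HDF data instead.
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=1)

    # eps is the neighbourhood radius, min_samples the density threshold
    labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print('clusters found:', n_clusters)
    print('noise points:', int(np.sum(labels == -1)))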
-
Morris Riedel
High Performance Computing for Science & Engineering
Iceland Science Day 2015
Reykjavik, Iceland, 2015
2015-10-31
[ Slides ~0.94 MB (pdf) ] [ Event ] [ Juelich ]
Abstract:
This talk surveys the research and societal impact of High Performance Computing (HPC), which refers to techniques that deliver much higher performance than personal computers in order to solve problems in science and engineering; the ‘simulation sciences’ use HPC systems with 500,000 cores today.
-
Morris Riedel
Applications of Clustering for Large-Scale Datasets
Research Data Alliance (RDA) Sixth Plenary Meeting, Big Data Interest Group Session
CNAM, Paris, France, 2015
2015-09-25
[ Slides ~0.6 MB (pdf) ] [ Juelich ]
Abstract:
The ultimate goal of the RDA Big Data Interest Group is to produce a set of recommendation documents to advise diverse research communities with respect to: (a) 'How to select an appropriate Big Data solution for a particular science application with optimal value?' and (b) 'What are the best practices in dealing with various data and computing issues associated with such a solution?'. This talk informs the RDA community about a particular solution for big data analysis and analytics for large-scale clustering problems. The solution is evaluated through selected applications and linked publications.
-
Morris Riedel
Scalable Developments for Big Data Analytics in Remote Sensing
IEEE International Geoscience and Remote Sensing Symposium (IGARSS)
Milan, Italy, 2015
2015-07-28
[ Slides ~0.86 MB (pdf) ] [ Event ] [ Juelich ]
Abstract:
Big data analytics methods take advantage of techniques from the fields of data mining, machine learning, or statistics with a focus on analysing large quantities of data (aka ’big datasets’) with modern technologies. Big datasets appear in remote sensing in the sense of large volumes, but also in the sense of an ever increasing number of spectral bands (i.e. high-dimensional data). The remote sensing field has traditionally used the techniques described above for a wide variety of application fields such as classification (e.g. land cover analysis using different spectral bands from satellite data), but more recently scalability challenges occur when using traditional (often serial) methods. This paper addresses observed scalability limits when using support vector machines (SVMs) for classification and discusses scalable and parallel developments used in concrete application areas of remote sensing. Different approaches based on massively parallel methods are discussed, as well as recent developments in embarrassingly parallel methods.
-
Morris Riedel
Data Analytics - Machine Learning - Tutorial
2nd Joint Laboratory for Extreme Scale Computing (JLESC) Summer School 2015, Invited Talk
University of Barcelona
2015-07-03
[ Welcome & Statistical Learning Theory Basics Slides ~0.83 MB (pdf) ] [ Juelich ]
[ Classification Slides ~1.36 MB (pdf) ] [ Juelich ]
[ Event ]
Abstract:
The goal of this tutorial is to introduce participants to two concrete and widely used data analytics techniques for analyzing ‘big data’ for scientific and engineering applications. After a brief introduction to the general approach of using machine learning, data mining, and statistics in data analytics, we start with the ‘clustering’ technique that partitions datasets into previously unknown subgroups (i.e. clusters). From the broad class of available methods we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the hierarchical data format (HDF), will be used during the hands-on session with various interesting datasets. The second technique we cover is ‘classification’, in which groups of datasets already exist and new data is checked in order to understand to which existing group it belongs. As one of the best out-of-the-box methods for classification we focus on the Support Vector Machine (SVM) algorithm, including selected kernel methods. A parallel and scalable SVM implementation, based on MPI, will be used during the hands-on session with a couple of challenging datasets. Both HPC algorithms will be compared with solutions based on high throughput computing (i.e. map-reduce, Hadoop, Spark/MLlib, etc.) and serial approaches (R, Octave, Matlab, etc.).
-
Morris Riedel
On Parallel and Scalable Classification and Clustering Techniques for Earth Science Datasets
Sixth Workshop on Data Mining in Earth System Science (DMESS 2015), International Conference on Computational Science (ICCS)
Reykjavik, Iceland, 2015
2015-06-02
[ Slides ~1.9 MB (pdf) ] [ Event ] [ Juelich ]
Abstract:
One of the observations made in earth data science is the massive increase of data volume (e.g. higher resolution measurements) and dimensionality (e.g. hyper-spectral bands). Traditional data mining tools (Matlab, R, etc.) are becoming redundant in the analysis of these datasets, as they are unable to process or even load the data. Parallel and scalable techniques, though, bear the potential to overcome these limitations. In this contribution we therefore evaluate said techniques in a High Performance Computing (HPC) environment on the basis of two earth science case studies: (a) Density-based Spatial Clustering of Applications with Noise (DBSCAN) for automated outlier detection and noise reduction in a 3D point cloud and (b) land cover type classification using multi-class Support Vector Machines (SVMs) in multispectral satellite images. The paper compares implementations of the algorithms in traditional data mining tools with HPC realizations and ’big data’ technology stacks. Our analysis reveals that a wide variety of them are not yet suited to deal with the coming challenges of data mining tasks in earth sciences.
-
Morris Riedel
Selected Parallel and Scalable Methods for Scientific Big Data Analytics
ZIH-Kolloquium, Invited Talk
Technical University of Dresden
2015-05-21
[ Slides ~1.72 MB (pdf) ] [ Event ] [ Juelich ]
Abstract:
The goal of this talk is to inform participants about two concrete and widely used data analytics techniques that are suitable to analyse ‘big data’ for scientific and engineering applications. After a brief introduction to the general approach of using machine learning, data mining, and statistical computing in data analytics, the talk will offer details on the ‘clustering’ technique that partitions datasets into previously unknown subgroups (i.e. clusters). From the broad class of available methods we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the hierarchical data format (HDF), will be discussed in the context of interesting scientific datasets. The second technique that the talk will address is ‘classification’, in which groups of datasets already exist and new data is checked in order to understand to which existing group it belongs. As one of the best out-of-the-box methods for classification, the support vector machine (SVM) algorithm, including kernel methods, will be a focus. A parallel and scalable SVM implementation, based on MPI, will be described in detail by using a couple of challenging scientific datasets and smart feature extraction methods. Both aforementioned high performance computing algorithms will be compared with solutions based on a variety of high throughput computing techniques (i.e. map-reduce, Hadoop, Spark/MLlib, etc.) and serial approaches (R, Octave, Matlab, Weka, scikit-learn, etc.).
-
Morris Riedel
Enabling Parallel and Scalable Tools for Scientific Big Data Analytics
AixCAPE Spring Meeting 2015
Aachen, Germany
2015-05-06
[ Slides ~1.78 MB (pdf) ] [ Juelich ]
Abstract:
Selected research challenges of the Juelich Supercomputing Centre (JSC) of the Forschungszentrum Juelich in the context of ‘big data’ in general and ‘analytics’ in particular will be described. A wide variety of solutions based on high throughput computing (HTC) techniques (i.e. map-reduce, Hadoop, Spark/MLlib, etc.) and serial approaches (R, Octave, Matlab, Weka, scikit-learn, etc.) are available, but they do not fully address the requirements that arise in scientific and engineering applications. The talk therefore offers insights into why high performance computing (HPC) techniques, often driven by the simulation sciences, bear the potential for efficient and effective analytics solutions as well. After a brief introduction to the approach of using machine learning, data mining, and statistical computing in data analytics, the talk will offer details on the ‘clustering’ technique that partitions datasets into previously unknown subgroups (i.e. clusters). From the available methods the talk will focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the hierarchical data format (HDF), will be discussed in the context of interesting datasets. The second technique that the talk will address is ‘classification’, in which groups of datasets already exist and new data is checked in order to understand to which existing group it belongs. As one of the best out-of-the-box methods for classification, the support vector machine (SVM) algorithm, including kernel methods, will be a focus. A parallel and scalable SVM implementation, based on MPI, will be described by using a couple of challenging scientific datasets and smart feature extraction methods.
-
Morris Riedel
Scientific Big Data Analytics @JSC & Expressions of Interest
Wissenschaftliches Vortragsprogramm bei den Sitzungen der Rechenzeitkommission des NIC und des Wissenschaftlichen Rates des NIC
Zeuthen, Germany
2015-04-17
[ Slides ~1.4 MB (pdf) ] [ Juelich ]
Abstract:
Data analytics, management, sharing and preservation of very big, often heterogeneous or distributed data sets from experiments, observations and simulations, beyond the basic technical requirements of transfer and storage, are of increasing significance for science, research and industry. This development has been recognized by many research institutions, among them leading HPC centres, which want to advance their support for researchers and engineers using Scientific Big Data Analytics (SBDA) by HPC. The John von Neumann Institute for Computing (NIC), a joint foundation of the three Helmholtz centres Forschungszentrum Jülich, Deutsches Elektronen-Synchrotron DESY, and GSI Helmholtzzentrum für Schwerionenforschung, invites Expressions of Interest (EoI) for SBDA projects using HPC in order to identify and analyse the needs of the scientific communities. The goal is to extend and optimize the HPC and data infrastructures and services in order to provide optimal methodological support. This talk informs about the strategic importance of SBDA and gives feedback on the EoIs received.
-
Morris Riedel
Smart Data Innovation Lab - Data Innovation Community Personalised Medicine
SDIL Data Innovation Community Meeting
Bayer Technology Services, Leverkusen, Germany
2015-03-03
[ Slides ~1.5 MB (pdf) ] [ Juelich ]
Abstract:
Modern medicine, too, generates increasingly large quantities of data. Reasons for this include higher resolution data from state-of-the-art diagnostic methods like magnetic resonance imaging (MRI), IT-controlled medical technology, comprehensive medical documentation and ever more detailed knowledge about the human genome. A case in point: personalised cancer therapy. There, increasing use of software aims at taking terabytes of clinical, molecular and medication data in diverse formats and distilling from them effective treatment options for each individual patient in real time, in order to significantly improve treatment results. Within the Smart Data Innovation Lab (SDIL) Data Innovation Community “Personalised Medicine”, important data-driven aspects of personalised medicine are to be explored, such as the need-driven care of patients, IT-controlled medical technology or even web-based patient care. The Data Innovation Community “Personalised Medicine” addresses all companies and research institutions interested in conducting joint research on these aspects. This includes industrial user companies and clinics but also companies from the automation and IT industries. This talk provides an introduction to the community and a short overview of current activities.
-
Morris Riedel
European Data Infrastructure
Big Data and Extreme Scale Computing (BDEC) 2015
Barcelona Sants, Barcelona, Spain
2015-01-28
[ Slides ~1.14 MB (pdf) ] [ Juelich ]
Abstract:
With ever increasing scales towards extreme-scale high performance computing, large quantities of data continue to grow, leading to a wide variety of challenges for the computational sciences. The talk discusses solutions that run analytics and visualization in parallel, in situ, with the extreme-scale simulation in order to reduce, validate, or even simply understand the complex scientific datasets generated. Exchanging and sharing those data, replicating and archiving them for later re-use, or permanently linking the data within publications are just a few challenging areas for which the European Data Infrastructure EUDAT offers selected tools. The talk motivates this problem space towards extreme scales and outlines a potential set of tools and methods. One feasible and sustainable approach is known as a collaborative data infrastructure that encourages trust among users and, among other benefits, enables the removal of duplicate datasets.
-
Morris Riedel
Big Data in HPC, Hadoop, and HDFS
Invited Talk
Cy-Tera/LinkSCEEM HPC Administrator Workshop, The Cyprus Institute, Nicosia, Cyprus
2015-01-21
[ Part One Slides ~1.96 MB (pdf) ] [ Juelich ]
[ Part Two Slides ~1.99 MB (pdf) ] [ Juelich ]
Abstract:
One of the solutions to enable scalable 'big data' analysis and analytics is to take advantage of parallelization techniques. The talk differentiates between two paradigms: on the one hand the massively parallel paradigm known from High Performance Computing (HPC), using techniques such as the Message Passing Interface (MPI) and OpenMP, and on the other hand the map-reduce paradigm, using rather pleasantly parallel approaches. The first part of the talk focuses on 'big data in HPC' using two concrete codes as examples: (1) clustering using a parallel and scalable DBSCAN implementation and (2) classification using a parallel and scalable Support Vector Machine (SVM) implementation. The second part focuses on 'big data in Hadoop' (based on the map-reduce processing paradigm) and its Hadoop Distributed File System (HDFS), using known examples from text analysis such as word count. Throughout the material, comparisons are given, such as distributed file systems vs. parallel file systems, or configuration elements important for HPC administrators. The talk ends with future topics in the context of big data analytics (e.g. in-situ analytics in exascale computing) and big data management challenges for the reproducibility of HPC and map-reduce runs required for future publications based on open referenceable data.
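For illustration, the word-count example named above can be sketched as a Hadoop Streaming mapper/reducer pair in Python; the file names and the streaming setup are assumptions for demonstration, not the workshop material.

    # --- mapper.py (assumed file name) ---
    # Emits one "word<TAB>1" line per token read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f'{word}\t1')

    # --- reducer.py (assumed file name) ---
    # Hadoop Streaming delivers mapper output grouped by key, so the
    # counts per word can be summed in a single pass.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit('\t', 1)
        if word != current:
            if current is not None:
                print(f'{current}\t{total}')
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f'{current}\t{total}')

The same flow can be mimicked locally with 'cat input.txt | python3 mapper.py | sort | python3 reducer.py', where the sort step stands in for Hadoop's shuffle phase.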
-
2014
Morris Riedel
Arbeitsprozess und Notwendigkeit einer RDA IG am Beispiel der Interest Group Big Data Analytics
RDA – Deutschland Treffen
Deutsches GeoForschungsZentrum, Potsdam, Germany
2014-11-20, German Language
[ Slides ~1.72 MB (pdf) ] [ Juelich ]
Abstract:
The Research Data Alliance (RDA) is an organization that promotes making research data available to the broader public. Its working groups range from data interoperability questions and terminology to persistent identifiers. This talk summarizes the working methods, approach, and first results of the RDA Big Data Analytics group. The focus of this RDA group is the processing of research data through suitable frameworks and parallelization approaches in order to scale to "big data" datasets. This covers different methods (machine learning, data mining, etc.), different parallelization paradigms (HPC, HTC, etc.), and different technologies (MPI, OpenMP, map-reduce, GPGPUs, etc.) to solve the problem of a scientific application optimally.
-
Morris Riedel, Jules Wolfrat
Understanding PRACE in the light of Data Sharing and Interoperability & RDA Relevance
RDA Europe Workshop about Data Sharing and Interoperability
Auditorium Tour Madou, Place Madou 1 B – 1210 Saint Josse-Ten-Noode, Brussels
2014-10-17
[ Slides ~0.46 MB (pdf) ] [ Juelich ]
Abstract:
The Partnership for Advanced Computing in Europe (PRACE) offers scientific communities access to large-scale HPC systems based on peer-reviewed scientific cases. These cases create or use an ever increasing amount of data, raising a couple of challenges especially towards extreme-scale computing. This talk captures the top 5 user requirements from various scientific domains, including support for the use of Persistent Identifiers (PIDs), sharing of high-quality metadata and data for re-use, high performance data transfers, increasing use of statistical data analysis tools, and the need for federated authentication & authorization in order to seamlessly work with world-wide infrastructures.
-
Morris Riedel
Big Data in Science - Overview of European & International Activities
HPC PROSPECT Meeting
Forschungszentrum Juelich, Juelich Supercomputing Centre, Rotunde, Germany
2014-10-10
[ Slides ~0.46 MB (pdf) ]
Abstract:
'Big Data' is a term that is often associated with commercial use cases related to customer segmentation, recommendation systems, or the analysis of shopping data. Nevertheless, large quantities of data can also be found in scientific environments, and a wide variety of activities engage in getting insights from 'big data'. This talk provides an overview of European and international activities, covering the European Data Infrastructure EUDAT, the international Research Data Alliance (RDA), and the German Smart Data Innovation Lab (SDIL).
-
Morris Riedel
From Big Data Analytics To Smart Data Analytics With Parallelization Techniques
IEEE Meeting, Invited Talk
University of Iceland, Reykjavik, Iceland
2014-09-16
[ Slides ~1.97 MB (pdf) ] [ Juelich ]
Abstract:
The massively increasing amount of often geographically dispersed large quantities of data from experiments, observations, or computational simulations becomes ever more important for science, research, industry and governments. Scientists and engineers who analyse these massive datasets therefore require reliable infrastructures as well as scalable tools in order to perform ‘scientific big data analytics (SBDA)’ using parallelization techniques. This talk provides insights into which infrastructure types are available to take advantage of such parallel methods, including high performance computing, high throughput computing, and cloud computing approaches and capabilities. It surveys selected parallel tools that enable scalable data analysis and realistic computational simulations, also motivating the current trend towards hybrid modelling and towards scientific and engineering use cases in which large-scale computing gets more intertwined with traditional data analysis.
-
Morris Riedel
Scientific Big Data Analytics - Practice & Experience
Keynote Presentation
The International Conference on Cloud and Autonomic Computing (CAC 2014), Imperial College, London, September 8-12, 2014
2014-09-09
[ Slides ~1.73 MB (pdf) ] [ Juelich ]
Abstract:
Data transfer, storage management, sharing, curation and most notably data analysis of often geographically dispersed large quantities of data from experiments, observations, or computational simulations become ever more important for science, research, industry and governments. Scientists and engineers who analyse these massive datasets therefore require reliable infrastructures as well as scalable tools in order to perform ‘scientific big data analytics (SBDA)’. This keynote will take stock of selected scientific and engineering use cases that take advantage of parallel machine learning algorithms (e.g. classification, clustering, regression) in combination with established statistical data mining methods in the light of the new challenges of ‘big data’. It will critically review practice and experience of selected community approaches and thus address several important questions: Is big data always better data for analytics? Are big data analytics frameworks really providing the functionality they promise or scientists require? How can the scientific big data analytics process be properly structured? What is the role of the Research Data Alliance and the Open Grid Forum in this context? Do we need a peer-review process for steering scientific big data analytics applications and their evolution when using valuable storage and compute resources?
-
M. Riedel, G. Cavallaro, J.A. Benediktsson, M. Goetz, T. Runarsson, K. Jonasson, Th. Lippert
Smart Data Analytics Methods for Remote Sensing Applications
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), held jointly with the 35th Canadian Symposium on Remote Sensing
Quebec, Canada, 2014
2014-07-15
[ Slides ~1.18 MB (pdf) ] [ Juelich ]
Abstract:
The big data analytics approach can be interpreted as extracting information from large quantities of scientific data in a systematic way. To make the term more concrete, we refer to its refinement as smart data analytics: examining large quantities of scientific data to uncover hidden patterns and unknown correlations, or to extract information in cases where there is no exact formula (e.g. known physical laws). Our concrete big data problem is the classification of land cover types in image-based datasets that have been created using remote sensing technologies, because the resolution can be high (i.e. large volumes) and there are various types such as panchromatic or different bands like red, green, blue, and near infrared (i.e. large variety). We investigate various smart data analytics methods that take advantage of machine learning algorithms (i.e. support vector machines) and state-of-the-art parallelization approaches in order to overcome the limitations of big data processing with non-scalable serial approaches.
-
M. Riedel
Interest Group Big Data Analytics
Koordinationsgespräch Helmholtzgemeinschaft (HGF) – Research Data Alliance (RDA)
HGF Geschäftsstelle Berlin, German Language
2014-07-07
[ Slides ~1.05 MB (pdf) ] [ Juelich ]
Abstract:
The term "big data analytics" is often associated with the analysis of commercial data (e.g. customer segmentation, recommender systems, etc.). Many reports also mention this term in the scientific domain, but it is not entirely clear which methods, techniques, and approaches are really used in science, since science often places different demands on data analysis (e.g. causality). This talk describes activities of the Research Data Alliance (RDA) "Big Data Analytics" group, which uses concrete scientific questions to work out a clearer picture of this term in science. The talk also considers the relevance for large-scale research in the Helmholtz Association, as well as why it is important to focus within the "big data" area.
-
M. Riedel, A. Memon, M. Memon
High Productivity Data Processing Analytics Methods with Applications
Proceedings of the 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
Opatija, Croatia, 2014
2014-05-29
[ Slides ~1.94 MB (pdf) ] [ Juelich ]
Abstract:
The term ‘big data analytics’ emerged in order to engage with the ever increasing amount of scientific and engineering data using general analytics techniques that support the often more domain-specific data analysis process. It is recognized that the big data challenge can only be adequately addressed when knowledge from various fields such as data mining, machine learning algorithms, parallel processing, and data management practices is effectively combined. This paper thus describes some of the ‘smart data analytics methods’ that enable high productivity data processing of large quantities of scientific data in order to enhance the efficiency of data analysis. The paper thereby aims to provide new insights into how these various fields can be successfully combined. Contributions of this paper include the concretization of the cross-industry standard process for data mining (CRISP-DM) process model in scientific environments using concrete machine learning algorithms (e.g. support vector machines that enable data classification) or data mining mechanisms (e.g. outlier detection in measurements). Serial and parallel approaches to specific data analysis challenges are discussed in the context of concrete earth science application datasets. Solutions also include various data visualizations that enable better insight into the corresponding data analytics and analysis process.
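For illustration, the ‘outlier detection in measurements’ mechanism mentioned above can be sketched as a simple z-score test in Python/NumPy; this assumes roughly Gaussian measurement noise and is a demonstration stand-in, not the mechanism evaluated in the paper.

    # Illustrative sketch: flag measurement outliers via a z-score test.
    # Assumes roughly Gaussian noise around a stable operating point.
    import numpy as np

    rng = np.random.default_rng(0)
    measurements = rng.normal(loc=10.0, scale=0.5, size=200)
    measurements[[17, 99]] = [14.0, 4.5]  # inject two faulty readings

    z = (measurements - measurements.mean()) / measurements.std()
    outliers = np.where(np.abs(z) > 3.0)[0]
    print('outlier indices:', outliers)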
-
Morris Riedel
Research Data Alliance: Understanding Big Data Analytics Applications in Earth Science
European Geosciences Union (EGU) General Assembly 2014
Austria Center Vienna, Vienna, Austria
2014-04-29
[ Slides ~0.67 MB (pdf) ] [ Juelich ]
Abstract:
The Research Data Alliance (RDA) enables data to be shared across barriers through focused working groups and interest groups, formed of experts from around the world, from academia, industry and government. Its Big Data Analytics (BDA) interest group seeks to develop community-based recommendations on feasible data analytics approaches to address scientific community needs of utilizing large quantities of data. BDA seeks to analyze different scientific domain applications (e.g. earth science use cases) and their potential use of various big data analytics techniques. These techniques range from hardware deployment models to various algorithms (e.g. machine learning algorithms such as support vector machines for classification). A systematic classification of feasible combinations of analysis algorithms, analytical tools, data and resource characteristics and scientific queries will be covered in these recommendations. This contribution outlines initial parts of such a classification and recommendations in the specific context of the earth sciences. The lessons learned and experiences presented are based on a survey of use cases, with detailed insights into a few of them.
-
Morris Riedel
EUDAT – Towards A Pan-European Collaborative Data Infrastructure
Vlaams Supercomputing Centrum (VSC) User Day
International Auditorium, Brussels, Belgium
2014-01-16
[ Slides ~1.60 MB (pdf) ] [ Juelich ]
Abstract:
The constantly growing amount of global, diverse, complex, but extremely valuable scientific data is an opportunity, but also a major challenge for research. In recent years, several pan-European e-infrastructures and a wide variety of research infrastructures have been established supporting multiple research communities. But the accelerated proliferation of data arising from powerful new scientific instruments, scientific simulations and the digitization of library resources, for example, has created a more urgent demand for increasing efforts and investments in order to tackle the specific challenges of data management and to ensure a coherent approach to research data access and preservation. A vision of a ‘collaborative data infrastructure’ for science was outlined by the European high level expert group on scientific data, listing 12 high level requirements and 24 challenges to overcome. In this talk, we take stock of activities of the pan-European EUDAT collaborative data infrastructure, which aims to address these challenges and exploit new opportunities to satisfy many of the high level requirements with concrete data services. Data analytics techniques will be highlighted in this context (e.g. machine learning algorithms, statistical data mining approaches, etc.) in order to advance science and engineering in ways not possible before.
-
2013
Morris Riedel
On Establishing Big Data Breakwaters with Analytics
American Geophysical Union (AGU) Fall Meeting 2013
10th December 2013, San Francisco, USA
[ Slides ~2.24 MB (pdf) ]
Abstract:
Many reports mention 'big data waves' as one of the key problems of the next decade and list many approaches, some of which are referred to as 'surfboards'. This talk motivates the necessity to establish the next concrete steps, named 'breakwaters'. 'Breakwaters' here means that a lot of hype exists around big data suggesting that everything needs to be new (e.g. NoSQL databases vs. traditional SQL approaches, map-reduce instead of MPI, etc.), but the truth is that a clever combination of these techniques is needed in many use cases. In order to create more evidence and a clearer picture, the Research Data Alliance (RDA) Big Data Analytics (BDA) group has been created. This talk provides a general introduction to the problem domain of 'big data' and then gives some insights into the RDA BDA activities, motivations, and some use cases. It finishes by emphasizing the necessity of 'data scientists', who are needed in order to contribute to the next generation of exascale performance scales where compute becomes much more intertwined with data.
-
Morris Riedel
Big Data - Status Quo & Trends
Ársfundur Verkfræðistofnunar Háskóla Islands
University of Iceland, Haskolatorg, 27th November 2013, Reykjavik, Iceland
[ Slides ~3.2 MB (pdf) ]
Abstract:
Big Data, the broadly known term for extremely large quantities of data, is changing the way we live and work. As known through media and news, the availability of big data as well as new technology advancements has contributed to new insights in science and engineering, but also provides new streams of revenue for companies and start-ups on a daily basis. This talk offers a relatively neutral overview of the currently used technologies and approaches, including a glimpse into future trends where computing gets much more intertwined with storage (i.e. in-memory processing, in-situ analytics, etc.). The talk balances equally between examples from science and engineering and commercial approaches and industry examples.
-
Morris Riedel
Research Advances in Data-driven Science through Federation
Visit of Prof. Thom Dunning – National Center for Supercomputing Applications (NCSA)
Juelich Supercomputing Centre, 16th October 2013, Juelich, Germany
[ Slides ~7.4 MB (pdf) ]
Abstract:
Big-data-driven scientific research tends to require more and more systems that support the fact that computation and data analysis become more tightly intertwined than in the past. While we naturally need to focus on the scientific applications that work with big data, we should not lose sight of the fact that a new mix of skills and approaches is required to leverage the power of the aforementioned systems to the greatest possible degree. The mix consists of expertise in traditional scientific computing methods, most notably the parallelization of problems, but also in leveraging high performance computing (HPC) and high throughput computing (HTC) methods. These skills must be complemented with capabilities in emerging data analytics approaches, optimized data access and management (e.g. scalable I/O, in-memory computing, etc.), as well as statistical data mining and (emerging parallel) machine learning algorithms (e.g. classification, clustering, regression, etc.). This talk explains how the Juelich Supercomputing Centre is pursuing big-data-driven scientific research using the concept of 'federations', which stand for long-term collaborations that tackle specific scientific problems, create 'trust for data generators', and keep the focus on real problems. The talk concludes with a view towards exascale computers and their applications with combined characteristics of computational simulations, interactive visualization, and 'in-situ data analytics'.
-
Morris Riedel
Understanding Big Data Analytics in the Context of Scientific Computing
2nd Annual CHinese-AmericaN-German E-Science and cyberinfrastructure (CHANGES) Workshop
Hosted by the National Center for Supercomputing Applications (NCSA), September 10-12, 2013, Chicago, USA
2013-09-11, Session 1 on Big Data Analytics
[ Event ] [ Slides ~31.5 MB (pdf) ]
Abstract:
The emerging methods around 'big data analytics', such as map-reduce, NoSQL databases, or approaches that generally depend on key-value pairs, have gained momentum in the last couple of years. But although many speak of major changes in scientific computing as a result of 'big data analytics', it is not at all clear whether, and in which parts, traditional methods of scientific computing will change. This talk thus aims to provide an overview of 'big data analytics' approaches at the intersection of scientific computing and 'big data' that enable a 'high productivity processing of research data'. Most notably this includes sharp views on scientific applications that use 'big data', but also topics such as traditional scientific computing methods, HPC and HTC paradigms, and parallelization as a general concept. The talk reviews some concrete emerging data analytics approaches (e.g. using Apache Hadoop and the underlying Hadoop Distributed File System) in science (e.g. high energy physics), including a glimpse of related statistical data mining and machine learning methods (e.g. using Apache Mahout, R-map-reduce, pbdR, etc.). The talk concludes with a methodology that is currently being established by the Big Data Analytics group of the Research Data Alliance (RDA) in order to enable a systematic analysis of scientific application use cases.
-
Morris Riedel
High Productivity Processing - Engaging in Big Data around Scientific Computing
Málstofa í tölvunarfræði og reiknifræði - Seminar in Computer and Computational Science
Faculty of Industrial Engineering, Mechanical Engineering, and Computer Science
University of Iceland, Reykjavik, Iceland
2013-01-09, 13:00-13:45 Room Naust, Endurmenntun
[ Faculty ] [ Slides ~6.17 MB (pdf) ]
Abstract:
The steadily increasing amount of scientific data and the analysis of 'big data' is a fundamental characteristic in the context of computational simulations that are based on numerical methods or known physical laws. This represents both an opportunity and a challenge on different levels for traditional scientific computing approaches, architectures, and infrastructures. On the lowest level, data-intensive computing is a challenge since CPU speed has surpassed the I/O capabilities of HPC resources; on the higher levels, complex cross-disciplinary data sharing is envisioned via data infrastructures in order to engage in the fragmented answers to societal challenges. This talk highlights how these levels share the demand for 'high productivity processing' of 'big data', including the sharing and analysis of large-scale science datasets. The talk will describe approaches such as the high-level European data infrastructure EUDAT as well as low-level requirements arising from HPC simulation labs that perform scientific computing on a daily basis at the Juelich Supercomputing Centre. The talk also aims to convey that although big data analysis methods such as map-reduce, R, and others exist, a lot of research and evaluation still needs to be done to achieve scientific insights with them in the context of traditional scientific computing environments.
-
2012
Morris Riedel
Nationale Forschungsinfrastrukturen im EU-Rahmen
Forschung – Information – Infrastruktur: Bausteine für Open Science
13. DINI-Jahrestagung, September 2012, Karlsruhe, Germany
2012-09-25, 9:30 Festsaal des Studentenhauses auf dem Campus Süd des Karlsruher Instituts für Technologie (KIT)
German Language
[ Event ] [ Slides ~10.4 MB (pdf) ]
Abstract:
International, national, and regional research infrastructures today make a notable contribution to "open science" through a continuously growing exchange of information among their scientific users. This information exchange is technically highly complex, aims to avoid "data silos", and can take place within a single scientific discipline or across disciplines. The exchanged data often exist in many (proprietary) data formats, and managing these data places high demands on an IT research infrastructure. Persistent data object identifiers, interoperability, metadata, data replication procedures, easy availability, and above all trust in the provided information are just a few examples of these requirements. National research infrastructures cooperate closely with other European partners in the context of the "European Strategy Forum on Research Infrastructures (ESFRI) Roadmap" in order to jointly develop (interoperable) procedures addressing these requirements. The talk conveys the complexity of IT research infrastructures using the examples of the "Common Language Resources and Technology Infrastructure (CLARIN)" and the "Digital Research Infrastructure for the Arts and Humanities (DARIAH)", which contribute within Germany to the European ESFRI roadmap. It also shows how these undertakings relate to solution approaches of the European "Collaborative Data Infrastructure EUDAT".
-
Morris Riedel
Design and Applications of an Interoperability Reference Model for Production e-Science Infrastructures
Seminar at Steinbuch Centre for Computing
Karlsruhe Institute of Technology, Karlsruhe, Germany
2012-07-25
[ Centre ] [ Slides ~7.53 MB (pdf) ]
Abstract:
The term e-science evolved as a new research field that focuses on collaboration in key areas of science using next generation data and computing infrastructures (i.e. e-science infrastructures) to extend the potential of scientific computing. More recently, the increasing complexity of e-science applications that embrace multiple physical models (i.e. multi-physics) and consider a larger range of scales (i.e. multi-scale) is creating a steadily growing demand for world-wide interoperable infrastructures that allow for new innovative types of e-science by jointly using different kinds of e-science infrastructures. This talk presents an infrastructure interoperability reference model (IIRM) design that is tailored to production needs and represents a trimmed down version of the Open Grid Services Architecture (OGSA) in terms of functionality and complexity (i.e. based on lessons learned from TCP/IP vs. ISO/OSI), while on the other hand being more specifically useful for production and thus easier to implement.
-
Morris Riedel
Bedeutung von Interoperabilität und Standards in Grid Infrastrukturen
Guest Lecture about 'Interoperability' at Ludwig-Maximilians University of Munich
Course 'Grid Computing', Summer 2012, Munich, Germany
2012-07-04
German Language
[ University ] [ Slides ~3.13 MB (pdf) ]
Abstract:
There are different Grid infrastructures, each focusing on particular resource types or access mechanisms. Computational researchers are therefore interested in seamless access to the different Grid infrastructures through interoperability, among other reasons because it lets them apply their scientific methods to a broader spectrum of resources. Interoperability thus sounds logical, but it is very complex and requires more than just the use of simple standards. This guest lecture explains this complexity, addressing the role of middleware and standards in Grid infrastructures. It also explains the reference model approach to supporting interoperability and presents different perspectives on the benefits of interoperability.
-
2011
Morris Riedel
e-Science Infrastructure Integration Invariants to Enable HTC and HPC Interoperability Applications
25th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2011
Conference presentation, Anchorage, Alaska, USA
2011-05-16
[ Event ] [ Slides ~5.10 MB (pdf) ]
Abstract:
During the past decade, significant international and broader interdisciplinary research has increasingly been carried out by global collaborations that often share resources within a single production e-science infrastructure. More recently, increasingly complex e-science applications embrace multiple physical models (i.e. multi-physics) and consider longer and more detailed simulation runs as well as a larger range of scales (i.e. multi-scale). This increase in complexity is creating a steadily growing demand for cross-infrastructure operations that take advantage of multiple e-science infrastructures with a greater variety of resource types. Since interoperable e-science infrastructures are still not seamlessly provided today, we proposed in earlier work the Infrastructure Interoperability Reference Model (IIRM), which represents a trimmed down version of the Open Grid Services Architecture (OGSA) in terms of functionality and complexity, while on the other hand being more specifically useful for production and thus easier to implement. This contribution focuses on several important reference model invariants that are often neglected when infrastructure integration activities are performed, thus hindering seamless interoperability in many aspects. In order to indicate the relevance of our invariant definitions, we provide insights into two accompanying cross-infrastructure use cases from the bio-informatics and fusion science domains.
-
Morris Riedel
Overview of EMI Middleware
Joint European Distributed Computing Infrastructures - Summer School (DCISS)
Invited Talk at 'Computer and Automation Research Institute of the Hungarian Academy of Sciences', Budapest, Hungary
2011-07-11
[ Event ] [ Slides ~7.15 MB (pdf) ]
Abstract:
The Distributed Computing Infrastructure (DCI) projects (EGI-InSPIRE, European Middleware Initiative (EMI), Initiative for Globus in Europe (IGE), European Desktop Grid Initiative (EDGI), StratusLab and VENUS-C), funded under the e-Infrastructures topic of the FP7 "Capacities" Specific Programme, will provide a pan-European production infrastructure built from federated distributed resources, ensure the continued support, maintenance and development of the middlewares that are in common use in Europe, explore how grid sites and different applications can be hosted sustainably in commercial, public, publicly procured and private 'cloud computing' environments, and provide desktop resources to the European research community. The subject of the summer school is to give insights to the technologies provided by the EMI, IGE, EDGI and StratusLab projects. This particular invited talk provides a technical overview of the EMI project and its broad product portfolio used in distributed systems world-wide in daily science and engineering.
-
2010
Morris Riedel
Workspaces – Concept and functional aspects
A ‘YouTube for science‘ inspired by the High Level Expert Group Report on Scientific Data
The Language Archive - Repository and Workspace Workshop
Day 2 - Workspaces and Web Services
Invited Talk at Rechenzentrum Garching (RZG), Munich, Germany
2012-09-21
[ Event ] [ Slides ~3.42 MB (pdf) ]
Abstract:
There are several factors influencing science, such as the EU high level expert group report on scientific data, the emerging Research Infrastructures (RIs) of the European Strategy Forum on Research Infrastructures (ESFRI), and the increased popularity of Web community building approaches, methods, and technologies. One vision to address the requirements arising from these factors is a 'YouTube for science' that is not driven by the common 'explorer thinking' with folders and files but rather provides a mechanism to 'dive into data infrastructures'. This invited talk provides pieces of information and a potential architecture showing how such a vision can be implemented using the so-called 'workspaces' approach. The talk outlines key concepts and functional aspects of such workspaces, well embedded in emerging federated 'big data' architectures.
-
2009
Morris Riedel
The Seven Habits of Highly Effective Cloud Standardization Processes
Some Suggestions based on Lessons Learned from Standardization and Interoperability of Next Generation Computing Infrastructures
Invited Talk at CloudCamp Berlin, Germany
2009-04-30
[ Event ] [ Slides ~2.16 MB (pdf) ] [ SlideShare ]
Abstract:
CloudCamp is an unconference where attendees can exchange ideas, knowledge and information in a creative and supporting environment, advancing the current state of cloud computing and related technologies. This invited talk provides brief recommendations to the cloud community based on lessons learned from standardization and interoperability of next generation computing infrastructures. The recommendations are given along the lines of seven steps inspired by 'The 7 Habits of Highly Effective People' by Stephen R. Covey.
-
2008
Morris Riedel
e-Science with UNICORE
Invited Talk at the International Summer School on Grid Computing (ISSGC) 2008, Balatonfuered, Hungary
2008-07-08
[ Event ] [ Slides ~3.66 MB (pdf) ]
Abstract:
The International Summer School on Grid Computing is an institution in the grid computing community, proving itself year after year as a forum for fostering research and innovation in grid computing. Grid-interested students from all over the world come together to learn, to listen, and to add to efforts to make grid computing a globally recognised tool for enabling all-new ways of doing science. This particular talk gives students insights into how Grid middleware can be used with scientific applications in numerous ways. Example applications and approaches are provided using the UNICORE middleware in particular, but the key methods behind the applications can be applied to Grid middleware in general.
-
Morris Riedel
Collaborative Interactivity in Parallel HPC Applications
Conference presentation at 'Instrumenting the Grid (InGrid) 2008', Island of Ischia, Italy
2008-04-09
[ Event ] [ Slides ~4.16 MB (pdf) ]
Abstract:
Large-scale scientific research has, for the past several years, often relied on the collaborative use of massive computational power, fast networks, and large storage capacities provided by e-science infrastructures (e.g., DEISA, EGEE). Especially within e-science infrastructures driven by high-performance computing (HPC) such as DEISA, collaborative online visualization and computational steering (COVS) has become an important technique to enable HPC applications with interactivity and visualized feedback mechanisms. In earlier work we have shown a prototype COVS technique implementation based on the visualization interface toolkit (VISIT) and the Grid middleware of DEISA named Uniform Interface to Computing Resources (UNICORE). Since then the approach has grown into a broader COVS framework. More recently, we investigated the impact of using the computational steering capabilities of the COVS framework implementation in UNICORE on large-scale HPC systems (i.e., an IBM Blue Gene/P with 65536 processors) and the use of attribute-based authorization. In this contribution we emphasize the improved collaborative features of the COVS framework and present new insights into how we deal with the dynamic management of n participants, transparency of Grid resources, and virtualization of end-user hosts. We also show that our interactive approach to HPC systems fully supports the single sign-on feature required in Grid and e-science infrastructures.