Los Alamos National Laboratory on Supercomputing and Monitoring at Scylla Summit 2017
Joshi Fullop of Los Alamos National Laboratory is giving a talk at Scylla Summit 2017 about high-performance computing. I recently had the chance to interview him about his upcoming talk and wanted to share a preview of what to expect. Scylla Summit is a great opportunity to learn more about what is going on in the industry through a variety of talks from experts in the field. Let’s begin the interview.
Can you tell us about yourself and what you do at Los Alamos National Laboratory?
I am a high-performance computing (HPC) scientist/engineer working on scheduling, monitoring, and analytics at Los Alamos National Laboratory. I was formerly with the National Center for Supercomputing Applications (NCSA), where I worked for 19 years.
How did you get started in HPC and what do you like about working in HPC?
I started at NCSA at the University of Illinois in early 1998 where I worked on the Symera NT Supercluster project. That was an effort to create a COM-based infrastructure to do dynamic distributed computing in the Microsoft environment. Later, as things switched to Linux, I wrote a cluster monitoring system called CluMon that brought together job information with node-level performance metrics and displayed them visually via the web. It was quite basic but better than anything else available at the time.
Then, and even still today, there are very few avenues to get into HPC. There is no HPC certification, and if there were, it would likely be obsolete in a couple of years. Students have the best opportunities by getting involved in summer cluster institute programs like the one put on here at Los Alamos.
Working in HPC is a constant thrill ride. One day you’ll be divining how to squeeze a little bit more performance out of a job, and the next day you’re architecting tomorrow’s supercomputer infrastructure. The stakes are high and one really has to think things through. ‘Oops-ing’ at scale on a multi-hundred million dollar machine is not something one wants to do with any frequency. HPC is a relatively small community filled with some amazingly talented individuals. I like to think that it is similar to working on the space program in the 60s.
What will you be talking about at Scylla Summit 2017?
How HPC is beginning to evolve and how we use supercomputers to monitor supercomputers.
What sort of tools are used for monitoring supercomputers?
Cluster monitoring efforts have evolved over the past couple of decades, driven by the ever-increasing size of the machines being stood up. There are a number of server or small-cluster monitoring tools out today, and looking at and tracking a few machines is not difficult. However, visualizing and understanding what is happening on 20 or 30 thousand nodes at once is a different story. Accomplishing that is a delicate area of research: one needs to gather lots of data from numerous machines with little to no impact on those systems. Once this data is collected, it has to be stored somewhere it is readily usable.

Data falls into two main categories: metrics and events. Metrics are measured data like system load, memory usage, or bandwidth. Events are things that happen… think syslog.

Unfortunately, we have had to create many of our own solutions over the years, as we operate at a scale that very few have to contend with. Most tools are research projects that get bolstered up for production use. It is a costly way to go about the task, but sometimes we have no choice. Some very interesting tools that have come from this space are LDMS, a metric collection system, along with Baler and HELO, two similar machine-learning log classification systems that identify and tag log messages on the fly. We’ve used these tools on a number of very large-scale machines. But we are always evaluating new systems and technologies that we may employ to various ends. Fortunately, more and more projects today are being developed with scalability in mind.
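To make the metrics-versus-events distinction concrete, here is a minimal Python sketch of how the two record types might be modeled and keyed for a wide-column store like Scylla. The names and layout are my own illustration, not LANL's actual tooling:

```python
from dataclasses import dataclass
import time

# Metrics: regularly sampled measurements (load, memory, bandwidth).
@dataclass
class Metric:
    node: str
    name: str        # e.g. "load1" or "mem_used_bytes" (illustrative names)
    value: float
    timestamp: float

# Events: discrete things that happen, e.g. syslog messages.
@dataclass
class Event:
    node: str
    source: str      # e.g. "syslog"
    message: str
    timestamp: float

def partition_key(record):
    # Keying by (node, series) keeps each node's time series together,
    # a common layout for wide-column stores; timestamps would then
    # serve as the clustering column within a partition.
    if isinstance(record, Metric):
        return (record.node, record.name)
    return (record.node, record.source)

m = Metric(node="node0042", name="load1", value=3.7, timestamp=time.time())
e = Event(node="node0042", source="syslog",
          message="link retrain on port 3", timestamp=time.time())
print(partition_key(m))  # ('node0042', 'load1')
print(partition_key(e))  # ('node0042', 'syslog')
```

The point of the sketch is only that the two categories have different shapes (a numeric sample versus a free-form message) but can share one partitioning scheme, which is what makes a single scalable store attractive for both.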
What type of audience will be interested in your talk?
My presentation is for anyone interested in supercomputers or monitoring. These are great topics that should draw interest from a diverse set of audiences.
Can you please tell me more about your talk?
First we will look at how HPC is different from cloud computing in terms of infrastructure and application architecture. Then I will discuss how those things are changing and why. Finally, I will dive into a use case of monitoring supercomputers as an application area for Scylla.
Is cloud computing taking advantage of HPC today? If so how?
Comparing HPC to cloud computing is something I address in the talk, so I won’t go too deep into it here. It should suffice to say they are like teen siblings that share a lot of clothes but are involved in different sports.
How can people get in touch with you?
The best way to get in touch with me is through email.
Thank you very much, Joshi. We can’t wait to see your talk in person and learn more. If you want to attend Scylla Summit 2017 and enjoy more than 40 talks like this one, please register here.
Scylla Summit is taking place in San Francisco, CA on October 24-25. Check out the full agenda on our website to learn about the rest of the talks—including technical talks from the Scylla team, the Scylla roadmap, and a hands-on workshop where you’ll learn how to get the most out of your Scylla cluster.