HPC Engineer

Arlington, VA, 1099, W2

The on-site HPC Engineer will assist with tasks as directed by the customer. The right candidate should have experience with HPC environments, a background in Ubuntu Linux, and knowledge of front-end IP networking.

Responsibilities

  • Administration: Monitor, review, and manage Dell infrastructure listed in the SOW. Manage user requests. Manage and review log files. Generate regular operational reports. Provide capacity planning. Assist with disaster recovery planning and design.
  • Problem Management: Isolate and troubleshoot incidents. Perform service incident coordination. Open service requests on behalf of the Customer. Participate in root cause analysis review.
  • Change Management: Assist with and collaborate on software/firmware management. Implement change management requests. Assist with solution documentation of policies and procedures in conjunction with the compliance manager(s) and with key stakeholders. Monitor migration activities.
  • Continual Service Improvement: Recommend procedure changes that result in operational optimization. Share best practices from other engagements. Provide performance tuning recommendations.
  • Post Implementation Planning and Knowledge Sharing: Work with the customer's technical leadership on an ongoing basis to ensure they are aware of system status, and discuss architectural design, strategies, and plans for the future. Perform transition planning with the deployment team. Perform incremental host and network configuration beyond deployment scope. Conduct knowledge transfer for new technology features, management and admin activities, and Standard Operating Procedures. Provide recommendations on product enhancements and upgrades. Implement Dell EMC System Management Tools. Work with customer staff to develop Run Books (document products and environment, including system information, code level, access instructions, configuration, “how-tos”).
  • Change Evaluation and Recommendations: Review IT processes and policies (incident, capacity, performance, and change management; user and backup policy) as part of a new solution or continuous improvement. Assist with the solution documentation of policies and procedures in conjunction with the compliance manager(s) and with other key stakeholders. Conduct knowledge transfer to address the Customer’s skills and resource gaps, and provide technology recommendations.
  • HPC configuration, management and maintenance
  • Extended knowledge transfer
  • SID documentation (planning, cables, labeling, switches, etc.)
  • Addressing hardware failures and support tickets
  • Validation of firmware versions and settings
  • Validation of HPC software versions and settings
  • Validation of Dell HPC best practices
  • Assist with benchmark testing: High Performance Linpack (HPL), Alltoall, bidirectional bandwidth, and STREAM

Requirements

  • TS/SCI Clearance
  • Knowledge of GPFS storage
  • Experience with Ubuntu and SLES
  • Network (InfiniBand and Ethernet) testing and management

Bonus Points
