- Experience running and optimizing Machine Learning / Artificial Intelligence workloads.
- Some experience with ML/AI tools, e.g., PyTorch, TensorFlow, etc.
- Serve as the primary contact for a GPU+CPU cluster.
- Collect team feedback and relay it to the support team (schedule downtimes/maintenance, propose changes to the cluster, etc.).
- Perform capacity planning to help determine compute/storage needs for the team moving forward.
- Serve as the owner of the SLURM job scheduler, defining the configuration that best fits the team and developing/enabling advanced features.
- Serve as the team datasets owner (manage the datasets that live in the cluster and how people access them).
- Help the team optimize/troubleshoot complex jobs/pipelines (AI-centric, simulation, 3D graphics, etc.).
- Educate the team on how to use the cluster (SLURM, BeeGFS, datasets, etc.), enabling a fast ramp-up time for new scientists and engineers (via tutorials, presentations, wiki docs, etc.); a minimal example of this kind of tutorial material is sketched after this list.
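For illustration only (not part of the role description), the sketch below shows the kind of workload the cluster owner helps users run and document: a PyTorch task that bootstraps torch.distributed from SLURM-provided environment variables. It assumes MASTER_ADDR and MASTER_PORT are exported by the submission script, and the model/data code is omitted.

```python
# Minimal sketch: initialize a PyTorch distributed job from SLURM environment
# variables. Assumes MASTER_ADDR/MASTER_PORT are exported by the job script.
import os

import torch
import torch.distributed as dist


def init_from_slurm():
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])  # total tasks in the job
    local_rank = int(os.environ["SLURM_LOCALID"]) # task index on this node

    # NCCL backend for GPU nodes; fall back to gloo on CPU-only nodes.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = init_from_slurm()
    print(f"rank {rank}/{world_size} ready on local device {local_rank}")
    dist.destroy_process_group()
```

A job script would typically launch this with srun, which sets SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID for each task.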
Desired skills:
- Good communication skills. You can effectively communicate with a variety of stakeholders, including presenting plans to upper management and having technical discussions with engineers/scientists.
- Experience designing and managing large clusters with heterogeneous HW (CPUs, GPUs, etc.).
- User-centric and results-oriented. You can learn from data what the needs of our scientists/engineers will be and produce a cluster growth plan to fulfill those needs; a toy projection of this kind is sketched at the end of this posting.
- Power user. You are willing to extensively test the different workflows that run in the cluster and help optimize them.
- Cluster tech stack. You are an expert in cluster orchestration and management, familiar with technologies such as SLURM, BeeGFS, Docker, etc. (or you are willing to learn them quickly).

Minimum Educational Requirement: BS degree or higher.
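As a toy illustration of the data-driven growth planning mentioned above, the sketch below fits a linear trend to monthly GPU-hour usage and checks whether projected demand exceeds a hypothetical capacity. All numbers are placeholders, not real accounting data; in practice the history would come from SLURM accounting tools such as sacct or sreport.

```python
# Toy capacity-planning sketch (illustrative only): fit a linear trend to
# monthly GPU-hour usage and project it forward against cluster capacity.
from statistics import mean

# Placeholder usage history; in practice pulled from sacct/sreport.
monthly_gpu_hours = [4200, 4800, 5100, 5900, 6400, 7000]
# Hypothetical capacity: 4 nodes x 8 GPUs x ~720 hours per month.
capacity_per_month = 4 * 8 * 720


def linear_trend(ys):
    """Least-squares slope and intercept of ys against month index 0..n-1."""
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


slope, intercept = linear_trend(monthly_gpu_hours)
for months_ahead in range(1, 25):
    month_index = len(monthly_gpu_hours) - 1 + months_ahead
    projected = intercept + slope * month_index
    if projected > capacity_per_month:
        print(f"Projected to exceed capacity in ~{months_ahead} month(s): "
              f"{projected:.0f} vs {capacity_per_month} GPU-hours/month")
        break
else:
    print("Projected demand stays within capacity for the next 24 months")
```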