AI Help¶
FAQ¶
For questions about specific software such as Python, Open OnDemand, or Custom Installations, visit the Applications FAQ.
Accounts and Investment¶
Q: How do I create a HiPerGator account?
A: HiPerGator accounts cannot be created directly by users; they must be requested with a valid sponsor's approval. Please submit a request via https://www.rc.ufl.edu/get-started/hipergator/request-hipergator-account/
Q: How do I purchase HiPerGator resources or reinvest in expired allocations?
A: If you're a sponsor or account manager, please fill out a purchase form at https://www.rc.ufl.edu/get-started/purchase-allocation/
Q: How do I add users to a group?
A: To gain access to a given group, submit a ticket via the RC Support Ticketing System with a subject line in the format "Add (username) to (groupname) group".
Q: I can't log in to my HPG account.
A: Visit our Blocked Accounts wiki page.
Storage¶
Q: I can't see my (or my group's) /blue or /orange folders!
A: Listing /blue or /orange directly will not show your group's directory tree. The directory is automatically connected (mounted) when you try to access it in any way, e.g. with an 'ls' or 'cd' command. For example, if your group name is 'mygroup', list or cd into /blue/mygroup or /orange/mygroup. See also this short video: https://web.microsoftstream.com/video/87698fe6-84df-40dc-9d22-c3a6c63820fa
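As a minimal check from a terminal, assuming your group is named 'mygroup' (substitute your own group name):

    # Listing /blue alone will not show group directories that are not yet mounted
    ls /blue
    # Accessing the full group path triggers the automatic mount
    ls /blue/mygroup
    cd /blue/mygroup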
Q: Why do I see "No Space Left" in job output or application error?
A: If you see 'No Space Left' or a similar message (no quota remaining, etc.), check the path(s) in the error message closely for 'home', 'orange', 'blue', or 'red', and then check the quota for that filesystem. All quota commands are in the 'ufrc' environment module and include 'home_quota', 'blue_quota', and 'orange_quota'. See Getting Started and Storage for more help.
A convenient interactive tool for seeing what is taking up your storage quota is the ncdu command in a bash terminal; run it, then delete data or move it to different storage to free up space (see the sketch below).
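A short sketch of the commands mentioned above, assuming your group is named 'mygroup':

    # Load the ufrc module that provides the quota commands
    module load ufrc
    # Check usage and limits on each filesystem
    home_quota
    blue_quota
    orange_quota
    # Interactively inspect what is using space under your group's blue directory
    ncdu /blue/mygroup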
If the data taking up most of the space belongs to application environments and packages such as conda, pip, or Singularity, you can modify your configuration files to change the default directories used for custom installs. You can find more information about the .condarc setup here: Conda
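A minimal sketch of redirecting package caches and environments to blue storage. The target path (/blue/mygroup/$USER) and the cache directory names below are assumptions, so adjust them for your group, and see the Conda page for the full .condarc setup:

    # Point conda environments and the conda package cache at blue storage instead of /home
    conda config --add envs_dirs /blue/mygroup/$USER/conda/envs
    conda config --add pkgs_dirs /blue/mygroup/$USER/conda/pkgs
    # Relocate the pip and Singularity caches as well (set these in your shell profile)
    export PIP_CACHE_DIR=/blue/mygroup/$USER/.pip_cache
    export SINGULARITY_CACHEDIR=/blue/mygroup/$USER/.singularity_cache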
Performance¶
Q: Why is HiPerGator running so slow?
A: There are many reasons why users may experience unusually low performance on HPG. First, make sure the performance issue does not originate from your Internet service provider, home network, or personal device.
Once those possible causes have been ruled out, report the issue as soon as possible via the RC Support Ticketing System. When reporting the issue, please include detailed information such as (the commands sketched after this list can help gather some of it):
- Time when the issue occurred
- JobID
- Nodes being used, e.g. as shown in your shell prompt (username@hpg-node$). Note: login nodes are not high-performance nodes, and intensive jobs should not be run on them.
- Paths, file names, etc.
- Operating system
- Method used to access HPG: JupyterHub, Open OnDemand, or a terminal interface.
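A few commands that may help collect this information:

    # Inside a batch job, SLURM exposes the job ID and node list as environment variables
    echo "JobID: $SLURM_JOB_ID on nodes: $SLURM_JOB_NODELIST"
    # From a login node, list your running and pending jobs and the nodes they use
    squeue -u $USER
    # The node you are currently logged in to
    hostname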
Q: Are there profiling tools installed on HiPerGator that help identify performance bottlenecks?
A: REMORA is the most general-purpose profiling tool on the cluster. More specific tools depend on the application stack or language, e.g. cProfile for Python code, Nsight Compute for CUDA applications, or VTune for C/C++ and MPI code.
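For example, a minimal cProfile run for a Python script (train.py is a hypothetical script name):

    # Profile the script and save the results to a file
    python -m cProfile -o profile.out train.py
    # Print the 20 most expensive calls by cumulative time
    python -c "import pstats; pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)"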
Q: Why is my job still pending?
A: According to the SLURM documentation, when a job cannot be started, the reason is immediately determined and recorded in the job's "Reason" field, shown in the squeue output, and the scheduler moves on to consider the next job.
Related article: Account and QOS limits under SLURM
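For example, the reason for a pending job can be checked from the command line (replace <jobid> with your job's ID):

    # For pending jobs, the NODELIST(REASON) column shows why the job has not started
    squeue -u $USER
    # scontrol shows the same reason in more detail for a single job
    scontrol show job <jobid> | grep -i reason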
Common reasons why jobs are pending¶
- Priority: Resources are being reserved for a higher priority job. This is particularly common for Burst QOS jobs. Refer to the Choosing QOS for a Job page for details.
- Resources: Required resources are in use.
- Dependency: Job dependencies have not yet been satisfied.
- Reservation: Waiting for an advanced reservation.
- AssociationJobLimit: The user or account job limit has been reached.
- AssociationResourceLimit: The user or account resource limit has been reached.
- AssociationTimeLimit: The user or account time limit has been reached.
- QOSJobLimit: The Quality of Service (QOS) job limit has been reached.
- QOSResourceLimit: The Quality of Service (QOS) resource limit has been reached.
- QOSTimeLimit: The Quality of Service (QOS) time limit has been reached.
New User's Guide¶
For new users on HiPerGator, please read Getting Started to become familiar with the HiPerGator system, and take the New user training for step-by-step instructions on how to use HiPerGator.
AI Education and Training provides learning materials and training videos on various AI topics. JupyterHub and Jupyter Notebooks on HiPerGator are popular platforms for developing and running AI programs.
AI Software¶
A comprehensive software stack for AI research is available on HiPerGator for both CPU and GPU accelerated applications.
The NLP page has more information about the software environment on HiPerGator for Natural Language Processing.
The Computer Vision page describes the software environment on HiPerGator for image and video processing.
AI Frameworks¶
AI frameworks provide building blocks for designing and training machine learning and deep learning models. The following AI frameworks are available on HiPerGator.
PyTorch¶
PyTorch is a deep learning framework developed by the Facebook AI Research lab. It has interfaces for Python, Java, and C++ but is most commonly used with Python. It supports training on both GPUs and CPUs, as well as distributed training and multi-GPU models. See our PyTorch quickstart page for help getting started with PyTorch on HiPerGator.
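A minimal sketch for confirming that PyTorch can see a GPU from a terminal session on a GPU node; the module name 'pytorch' is an assumption, so confirm the available module names and versions on the system first:

    # Load a PyTorch environment module (name/version may differ)
    module load pytorch
    # Verify that CUDA devices are visible to PyTorch
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"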
Tensorflow/Keras¶
- TensorFlow is an open-source AI framework/platform developed by the Google Brain team. Keras is an open-source neural network library that runs on top of TensorFlow. As of TensorFlow 2.0, the Keras API is integrated into TensorFlow's core library and serves as the high-level Python interface for TensorFlow. TensorFlow supports both GPUs and CPUs, as well as multi-GPU and distributed training. APIs are available for Python, Java, Go, and C++. See our TensorFlow quickstart page for help getting started using TensorFlow on HiPerGator (a minimal GPU check is also sketched after this list).
- TensorBoard is a visualization tool for monitoring neural network training.
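As referenced above, a minimal sketch for checking GPU visibility with TensorFlow 2.x; the module name 'tensorflow' is an assumption, so confirm the available module names and versions first:

    # Load a TensorFlow environment module (name/version may differ)
    module load tensorflow
    # List the GPUs TensorFlow can use
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"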
Scikit-learn¶
Scikit-learn is a Python library for machine learning and statistical modeling. It is available in many of the Python modules on HiPerGator.
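For instance, to check which scikit-learn version a given Python module provides (the module name 'python' is an assumption):

    # Load a Python environment module (name/version may differ)
    module load python
    python -c "import sklearn; print(sklearn.__version__)"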
MATLAB¶
MATLAB provides convenient toolboxes for machine learning, deep learning, computer vision, and automated driving, which are supported on both CPUs and GPUs.
Fastai¶
Fastai simplifies training fast and accurate neural networks using modern best practices. It can also be used without any installation via Google Colab.
NVIDIA AI Software¶
Nvidia provides comprehensive, GPU-accelerated software libraries, toolkits, frameworks, and packages for big-data and AI applications. Many of the libraries, such as cuDNN, are included in the CUDA installation on HiPerGator. The following domain-specific CUDA-enabled tools are available on HiPerGator:
- Clara Parabricks (1) is a computational framework for genomics applications. It provides GPU-accelerated libraries, pipelines, and reference AI workflows for genomics research. We have a license for this software through 2021.
- MONAI (2) is the open-source foundation being created by Project MONAI. MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging. It provides domain-optimized foundational capabilities for developing healthcare imaging training workflows in a native PyTorch paradigm.
- Megatron-LM can train several language model architectures, including GPT, T5, and an improved BERT. Megatron also recently added a transformer-based image classification architecture.
- Modulus (3) is a neural network framework that blends the power of physics in the form of governing partial differential equations (PDEs) with data to build high-fidelity, parameterized surrogate models with near-real-time latency.
- NeMo (4) is an open-source, GPU-accelerated Python toolkit for conversational AI, including automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications. NeMo is available via a container in the apps folder.
- RAPIDS (5) is a suite of open-source software libraries and APIs that lets you execute end-to-end data science and analytics pipelines entirely on GPUs (a short cuDF sketch follows this list).
- Triton (6): the Nvidia Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inferencing for any model managed by the server.
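As an illustration of the RAPIDS item above, a minimal cuDF sketch to run on a GPU node; the module name 'rapids' is an assumption, so confirm the available module names and versions first:

    # Load a RAPIDS environment module (name/version may differ)
    module load rapids
    # Build a small GPU DataFrame with cuDF and compute a column sum
    python -c "import cudf; df = cudf.DataFrame({'x': [1, 2, 3]}); print(df['x'].sum())"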
AI Reference Datasets¶
A variety of reference machine learning and AI datasets are located in /data/ai/ref-data. Browse the catalog of all available AI reference datasets to learn more.
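To see what is available directly from a terminal:

    # List the top level of the AI reference dataset collection
    ls /data/ai/ref-data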
AI Examples¶
A suite of short examples of using the HiPerGator software stack for different imaging and NLP tasks is provided in /data/ai/examples. Browse the catalog of all available AI examples to learn more.
AI Benchmarks and Models¶
Several commonly used benchmarks and models for NLP are provided on HiPerGator at /data/ai/benchmarks and /data/ai/models.