Inference Reference
GenAI Studio supports the model and hardware pairings listed below. Where a GPU type appears more than once for a model, each row is an alternative supported configuration. A sketch of how one row maps onto ordinary inference code follows the tables.
Mistral-7b | Configuration
---|---
T4 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
T4 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
V100 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
A100 | slots_per_trial: 1, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
Llama-2-7b | Configuration
---|---
T4 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
V100 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
A100 | slots_per_trial: 1, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"

Llama-2-13b | Configuration
---|---
T4 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
T4 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 10, swap_space: 8, torch_dtype: "float16"
V100 | slots_per_trial: 2, max_new_tokens: 1500, batch_size: 100, swap_space: 8, torch_dtype: "float16"
V100 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
A100 | slots_per_trial: 1, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
A100 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 100, swap_space: 8, torch_dtype: "float16"

Llama-2-70b | Configuration
---|---
A100 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 100, swap_space: 16, torch_dtype: "float16"
falcon-7b | Configuration
---|---
A100 | slots_per_trial: 1, max_new_tokens: 2000, batch_size: 100, swap_space: 8, torch_dtype: "float16"

falcon-40b | Configuration
---|---
A100 | slots_per_trial: 4, max_new_tokens: 2000, batch_size: 100, swap_space: 16, torch_dtype: "float16"
V100 | slots_per_trial: 8, max_new_tokens: 2000, batch_size: 100, swap_space: 16, torch_dtype: "float16"
mpt-7b | Configuration
---|---
A100 | slots_per_trial: 1, max_new_tokens: 2000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
V100 | slots_per_trial: 2, max_new_tokens: 2000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
T4 | slots_per_trial: 2, max_new_tokens: 2000, batch_size: 100, swap_space: 8, torch_dtype: "float16"
T4 | slots_per_trial: 4, max_new_tokens: 2000, batch_size: 100, swap_space: 8, torch_dtype: "float16"

mpt-30b | Configuration
---|---
A100 | slots_per_trial: 2, max_new_tokens: 4000, batch_size: 100, swap_space: 16, torch_dtype: "float16"
A100 | slots_per_trial: 4, max_new_tokens: 4000, batch_size: 100, swap_space: 16, torch_dtype: "float16"
V100 | slots_per_trial: 8, max_new_tokens: 4000, batch_size: 1, swap_space: 16, torch_dtype: "float16"
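To make the rows above concrete, here is a minimal sketch of how one entry (Mistral-7b on a single A100: slots_per_trial: 1, batch_size: 10, max_new_tokens: 4000, torch_dtype: "float16") maps onto a plain Hugging Face Transformers script. This illustrates what the parameters control; it is not GenAI Studio's internal implementation. The model ID and prompts are placeholders, and swap_space has no direct equivalent here because CPU swap is managed by the serving layer rather than by this script.

```python
# Illustrative only: how the table's parameters map onto an ordinary
# Hugging Face Transformers inference call (not GenAI Studio internals).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "left"  # decoder-only models generate past left padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # torch_dtype: "float16" from the table
    device_map="auto",          # places the model on the available GPU(s)
)

prompts = ["Explain what swap space is."] * 10  # batch_size: 10
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4000)  # max_new_tokens: 4000
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

For multi-GPU rows (slots_per_trial greater than 1), the serving layer spreads the model across that many GPUs; `device_map="auto"` is the closest analogue in this sketch.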
Configuration Description

- `slots_per_trial`: The number of slots (GPUs) each trial (such as a run of inference) uses. For example, if `slots_per_trial` is set to 8 and the hardware type is `V100`, one inference task needs 8 V100 GPUs. (See the sizing sketch after this list for how slot count relates to model size.)
- `max_new_tokens`: The maximum number of new tokens the model can generate during an inference task. Tokens are units of text, such as words or subwords, used in natural language processing. For example, with a `max_new_tokens` value of `4000`, the model can generate up to 4000 new tokens.
- `batch_size`: The number of inputs the model processes at a time.
- `swap_space`: Disk space used as virtual memory when physical memory is fully utilized, allowing the system to store data temporarily when memory is insufficient. A `swap_space` of `16` indicates that 16 GB of disk space is allocated as virtual memory.
- `torch_dtype`: The data type used for the model's weights and computation. For example, `float16` reduces memory usage and can speed up computation.
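These settings interact: the GPUs behind `slots_per_trial` must hold the model weights at the chosen `torch_dtype`, plus working room for the KV cache, which grows with `batch_size` and `max_new_tokens`. The following back-of-the-envelope estimate is illustrative only, not the sizing rule GenAI Studio actually applies:

```python
# Rough weight-memory estimate (illustrative; a real fit also depends on
# the KV cache, activations, and framework overhead).
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """float16 uses 2 bytes per parameter; float32 would use 4."""
    return n_params * bytes_per_param / 1e9

for name, params in [("Llama-2-7b", 7e9), ("Llama-2-13b", 13e9), ("Llama-2-70b", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in float16")

# Llama-2-13b needs ~26 GB for weights alone, more than a single 16 GB T4
# holds, which is why its table rows spread the model across multiple GPUs.
```

This arithmetic also suggests why the largest models (falcon-40b, Llama-2-70b, mpt-30b) raise `swap_space` to 16: bigger weights leave less GPU headroom, so more data may need to spill to CPU-side virtual memory.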