After you log in to iQ Studio for the first time, deploy your model. Model deployment involves verifying access to Model Manager, estimating the resources the model needs, deploying the model, and monitoring the deployment status.
- Log in to iQ Studio as an administrator, and then select Open Model Manager.
- From the Model Operations page, select Resource Recommendation.
- Enter the model details.
  - Model name: Enter the model name. Use letters, numbers, hyphens, underscores, and slashes. Maximum length: 100 characters.
  - Model version/revision: Auto‑filled based on the selected model.
  - Data type: Select the inference precision: Float32, Float16, or BFloat16.
  - Tokens: Enter the number of tokens for the model. The default value is 3200.
- Click Apply.
- Review and note the following recommended values. You will need them when you complete the deployment form.
  - GPU Memory
  - CPU Cores
  - RAM Range
  - Model Size
- From the Model Operations page, select Deploy model.
- Complete the deployment form using the recommended resource values you recorded earlier.
Important: For all model deployments, the trust-remote-code field must be left blank. The displayed text Optional is a placeholder and is not a valid value.
Figure. Embedding model configuration example
Figure. Reranker model configuration example
Figure. Inferencing model configuration example
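Before submitting the form, it can help to sanity-check the values you plan to enter. The following is a minimal sketch, not part of iQ Studio: the dictionary keys are hypothetical names that only mirror the form labels, "letters" is assumed to mean ASCII letters, and the rule that a request must not exceed its limit is an assumption rather than documented behavior.

```python
import re

# Rule from the Model name field: letters, numbers, hyphens, underscores,
# and slashes, at most 100 characters (ASCII letters assumed).
MODEL_NAME_RE = re.compile(r"[A-Za-z0-9_/-]{1,100}")


def check_deployment_form(form):
    """Return a list of problems found in a dict of form values.

    The keys used here are hypothetical; they only mirror the field labels.
    """
    problems = []

    if not MODEL_NAME_RE.fullmatch(form.get("model_name", "")):
        problems.append("model_name: use letters, numbers, '-', '_', '/'; max 100 characters")

    # The trust-remote-code field must be left blank for all deployments.
    if form.get("trust_remote_code", "") != "":
        problems.append("trust_remote_code: must be left blank")

    # Assumption: a request larger than its limit is not a valid combination.
    for resource in ("cpu", "memory_gib"):
        request = form.get(f"{resource}_request")
        limit = form.get(f"{resource}_limit")
        if request is not None and limit is not None and request > limit:
            problems.append(f"{resource}: request exceeds limit")

    # gpu-memory-utilization is a fraction of GPU memory.
    util = form.get("gpu_memory_utilization")
    if util is not None and not 0 < util <= 1:
        problems.append("gpu_memory_utilization: must be in (0, 1]")

    return problems
```

An empty list means none of these checks failed; it does not guarantee that the deployment will be accepted.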
The following table explains each field in the deployment form.
| Field | Description |
| --- | --- |
| Model Details | |
| Model Name | Deployed model endpoint. |
| Model Revision | Specifies the exact version of the model source. |
| Model Version | Identifies the deployment instance generated by the platform. |
| Model Format | Defines the inference runtime and weight format. |
| Resource Allocation | |
| CPU Limits | Sets the maximum number of CPU cores the model workload can use. Exceeding this value throttles execution. |
| CPU Requests | Reserves the specified number of CPU cores for scheduling the workload on a node. |
| Memory Limits | Sets the maximum amount of RAM available to the model container. |
| Memory Requests | Reserves the specified amount of memory to ensure successful scheduling. |
| GPU or MIG Type | Specifies the GPU device or NVIDIA MIG partition used for inference execution. |
| GPU or MIG Requests | Defines the number of GPUs or MIG instances allocated to the model. Used to control parallelism and scaling boundaries. |
| Advanced Parameters – Model Arguments | |
| model | Specifies the filesystem path that contains the model weights and configuration files. |
| trust-remote-code | Controls whether the system allows execution of custom code from the model repository during model loading. Use true for models that require custom implementations; otherwise, leave it empty or set it to false. Required only for models that include custom code. |
| served-model-name | Defines the logical model name exposed by the inference server for client requests. |
| max-model-len | Sets the maximum number of tokens allowed in a single request context. |
| max-num-seqs | Specifies the maximum number of concurrent input sequences. |
| gpu-memory-utilization | Limits the fraction of GPU memory that the inference engine can allocate. |
| enable-auto-tool-choice | Enables automatic selection of tools during inference when the model supports tool calling. |
| tool-call-parser | Specifies the parser used to extract tool call instructions from model output. |
| tensor-parallel-size | Sets the number of GPUs used to shard model tensors for parallel execution. |
| Advanced Parameters – Environment Variables | |
| Environment Variables | Defines runtime key–value pairs injected into the model container environment. |
| HV_IQSTUDIO_MODEL_TYPE | Specifies the execution mode of the model workload. |
| is_default | Specifies whether the deployed model is used as the default model in the serving environment. When set to true, the system routes requests that do not explicitly specify a model to this deployment. |

- Review the configuration.
- Click Submit.
- Monitor the deployment status.
When the model status changes to Ready, the model is deployed and available for use.
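If you track the deployment from a script rather than watching the page, the monitoring step amounts to polling until the status reads Ready. The following is a minimal sketch under stated assumptions: `get_status` is a hypothetical caller-supplied function that returns the current status string, since this guide does not describe a status API, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_ready(get_status, timeout_s=1800.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "Ready" or the timeout expires.

    get_status is a hypothetical stand-in for however you read the
    deployment status. Returns True if "Ready" was seen, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == "Ready":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; in normal use, leave them at their defaults.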