We are building an AI platform designed to operate autonomously at scale, enabling enterprise systems to learn, adapt, and execute complex tasks efficiently.
As a Senior ML Infrastructure Engineer, you will design and build the pipelines, automation, and tools that power the entire ML lifecycle — from data ingestion to model deployment. Your work will directly enable ML Engineers and Data Scientists to move from experiment to production with speed, confidence, and reliability.
You’ll collaborate with experienced platform engineers working on Terraform, container orchestration, observability, and CI/CD to ensure seamless integration between process-level automation and system infrastructure.
Responsibilities
Design and implement reproducible ML pipelines for data ingestion, feature engineering, model training, evaluation, and deployment.
Develop internal tooling and workflows for experiment tracking, model registry, and automated evaluation.
Automate model validation pipelines, integrating statistical tests, fairness checks, and performance regression detection.
Optimize training workflows for distributed GPU clusters and efficient cloud resource usage.
Collaborate with ML Engineers to deploy and monitor models in production, applying modern CI/CD and infrastructure-as-code practices.
Abstract complex infrastructure into reusable templates, SDKs, or tools that improve team productivity.
Work with Data Scientists to standardize experiment setup, metadata, and artifact management.
Establish observability standards for ML systems (logging, metrics, tracing, alerting).
Partner with global platform and data engineering teams to align pipeline requirements with platform capabilities (storage, compute, secrets management, orchestration).
Minimum Qualifications
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
3+ years of experience in software engineering or ML infrastructure roles.
Proficiency in Python, SQL, and shell scripting.
Hands-on experience with cloud services (AWS, GCP, or Azure) and container orchestration (Docker, Kubernetes).
Experience building data or model pipelines using tools like Airflow, Kubeflow, or MLflow.
Familiarity with model deployment frameworks such as TensorFlow Serving, Triton Inference Server, vLLM, or FastAPI.
Experience implementing CI/CD for ML systems and managing infrastructure as code (Terraform or similar).
Strong English communication skills to collaborate with global (US-based) teams.
Preferred Qualifications
Experience optimizing GPU workloads and distributed training systems.
Knowledge of feature stores, vector databases, or retrieval infrastructure for large language models.
Familiarity with model monitoring and evaluation in production environments.
Strong understanding of security and compliance best practices in ML infrastructure.
Apply via email: send your CV to contact@kwise.io
