MISSION
As the founding MLOps engineer, design and build Shizuku’s ML infrastructure from the ground up. Establish the complete pipeline — from data ingestion through training environments to model serving — creating an internal platform that empowers ML engineers to iterate on models at maximum velocity.
Replace individual, siloed development environments with a unified team-scale ML development platform, maximizing the speed of Shizuku’s evolution.
ABOUT SHIZUKU
Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scale.
As the first Japanese startup to receive investment from a16z, we closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.
TEAM STRUCTURE
You will work closely with founder Aki (ML engineer and researcher, ex-Meta, ex-Luma AI) and Engineering Director Ohno to drive the design and construction of our ML infrastructure. As the first MLOps engineer, you’ll have significant autonomy — from technology selection to operational design.
Once the foundation is in place, career paths include a management track leading a growing team and an IC track deepening technical expertise, depending on your aspirations.
CURRENT STATE & WHAT YOU’LL BUILD
- Infrastructure Status: Modern application infrastructure is in place, but ML training and MLOps tooling are not yet established. AWS adoption is planned
- What You’ll Build: An internal platform for ML engineers developing Shizuku’s AI models. The goal: eliminate siloed, ad hoc local workflows and code that only one person owns, and replace them with a team-oriented ML development foundation
KEY RESPONSIBILITIES
- Design, build, and operate the end-to-end ML training pipeline: data collection/preprocessing → training → evaluation → deployment
- Design and build cost-optimized GPU training infrastructure on AWS (A100, L4, etc.)
- Build an internal ML platform for engineers: experiment tracking, model versioning, and reproducibility guarantees
- Design and build model serving infrastructure: inference APIs, auto-scaling, and latency management
- Establish training data management and quality assurance pipelines
- Design and implement CI/CD for ML: automated training, model testing/evaluation, and staged rollouts
- Drive production integration of models in collaboration with ML Engineer and SWE teams
- Build monitoring and visibility infrastructure for long-term compute cost and GPU utilization tracking
REQUIREMENTS
- 3+ years of experience designing, building, and operating cloud infrastructure on AWS, GCP, or equivalent platforms
- Experience building ML/DL pipelines and infrastructure
- Hands-on experience designing and operating production environments using container technologies (Docker/Kubernetes)
- Experience managing infrastructure as code (Terraform, Pulumi, etc.)
- Strong Python skills for building tools and pipelines
- Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)
NICE TO HAVE
- Experience building, operating, and cost-optimizing GPU clusters (A100, H100, L4, etc.)
- Experience with ML platforms: SageMaker, Vertex AI, Ray, Kubeflow, etc.
- Experience deploying and operating experiment tracking infrastructure: MLflow, Weights & Biases, DVC, etc.
- Experience building model serving infrastructure: Triton Inference Server, TorchServe, vLLM, SGLang, etc.
- Experience designing and building internal ML development platforms
- Domain-specific knowledge of ML workloads in speech, NLP, or vision
- Experience as a founding infrastructure/MLOps engineer at a startup
- Technical communication skills in English (internal communication is currently Japanese-first; we plan to transition to a global environment in the medium term)
WHO YOU ARE
- Founding Engineer Mentality: You don’t wait for established systems to improve; you define the design philosophy and build the foundation from zero. You’re energized by creating the system itself, not just refining one
- ML-Literate Infrastructure Engineer: You understand the unique characteristics of ML training and inference workloads, and you translate that understanding into well-designed infrastructure
- Purpose-Driven Ownership: You work backward from the goal of maximizing ML team velocity, set your own priorities, and drive execution autonomously
- Comfort with Ambiguity: You design for a world where model count, training frequency, and data volume are still being defined, starting small and scaling the architecture as the picture clarifies
- Resilience & Respect: You engage as an equal partner with ML Engineers and SWEs, elevating the entire team’s productivity through collaboration