Senior Research Engineer, Multimodal Video Foundation Model

Tether

📍 Zürich, CH | 💰 $90k - $150k | 🕐 Posted 2 months ago
Data Scientist, Remote, Machine Learning, Python, Deep Learning, Computer Vision, AI Research

Job Description

About Tether

At Tether, we're not just building products, we're pioneering a global financial revolution. Our cutting-edge solutions empower businesses—from exchanges and wallets to payment processors and ATMs—to seamlessly integrate reserve-backed tokens across blockchains. By harnessing the power of blockchain technology, Tether enables you to store, send, and receive digital tokens instantly, securely, and globally, all at a fraction of the cost. Transparency is the bedrock of everything we do, ensuring trust in every transaction.

About the Role

As a member of the AI model team, you will drive innovation in architecture development for cutting-edge models of various scales, including small, large, and multi-modal systems. Your work will enhance intelligence, improve efficiency, and introduce new capabilities to advance the field.

You will bring deep expertise in video generation model architectures and a hands-on, research-driven approach. Your mission is to explore and implement novel techniques and algorithms that lead to groundbreaking advancements, including curating data, strengthening baselines, and identifying and resolving pre-training bottlenecks to push the limits of model performance.

Responsibilities

  • Pioneer multimodal and video-centric research that moves fast and breaks ground, contributing directly to usable prototypes and scalable systems.
  • Design and implement novel AI architectures for multimodal language models, integrating text, visual, and audio modalities.
  • Engineer scalable training and inference pipelines optimized for large-scale multimodal datasets and distributed systems spanning thousands of GPUs.
  • Optimize systems and algorithms for efficient data processing, model execution, and pipeline throughput.
  • Build modular tools for preprocessing, analyzing, and managing multimodal data assets (e.g., images, video, text).
  • Collaborate cross-functionally with research and engineering teams to translate cutting-edge model innovations into production-grade solutions.
  • Prototype generative AI applications showcasing new capabilities of multimodal foundation models in real-world products.
  • Develop benchmarking tools to rigorously evaluate model performance across diverse multimodal tasks.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience.
  • Expertise in Python and PyTorch, including practical experience with the full development pipeline, from data processing and data loading to training, inference, and optimization.
  • Experience working with large-scale text data, or (bonus) interleaved data spanning audio, video, image, and/or text.
  • Direct hands-on experience developing or benchmarking models in at least one of the following areas: LLMs, Vision Language Models, Audio Language Models, generative video models.

Nice to Have

  • PhD in Computer Vision, Machine Learning, NLP, Computer Science, Applied Statistics, or a closely related field.
  • Demonstrated expertise in computer vision, video generation foundation models, and/or multimodal research.
  • First-author publications at leading AI conferences such as CVPR, ICCV, ECCV, ICML, ICLR, and NeurIPS.

Compensation

Estimated: $90k - $150k

Location

Zürich, ZH, CH