Computer Vision Engineer (Multimodal with Large Language Models)

San Francisco, California

  Machine Learning


A Computer Vision / AI Vision Platform company is seeking a skilled and enthusiastic Computer Vision Engineer to join their team. In this role, your primary focus will be on developing innovative solutions that combine computer vision techniques with large language models to enable multimodal analysis and understanding of visual and textual data.

  • Multimodal System Design
    • Collaborate with cross-functional teams, including data scientists, machine learning engineers, and software developers, to design and architect multimodal systems that integrate computer vision algorithms and large language models. These systems will process and interpret multimodal data to extract meaningful insights.
  • Computer Vision Model Development
    • Develop and optimize computer vision models for tasks such as object detection, image segmentation, image captioning, and visual question answering. Adapt and fine-tune existing models to work seamlessly with large language models for multimodal analysis.
  • Large Language Model Integration
    • Integrate large language models into the multimodal system. Fine-tune and optimize the language models for tasks such as text generation, sentiment analysis, and language translation to facilitate seamless interaction between visual and textual data.
  • Training and Evaluation
    • Train, validate, and optimize computer vision and large language models using state-of-the-art frameworks and tools. Conduct rigorous testing and evaluation to measure the performance, robustness, and generalization capabilities of the developed multimodal system.
  • Research and Development
    • Stay up to date with the latest advancements in computer vision, natural language processing, and large language models. Conduct research to explore and identify novel approaches that leverage multimodal techniques for enhanced analysis and understanding of visual and textual data.
  • Data Processing and Annotation
    • Work closely with the data team to collect and preprocess multimodal datasets. Develop annotation pipelines and guidelines to ensure accurate labeling and annotation of visual and textual data for training and evaluation purposes.


  • Education
    • Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical Engineering, or a related field. Strong academic background with coursework or research experience in computer vision, natural language processing, machine learning, or deep learning.
  • Technical Skills
    • Proficiency in programming languages such as Python or C++, and in libraries such as OpenCV. Experience with deep learning frameworks like TensorFlow, PyTorch, or Keras. Very strong understanding of computer vision techniques and large language models.
  • Experience
    • Demonstrated experience in developing computer vision solutions, with a focus on multimodal analysis and integration with large language models. Familiarity with dataset handling, preprocessing, and model evaluation. Experience in deploying models in real-world applications is a plus.
  • Collaboration and Communication
    • Excellent teamwork and communication skills, with the ability to collaborate effectively with cross-functional teams. Proficient in presenting complex technical concepts to both technical and non-technical stakeholders.
  • Research Mindset
    • Enthusiasm for staying up to date with the latest research advancements in computer vision, natural language processing, and multimodal analysis. Strong desire to explore and apply cutting-edge techniques to solve challenging problems.