Binance

Fine Tuning/Post Training Data Scientist - RL (GRPO, PPO, RLHF)

Asia
Full-time

About the Role

You will develop and optimize Reinforcement Learning (RL) models for enterprise-scale applications such as customer service, token reporting, compliance, and Web3 domain reasoning.

You will explore and evaluate advanced algorithms including PPO, GRPO, DPO, RLHF, RLAIF, and Agentic RL to enhance the capabilities of LLMs, VLMs, and Agentic AI at Binance. The role requires a strong theoretical foundation in RL (policy optimization, reward modeling, and planning), paired with the engineering skills to build scalable production systems.

You will take full ownership from research through deployment, driving experimentation with systematic evaluation and benchmarking. Collaboration across research, infrastructure, and application teams will be key to delivering impactful AI solutions.

Responsibilities:

  • Research and develop state-of-the-art RL algorithms, focusing on large model optimization and alignment techniques.
  • Design and implement RL training pipelines, including environment simulation, data generation, and reward function design.
  • Apply RL methods to enhance LLM/VLM/Agentic AI capabilities in reasoning, planning, and autonomous decision-making.
  • Collaborate with engineers and researchers to integrate RL solutions into enterprise AI platforms.
  • Monitor model performance in production and continuously improve it through iterative training and fine-tuning.

Requirements:

  • Master’s Degree in Computer Science, Applied Mathematics, Machine Learning, or related fields.
  • 5+ years of hands-on experience in RL and in optimization of at least one of the following: LLMs, VLMs, or Agentic AI.
  • Strong coding skills in Python, with experience in ML frameworks and RL libraries.
  • Experience with large-scale distributed training and optimization.
  • Self-driven, with an ownership mindset and strong problem-solving skills. Excellent communication skills for cross-functional collaboration.
