As an Applied Research intern at Labelbox, you will design, build, and productionize evaluation and post-training systems for frontier LLMs and multimodal models. You'll own continuous, high-quality evals and benchmarks (reasoning, code, agent/tool-use, long-context, vision-language, and more), create and curate post-training datasets (human and synthetic), and prototype RLHF/RLAIF/RLVR/RM/DPO-style training loops to measure and improve real-world task and agent performance.

- Build and own evaluation and benchmark suites for reasoning, code, agents, long-context, and vision-language models (a minimal harness sketch follows this list).
- Create post-training datasets at scale: design preference/critique pipelines (human and synthetic) and target hard failures surfaced by evals (see the preference-pair sketch below).
- Experiment with and prototype RLHF/RLAIF/RLVR/RM/DPO-style training loops to improve real-world task and agent performance (a DPO loss sketch appears below).
- Land research in product: ship improvements into Labelbox workflows, services, and customer-facing evaluation/quality features, and quantify impact with customer and internal metrics.
- Engage with customer research teams: run pilots, co-design benchmarks, and share practical findings through internal research reports, blog posts, talks, and published papers.
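For a flavor of what owning an evaluation suite involves, here is a minimal, illustrative harness in Python. All names here (`EvalCase`, `run_suite`, the exact-match scorer) are hypothetical sketches, not Labelbox APIs; a production suite would add versioned datasets, per-category breakdowns, and batched model calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    reference: str  # expected answer or rubric key

@dataclass
class EvalResult:
    suite: str
    accuracy: float
    n: int

def run_suite(name: str, cases: list[EvalCase],
              model: Callable[[str], str],
              score: Callable[[str, str], float]) -> EvalResult:
    """Run every case through the model and aggregate per-case scores."""
    scores = [score(model(c.prompt), c.reference) for c in cases]
    return EvalResult(suite=name, accuracy=sum(scores) / len(scores), n=len(scores))

if __name__ == "__main__":
    # Toy reasoning suite with exact-match scoring.
    cases = [EvalCase(prompt="2 + 2 =", reference="4")]
    result = run_suite(
        "toy-arithmetic",
        cases,
        model=lambda p: "4",  # stand-in for a real model call
        score=lambda out, ref: float(out.strip() == ref),
    )
    print(result)
```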
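Preference/critique pipelines, whether human- or model-sourced, typically reduce to records like the one below. This schema is an assumption for illustration, not a Labelbox data format; the `critique` field shows how a human or model critic can document why one completion is preferred, which is useful when targeting failures surfaced by evals.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str         # preferred completion
    rejected: str       # dispreferred completion
    source: str         # "human" or "synthetic"
    critique: str = ""  # optional rationale from a human or model critic

# A pair targeting a failure mode surfaced by an eval suite.
pair = PreferencePair(
    prompt="List the prime factors of 84.",
    chosen="2, 2, 3, 7",
    rejected="2, 3, 14",
    source="synthetic",
    critique="Rejected answer includes 14, which is not prime.",
)
```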
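To make "DPO-style training loop" concrete, below is a minimal PyTorch sketch of the standard DPO objective (Rafailov et al., 2023): maximize the log-sigmoid of the scaled margin between the policy's and a frozen reference model's log-probability gaps on chosen versus rejected completions. The function name and tensor layout are illustrative, not tied to any particular training stack.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is the summed log-probability of a full completion under
    the policy or the frozen reference model, shape (batch,).
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the implicit reward of the chosen completion above the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy batch of two preference pairs (hypothetical log-probs).
loss = dpo_loss(
    torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.9]),
    torch.tensor([-12.0, -8.5]), torch.tensor([-14.2, -9.0]),
)
print(loss.item())
```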