Desay SV Automotive Singapore Pte. Ltd. is an innovative organization committed to exploring frontier technologies. While the company has a strong background in automotive electronics, this role is exclusively focused on advancing applications in large language models and on-device AI inference.
Duties/Responsibilities
- On-Device Inference Engine Development. Design, develop, and optimize LLM inference engines for embedded, mobile, and edge devices — covering operator development, graph optimization, memory management, and multi-backend adaptation
- Model Compression & Lightweight Deployment. Research and apply quantization (INT4/INT8/FP16), pruning, distillation, and KV Cache compression techniques to achieve efficient inference on resource-constrained hardware
- Heterogeneous Hardware Optimization. Conduct operator-level performance tuning for ARM CPU, NPU, GPU, and DSP; use profiling tools to identify bottlenecks and continuously improve inference throughput and latency
- LLM Inference Acceleration. Participate in building LLM inference acceleration solutions — including speculative decoding, continuous batching, and KV Cache optimization — to improve model response efficiency on edge devices
- Cloud–Edge Collaboration. Collaborate on cloud AI Infra and on-device deployment pipelines: model export (ONNX/TorchScript), training–inference consistency validation, and joint cloud–edge inference architecture design
- Track Frontier LLM Developments. Stay current with cutting-edge LLM research; explore feasible paths for applying the latest model capabilities (e.g., reasoning models, multimodal) to real-world embedded product scenarios
Requirements
- Master's degree or above in Computer Vision, Machine Learning, Automation, or a related field
- C++ Proficiency (Core Requirement). Expert-level C++ with a deep understanding of memory models, concurrency, and low-level optimization. Proficient in Python for model conversion, evaluation scripts, and training tooling
- Cloud AI Infra or Embedded Inference Framework Experience. Hands-on experience with either: (a) large-scale GPU training cluster operations and optimization, or (b) core module development in on-device inference frameworks such as MNN, TNN, NCNN, or ExecuTorch
- Large Model Algorithm Fundamentals. Solid understanding of Transformer attention mechanisms, KV Cache, continuous batching, and speculative decoding. Familiar with mainstream open-source model architectures including LLaMA, Qwen, Gemma, and Mistral
- Embedded Systems & Heterogeneous Hardware. Understanding of embedded system principles and heterogeneous hardware architectures (ARM, Snapdragon, MTK, Apple Silicon). Experience with driver adaptation or BSP is a plus
- Engineering Discipline. Proficient in Linux development environments; experienced with performance profiling (perf, Instruments, Snapdragon Profiler), unit testing, and CI/CD workflows