Dynamic and innovative Applied AI Scientist with over 3 years of experience in designing and implementing cutting-edge AI/ML solutions within the B2B SaaS domain. Adept at developing advanced machine learning models, including transformers and generative AI, with a deep understanding of NLP, data processing pipelines, and autonomous driving algorithms. Proven ability to lead research projects, optimize AI environments, and collaborate effectively across multidisciplinary teams to refine product strategies and improve operational efficiency. Strong foundation in both theoretical and applied aspects of AI, with a passion for integrating emerging technologies into practical business solutions. Recognized for significant contributions to product development, anomaly detection systems, and multi-map pathfinding algorithms, backed by multiple awards and recognitions in AI and autonomous driving challenges.
1. Research Process and AI Development Environment Setup
1) I was one of the early members of Doodlin. Since its inception, the company had no one dedicated to ML/AI research, and the necessary environment and processes had not been established. This lack of infrastructure not only hampered work efficiency but also made it difficult to correct the research direction when it went astray.
2) To address this, I conducted ongoing paper reviews with our CTO, who had comparatively little background in AI/ML. I also communicated actively with POs/PMs about the potential impact and risks of AI/ML-based probabilistic models on our products, and about which plans were feasible and which were not. This collaboration let me gather perspectives on research direction from a variety of roles, which was instrumental in refining our research focus.
3) Furthermore, I designed task metrics closely tied to product performance, spanning low-level metrics such as F1 score and high-level metrics linked to actual user satisfaction. Although we have not yet reached MLOps-level automation, establishing a repeatable process for testing new models was a significant milestone.
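As a concrete illustration of the low-level side, the sketch below computes an entity-level F1 over extracted (field, value) pairs; the field names and values are hypothetical, not actual product data.

```python
# A minimal sketch of an entity-level F1 for key-information extraction,
# computed over (field, value) pairs; field names here are hypothetical.
def extraction_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 between predicted and annotated (field, value) pairs."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # exact-match true positives
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical extraction result vs. annotation for one resume.
pred = {("name", "Kim"), ("school", "KAIST"), ("company", "Acme")}
gold = {("name", "Kim"), ("school", "KAIST"), ("company", "Acme Inc.")}
print(round(extraction_f1(pred, gold), 3))  # 0.667
```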
2. 2D Structured LLM Research
- Resume Key Information Extraction
- Sensitive Information Masking within Resumes
1) Resumes and similar documents are harder for LLMs (Large Language Models) to understand than plain text, because their reading order and meaning can vary with the document's layout. To address this, I conducted research based on models such as LayoutLM and BROS.
2) Additionally, resumes often contain a large number of tokens, sometimes exceeding 4,000-8,000 in a single document. Traditional transformer-based models struggle with such documents because self-attention's computational and memory costs grow quadratically with sequence length. To mitigate this, I researched models based on Longformer. I also retrained the tokenizer to build a vocabulary better suited to resumes, reducing the number of tokens required (a sketch of this step follows this list).
3) Taking into account the specific characteristics of resumes and the previous research, I designed an original transformer model named "Wideformer," which was used to perform these tasks.
4) With these functionalities, HR managers can now upload resumes and register candidates without manually typing in the information. Moreover, the system allows for the extraction of structured data like work experience and education from unstructured resume data, enabling more effective candidate filtering and analysis.
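A minimal sketch of the tokenizer retraining described in 2), using Hugging Face's train_new_from_iterator; the base checkpoint, corpus file, and reuse of the original vocabulary size are assumptions for illustration, not the actual configuration.

```python
# Retrain a fast tokenizer on resume text so each document needs fewer tokens.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")  # assumed base

def resume_corpus():
    # Hypothetical plain-text dump of resumes, one document per line.
    with open("resumes.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# Learn a new subword vocabulary of the same size from the resume corpus.
resume_tok = base.train_new_from_iterator(resume_corpus(), vocab_size=base.vocab_size)

sample = "2019-2023 Machine Learning Engineer, B2B SaaS"
print(len(base.tokenize(sample)), len(resume_tok.tokenize(sample)))
resume_tok.save_pretrained("resume-tokenizer")
```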
3. Recommendation Algorithm
- Acceptance Rate Prediction
- Resume/Job Matching
1) I researched methods that learn the correlation between resumes and job postings by combining LLM-generated embeddings with Deep Metric Learning and ranking models, as sketched below. Although still in the experimental phase, we are preparing to implement these features in our products.
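A minimal sketch of one such setup, assuming precomputed LLM embeddings and a small projection head trained with a triplet margin loss; the dimensions, architecture, and hyperparameters are illustrative, not the experimental values.

```python
import torch
import torch.nn as nn

EMB_DIM, PROJ_DIM = 1024, 128  # assumed LLM embedding / metric-space sizes

# Projection head mapping frozen LLM embeddings into a learned metric space.
head = nn.Sequential(nn.Linear(EMB_DIM, 512), nn.ReLU(), nn.Linear(512, PROJ_DIM))
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# Dummy batch: resume embedding (anchor), a matching job posting (positive),
# and a non-matching posting (negative), all precomputed by an LLM encoder.
anchor, positive, negative = (torch.randn(32, EMB_DIM) for _ in range(3))

loss = criterion(head(anchor), head(positive), head(negative))
loss.backward()
optimizer.step()
print(float(loss))
```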
4. Generative AI
- Job Posting and Sourcing Message Generation
1) I fine-tuned the LLaMA 3 8B model using quantization and parameter-efficient techniques such as LoRA (a sketch of the setup follows this list). The resulting model successfully generates natural job postings and sourcing messages.
2) However, what customers desire are job postings that attract more candidates and sourcing messages that receive better responses. Therefore, I am currently conducting research to fine-tune the model using an objective function aligned with these goals.
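A minimal sketch of the quantized LoRA setup referenced in 1), using the Hugging Face transformers/peft/bitsandbytes stack; the rank, target modules, and other hyperparameters are assumptions, not the values actually used.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit fine-tuning on modest GPU memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto"
)

# Attach low-rank adapters to the attention projections; only these train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```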
5. Technology Internalization
- Detection of Personal and Sensitive Information
I conducted research on detecting sensitive information within resumes using token-level classification.
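A minimal sketch of the token-level classification framing, assuming a BIO label scheme and an off-the-shelf encoder; the checkpoint, labels, and example text are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-SENSITIVE", "I-SENSITIVE"]  # hypothetical BIO scheme
name = "bert-base-multilingual-cased"         # assumed encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

text = "Contact: 010-1234-5678"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # one label distribution per token
preds = logits.argmax(dim=-1)[0]
for tok, p in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), preds):
    print(tok, labels[int(p)])            # untrained head: output is illustrative
```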
- OCR
1) For resumes in image format, text is extracted using OCR. We initially used ClovaOCR, but switched to PaddleOCR (an open-source library) for text detection, which is based on the EAST model, because its accuracy was already high. For text recognition, we trained a custom model based on DTrOCR. A sketch of the two-stage pipeline follows this list.
2) The custom model matched ClovaOCR's accuracy at high resolutions and significantly outperformed it at low resolutions, improving accuracy from 88.45% to 98.51%. On more complex text, such as strings containing email addresses, accuracy rose from 23.08% to 97.64%.
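A minimal sketch of the two-stage pipeline referenced in 1), assuming the PaddleOCR 2.x Python API for detection; recognize() is a hypothetical stand-in for the custom DTrOCR-based recognizer.

```python
from paddleocr import PaddleOCR
from PIL import Image

detector = PaddleOCR(lang="korean")  # we use only its text-detection stage

def recognize(crop):
    """Placeholder for the custom DTrOCR-based text recognition model."""
    return "<recognized text>"

image_path = "resume_page.png"       # hypothetical input image
page = Image.open(image_path)
boxes = detector.ocr(image_path, rec=False)[0]  # quadrilaterals per text line
for quad in boxes:
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    crop = page.crop((min(xs), min(ys), max(xs), max(ys)))  # axis-aligned crop
    print(recognize(crop))
```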
1. Autonomous Driving Algorithm Development
- Multi Map Pathfinding Algorithm Development
1) Building on an existing low-level pathfinding algorithm, I developed the Multi Map functionality, which merges multiple maps according to their orientations and then finds a route across them. Although the basic idea came from a senior developer, I diagnosed and resolved coordinate misalignment issues that arose during the merging process; the sketch below illustrates the kind of frame transform involved.
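A minimal sketch of the frame transform underlying the merge, assuming each map carries an origin offset and orientation relative to a base map; the numbers are hypothetical.

```python
import math

def to_base_frame(x, y, dx, dy, theta):
    """Transform a point (x, y) from a map whose origin lies at (dx, dy) in
    the base map and whose axes are rotated by theta radians."""
    bx = dx + x * math.cos(theta) - y * math.sin(theta)
    by = dy + x * math.sin(theta) + y * math.cos(theta)
    return bx, by

# Map B's origin sits 10 m along the base x-axis, rotated 90 degrees; a point
# 2 m along B's x-axis lands at roughly (10, 2) in the base frame.
print(to_base_frame(2.0, 0.0, dx=10.0, dy=0.0, theta=math.pi / 2))
```

Errors in theta or the offsets produce exactly the kind of coordinate misalignment described above.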
2. Data Loading and Analysis
- Driving Data Loading Pipeline Setup
1) To ensure driving data logs were preserved when a robot went offline or lost power during operation, I wrote code that logged the driving data locally on the robot using system calls, and that guaranteed logs were neither duplicated nor lost when the robot rebooted (a sketch of the idea follows this list).
2) I set up a pipeline to aggregate this data into a driving data table and designed the schema for this table.
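A minimal sketch of the durability and dedup idea from 1): append each record with a sequence id and fsync so a power cut loses at most the in-flight record, and replay after reboot skips ids already uploaded. The path and record layout are assumptions.

```python
import json
import os

LOG_PATH = "/var/log/robot/driving.jsonl"  # hypothetical on-robot path

def append_record(record: dict) -> None:
    """Append one driving-data record; record['seq'] is a monotonically
    increasing sequence id assigned by the caller."""
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the line to disk before returning

def replay_unsent(last_uploaded_seq: int):
    """After reboot, yield only records not yet uploaded: nothing is lost
    (fsync) and nothing is sent twice (sequence-id cutoff)."""
    with open(LOG_PATH, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["seq"] > last_uploaded_seq:
                yield rec
```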
- Anomaly Detection Algorithm Based on Driving Data
1) I researched and compared applicable anomaly detection algorithms and implemented the most suitable ones, including IQR (interquartile range) filtering and LOF (Local Outlier Factor), on the driving data table described above (see the sketch below). This enabled Robot Operators to flag robots suspected of having issues.
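A minimal sketch of the two detectors on a single driving-data feature; the feature and the injected anomalies are synthetic.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

speeds = np.random.normal(1.0, 0.1, size=(200, 1))  # synthetic average speeds
speeds[:3] = 3.0                                    # inject anomalous robots

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(speeds, [25, 75])
iqr = q3 - q1
iqr_flags = (speeds < q1 - 1.5 * iqr) | (speeds > q3 + 1.5 * iqr)

# LOF: flag points whose local density is far below their neighbors'.
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(speeds) == -1

print(int(iqr_flags.sum()), int(lof_flags.sum()))
```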
PyTorch
Machine Learning
NLP
NumPy
Pandas
Python
Git
Elasticsearch
AWS
Software Maestro 12th Cohort, Ministry of Science and ICT
Method for Generating Verifiable Random Numbers, Patent Application