Dynamic and innovative Applied AI Scientist with over 3 years of experience in designing and implementing cutting-edge AI/ML solutions within the B2B SaaS domain. Adept at developing advanced machine learning models, including transformers and generative AI, with a deep understanding of NLP, data processing pipelines, and autonomous driving algorithms. Proven ability to lead research projects, optimize AI environments, and collaborate effectively across multidisciplinary teams to refine product strategies and improve operational efficiency. Strong foundation in both theoretical and applied aspects of AI, with a passion for integrating emerging technologies into practical business solutions. Recognized for significant contributions to product development, anomaly detection systems, and multi-map pathfinding algorithms, backed by multiple awards and recognitions in AI and autonomous driving challenges.
1. Research Process and AI Development Environment Setup
1) I was one of the early members of Doodlin. Since its inception, the company had no one dedicated to ML/AI research, and the necessary environment and processes had not been established. This lack of infrastructure not only hampered work efficiency but also made it difficult to correct the research direction when it went astray.
2) To address this, I conducted ongoing paper reviews with our CTO, who had comparatively little background in AI/ML. I also communicated actively with POs/PMs about the potential impact and risks of AI/ML-based probabilistic models on our products, and about which plans were feasible and which were not. This collaboration let me gather perspectives on research direction from a variety of roles, which was instrumental in refining our research focus.
3) Furthermore, I designed task metrics closely tied to product performance, spanning low-level metrics such as F1 score and high-level metrics linked to actual user satisfaction. Although we have not yet reached MLOps-level automation, establishing a repeatable process for testing new models was a significant milestone.
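As a concrete illustration of the low-level side, the sketch below computes an entity-level F1 over extracted (field, value) pairs; the field names and values are hypothetical, not actual product data.

```python
# A minimal sketch of an entity-level F1 for key-information extraction,
# computed over (field, value) pairs; field names here are hypothetical.
def extraction_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 between predicted and annotated (field, value) pairs."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # exact-match true positives
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical extraction result vs. annotation for one resume.
pred = {("name", "Kim"), ("school", "KAIST"), ("company", "Acme")}
gold = {("name", "Kim"), ("school", "KAIST"), ("company", "Acme Inc.")}
print(round(extraction_f1(pred, gold), 3))  # 0.667
```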
2. 2D Structured LLM Research
- Resume Key Information Extraction
- Sensitive Information Masking within Resumes
1) Resumes and similar documents are harder for LLMs (Large Language Models) to understand than plain text, because their reading order and meaning can vary with the document's layout. To address this, I conducted research based on models such as LayoutLM and BROS.
2) Additionally, resumes often contain a large number of tokens, sometimes exceeding 4,000-8,000 in a single document. Traditional transformer-based models struggle with such documents because self-attention's computational and memory costs grow quadratically with sequence length. To mitigate this, I researched models based on Longformer. I also retrained the tokenizer to build a vocabulary better suited to resumes, reducing the number of tokens required (a sketch of this step follows this list).
3) Taking into account the specific characteristics of resumes and the previous research, I designed an original transformer model named "Wideformer," which was used to perform these tasks.
4) With these functionalities, HR managers can now upload resumes and register candidates without manually typing in the information. Moreover, the system allows for the extraction of structured data like work experience and education from unstructured resume data, enabling more effective candidate filtering and analysis.
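A minimal sketch of the tokenizer retraining described in 2), using Hugging Face's train_new_from_iterator; the base checkpoint, corpus file, and reuse of the original vocabulary size are assumptions for illustration, not the actual configuration.

```python
# Retrain a fast tokenizer on resume text so each document needs fewer tokens.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")  # assumed base

def resume_corpus():
    # Hypothetical plain-text dump of resumes, one document per line.
    with open("resumes.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# Learn a new subword vocabulary of the same size from the resume corpus.
resume_tok = base.train_new_from_iterator(resume_corpus(), vocab_size=base.vocab_size)

sample = "2019-2023 Machine Learning Engineer, B2B SaaS"
print(len(base.tokenize(sample)), len(resume_tok.tokenize(sample)))
resume_tok.save_pretrained("resume-tokenizer")
```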
3. Recommendation Algorithm
- Acceptance Rate Prediction
- Resume/Job Matching
1) I researched methods that learn the correlation between resumes and job postings by combining LLM-generated embeddings with Deep Metric Learning and ranking models, as sketched below. Although still in the experimental phase, we are preparing to implement these features in our products.
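A minimal sketch of one such setup, assuming precomputed LLM embeddings and a small projection head trained with a triplet margin loss; the dimensions, architecture, and hyperparameters are illustrative, not the experimental values.

```python
import torch
import torch.nn as nn

EMB_DIM, PROJ_DIM = 1024, 128  # assumed LLM embedding / metric-space sizes

# Projection head mapping frozen LLM embeddings into a learned metric space.
head = nn.Sequential(nn.Linear(EMB_DIM, 512), nn.ReLU(), nn.Linear(512, PROJ_DIM))
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# Dummy batch: resume embedding (anchor), a matching job posting (positive),
# and a non-matching posting (negative), all precomputed by an LLM encoder.
anchor, positive, negative = (torch.randn(32, EMB_DIM) for _ in range(3))

loss = criterion(head(anchor), head(positive), head(negative))
loss.backward()
optimizer.step()
print(float(loss))
```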
4. Generative AI
- Job Posting and Sourcing Message Generation
1) I fine-tuned the LLaMA 3 8B model using quantization and parameter-efficient techniques such as LoRA (a sketch of the setup follows this list). The resulting model successfully generates natural job postings and sourcing messages.
2) However, what customers desire are job postings that attract more candidates and sourcing messages that receive better responses. Therefore, I am currently conducting research to fine-tune the model using an objective function aligned with these goals.
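A minimal sketch of the quantized LoRA setup referenced in 1), using the Hugging Face transformers/peft/bitsandbytes stack; the rank, target modules, and other hyperparameters are assumptions, not the values actually used.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit fine-tuning on modest GPU memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto"
)

# Attach low-rank adapters to the attention projections; only these train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```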
5. Technology Internalization
- Detection of Personal and Sensitive Information
I conducted research on detecting sensitive information within resumes using token-level classification.
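A minimal sketch of the token-level classification framing, assuming a BIO label scheme and an off-the-shelf encoder; the checkpoint, labels, and example text are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-SENSITIVE", "I-SENSITIVE"]  # hypothetical BIO scheme
name = "bert-base-multilingual-cased"         # assumed encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

text = "Contact: 010-1234-5678"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # one label distribution per token
preds = logits.argmax(dim=-1)[0]
for tok, p in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), preds):
    print(tok, labels[int(p)])            # untrained head: output is illustrative
```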
- OCR
1) For resumes in image format, text is extracted using OCR. We initially used ClovaOCR, but switched to PaddleOCR (an open-source library) for text detection, which is based on the EAST model, because its accuracy was already high. For text recognition, we trained a custom model based on DTrOCR. A sketch of the two-stage pipeline follows this list.
2) The custom model matched ClovaOCR's accuracy at high resolutions and significantly outperformed it at low resolutions, improving accuracy from 88.45% to 98.51%. On more complex text, such as strings containing email addresses, accuracy rose from 23.08% to 97.64%.
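A minimal sketch of the two-stage pipeline referenced in 1), assuming the PaddleOCR 2.x Python API for detection; recognize() is a hypothetical stand-in for the custom DTrOCR-based recognizer.

```python
from paddleocr import PaddleOCR
from PIL import Image

detector = PaddleOCR(lang="korean")  # we use only its text-detection stage

def recognize(crop):
    """Placeholder for the custom DTrOCR-based text recognition model."""
    return "<recognized text>"

image_path = "resume_page.png"       # hypothetical input image
page = Image.open(image_path)
boxes = detector.ocr(image_path, rec=False)[0]  # quadrilaterals per text line
for quad in boxes:
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    crop = page.crop((min(xs), min(ys), max(xs), max(ys)))  # axis-aligned crop
    print(recognize(crop))
```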
1. Autonomous Driving Algorithm Development
- Multi Map Pathfinding Algorithm Development
1) Building on an existing low-level pathfinding algorithm, I developed the Multi Map functionality, which merges multiple maps according to their orientations and then finds a route across them. Although the basic idea came from a senior developer, I diagnosed and resolved coordinate misalignment issues that arose during the merging process; the sketch below illustrates the kind of frame transform involved.
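A minimal sketch of the frame transform underlying the merge, assuming each map carries an origin offset and orientation relative to a base map; the numbers are hypothetical.

```python
import math

def to_base_frame(x, y, dx, dy, theta):
    """Transform a point (x, y) from a map whose origin lies at (dx, dy) in
    the base map and whose axes are rotated by theta radians."""
    bx = dx + x * math.cos(theta) - y * math.sin(theta)
    by = dy + x * math.sin(theta) + y * math.cos(theta)
    return bx, by

# Map B's origin sits 10 m along the base x-axis, rotated 90 degrees; a point
# 2 m along B's x-axis lands at roughly (10, 2) in the base frame.
print(to_base_frame(2.0, 0.0, dx=10.0, dy=0.0, theta=math.pi / 2))
```

Errors in theta or the offsets produce exactly the kind of coordinate misalignment described above.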
2. Data Loading and Analysis
- Driving Data Loading Pipeline Setup
1) To ensure driving data logs were preserved when a robot went offline or lost power during operation, I wrote code that logged the driving data locally on the robot using system calls, and that guaranteed logs were neither duplicated nor lost when the robot rebooted (a sketch of the idea follows this list).
2) I set up a pipeline to aggregate this data into a driving data table and designed the schema for this table.
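A minimal sketch of the durability and dedup idea from 1): append each record with a sequence id and fsync so a power cut loses at most the in-flight record, and replay after reboot skips ids already uploaded. The path and record layout are assumptions.

```python
import json
import os

LOG_PATH = "/var/log/robot/driving.jsonl"  # hypothetical on-robot path

def append_record(record: dict) -> None:
    """Append one driving-data record; record['seq'] is a monotonically
    increasing sequence id assigned by the caller."""
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the line to disk before returning

def replay_unsent(last_uploaded_seq: int):
    """After reboot, yield only records not yet uploaded: nothing is lost
    (fsync) and nothing is sent twice (sequence-id cutoff)."""
    with open(LOG_PATH, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["seq"] > last_uploaded_seq:
                yield rec
```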
- Anomaly Detection Algorithm Based on Driving Data
1) I researched and compared applicable anomaly detection algorithms and implemented the most suitable ones, including IQR (interquartile range) filtering and LOF (Local Outlier Factor), on the driving data table described above (see the sketch below). This enabled Robot Operators to flag robots suspected of having issues.
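A minimal sketch of the two detectors on a single driving-data feature; the feature and the injected anomalies are synthetic.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

speeds = np.random.normal(1.0, 0.1, size=(200, 1))  # synthetic average speeds
speeds[:3] = 3.0                                    # inject anomalous robots

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(speeds, [25, 75])
iqr = q3 - q1
iqr_flags = (speeds < q1 - 1.5 * iqr) | (speeds > q3 + 1.5 * iqr)

# LOF: flag points whose local density is far below their neighbors'.
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(speeds) == -1

print(int(iqr_flags.sum()), int(lof_flags.sum()))
```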
PyTorch
Machine Learning
NLP
NumPy
Pandas
Python
Git
Elasticsearch
AWS
Software Maestro 12th Cohort, Ministry of Science and ICT
Method for Generating Verifiable Random Numbers, Patent Application