A data-driven exploration of the GenAI/LLM job market for science and engineering roles in January 2025. I scrape ~1000 job postings from ai-jobs.net, perform data extraction and classification using LLMs, and then analyze the data to identify patterns and insights about the GenAI/LLM job market, including salary ranges, skill requirements, and role distribution.
As someone currently working in a “Prompt Engineering” role, I’ve been thinking a lot about how this title basically doesn’t exist outside of a handful of companies, how the title communicates a narrow range of skills and responsibilities, and how the work that I do day-to-day is much larger in scope than just writing prompts. I identify more as an AI Engineer or AI Research Scientist, so I was interested to see what I could learn about other similar roles that work with GenAI and LLMs.
So with that motivation, I set out to collect some data and look at the job market for these sorts of roles: what responsibilities and skills are being advertised most often, what kinds of titles are being used, and what kind of compensation is being offered at different levels of seniority?
To accomplish this, I built a custom web scraper to collect ~1000 job postings from ai-jobs.net. My code gathers job details (title, company, location, salary, and posting date) from a range of U.S. and Canadian cities, covering entry-level, mid-level, and senior positions. I then use a Large Language Model (LLM) to extract each job’s key responsibilities, required skills, and qualifications. Afterwards, I classify each position by whether it involves working with Generative AI or large language models, and if so, categorize it further into four major AI roles: (1) AI Research Scientist, (2) AI/ML Engineer, (3) MLOps/AI Infrastructure Engineer, and (4) AI Solution Architect. I believe this set of roles is a good representation of the different areas of focus.
Finally, I integrate the various metadata and classifications into a comprehensive dataset. I observe that GenAI/LLM positions command consistently high salary ranges across the four roles, particularly at more senior levels. Senior-level roles tend to offer median salaries in the $195K–$210K range, while mid-level roles generally cluster around $165K–$180K. Entry-level salaries show greater variation (likely due to the small sample size) but still land in competitive ranges of roughly $155K–$205K. These roles share many technical demands, like proficiency with large-scale model training, distributed computing, and LLM-specific knowledge, though each emphasizes distinct priorities (research vs. production, for example).
Of course, this analysis is not without limitations. I am relying on a single job board, and I scraped jobs during a limited window of time. I also have not rigorously validated the LLM classifications, although I have implemented many prompt engineering best practices and used some of the more powerful LLMs available (GPT-4o and o1). To some extent, the fact that the responsibilities and skills recovered from the postings fit naturally into the four pre-defined roles does speak to the relative accuracy of the classifications. The salary analysis also does not distinguish between base salary and total compensation, remote vs. in-person opportunities, large vs. small companies, and so on. But overall, I think this gives some sense of the job market for these roles.
Scraping job postings
I’ll scrape job postings from ai-jobs.net. This is an aggregator that specializes in AI jobs of all kinds, sourcing jobs from over 60 countries, and in my experience, it does a pretty good job.
The code below implements a web scraper for job postings from ai-jobs.net. It collects job postings from major cities in the US and Canada, searching across entry-level, mid-level, and senior positions. For each job, it gets the title, company, location, salary, description, and posting date. The scraper saves each job as a JSON file and keeps track of what it has already scraped to avoid duplicates. It includes error handling and logging to track any problems that occur during scraping.
Click to view the Job Posting Scraper code
from dataclasses import dataclass
from typing import List, Dict, Optional, Set
from datetime import datetime, timedelta
import logging
import json
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import re
import time


@dataclass
class JobData:
    """Structured container for job posting data"""
    title: str
    company: str
    location: str
    level: Optional[str]
    salary: Optional[str]
    url: str
    description: str
    scraped_date: str
    posted_date: Optional[str]
    raw_data: str


class ScrapeConfig:
    """Configuration settings for the scraper"""
    EXPERIENCE_LEVELS = ['EN', 'MI', 'SE']  # Entry, Mid, Senior
    CITIES = {
        '5391959': 'San Francisco',
        '5128581': 'New York City',
        '6167865': 'Toronto',
        '6173331': 'Vancouver',
        '5809844': 'Seattle',
        '4671654': 'Austin',
        '4930956': 'Boston',
        '5': 'Region',
    }
    CATEGORIES = {
        '1': 'Research',
        '2': 'Engineering',
        '18': 'GenerativeAI',
    }
    BASE_URL = "https://ai-jobs.net"


class SalaryExtractor:
    """Handles salary extraction logic"""

    @staticmethod
    def extract_from_schema(schema_data: Dict) -> Optional[str]:
        """Extract salary from schema.org data"""
        try:
            if schema_data and 'baseSalary' in schema_data:
                base_salary = schema_data.get('baseSalary', {}).get('value', {})
                if base_salary:
                    min_value = base_salary.get('minValue')
                    max_value = base_salary.get('maxValue')
                    currency = base_salary.get('currency', 'USD')  # Default to USD if not specified
                    # Note the space after the currency code; downstream parsing
                    # strips "USD " / "CAD " prefixes.
                    if min_value and max_value:
                        return f"{currency} {float(min_value)/1000:.0f}K - {float(max_value)/1000:.0f}K"
                    elif min_value:
                        return f"{currency} {float(min_value)/1000:.0f}K+"
        except Exception as e:
            logging.debug(f"Error extracting schema salary: {e}")
        return None

    @staticmethod
    def extract_from_text(description: str) -> Optional[str]:
        """Extract salary from description text"""
        patterns = [
            r'(?:salary range.*?)(?:CAD|\$)?([\d,]+)\s*-\s*(?:CAD|\$)?([\d,]+)',
            r'(?:salary.*?)(?:CAD|\$)?([\d,]+)(?:\s*-\s*(?:CAD|\$)?([\d,]+))?',
        ]
        for pattern in patterns:
            if match := re.search(pattern, description, re.IGNORECASE):
                try:
                    groups = match.groups()
                    # Check if salary is in CAD
                    currency = 'CAD' if 'CAD' in description.upper() else 'USD'
                    if len(groups) == 2 and groups[1]:  # Range format
                        min_sal = float(groups[0].replace(',', ''))
                        max_sal = float(groups[1].replace(',', ''))
                        return f"{currency} {min_sal/1000:.0f}K - {max_sal/1000:.0f}K"
                    elif groups[0]:  # Single value format
                        base_sal = float(groups[0].replace(',', ''))
                        return f"{currency} {base_sal/1000:.0f}K+"
                except ValueError:
                    continue
        return None

    @classmethod
    def extract(cls, soup: BeautifulSoup, description: str, schema_data: Dict) -> Optional[str]:
        """Main salary extraction method"""
        # Try schema data first
        if salary := cls.extract_from_schema(schema_data):
            return salary
        # Try description text
        if salary := cls.extract_from_text(description):
            return salary
        # Try salary badge
        if salary_badge := soup.find('span', class_='badge rounded-pill text-bg-success'):
            salary_text = salary_badge.text.strip()
            if re.search(r'(USD|\$|\d)', salary_text):
                return salary_text
        return None


class JobPageParser:
    """Handles parsing of individual job pages"""

    def __init__(self, html: str):
        self.soup = BeautifulSoup(html, 'html.parser')
        self.raw_html = html

    def parse_schema_data(self) -> Dict:
        """Parse schema.org JSON-LD data"""
        if script := self.soup.find('script', type='application/ld+json'):
            try:
                cleaned_script = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', script.string)
                return json.loads(cleaned_script)
            except Exception as e:
                logging.warning(f"Could not parse JSON-LD data: {e}")
        return {}

    def parse_job_data(self, url: str) -> JobData:
        """Extract all job data from the page"""
        schema_data = self.parse_schema_data()
        scraped_date = datetime.now().isoformat()

        # Get company name from schema or fall back to page element
        company = schema_data.get('hiringOrganization', {}).get('name')
        if not company:
            if company_elem := self.soup.find('a', class_='company-name'):
                company = company_elem.text.strip()

        # Get job level from meta description
        level = None
        if meta_desc := self.soup.find('meta', {'name': 'description'}):
            if level_match := re.search(r'a ([\w-]+level)', meta_desc['content']):
                level = level_match.group(1).replace('-', '-level / ').title()

        description = self.soup.find('div', class_='job-description-text').text.strip()
        return JobData(
            title=self.soup.find('h1', class_='display-5').text.strip(),
            company=company,
            location=self.soup.find('h3', class_='lead').text.strip(),
            level=level,
            salary=SalaryExtractor.extract(self.soup, description, schema_data),
            url=url,
            description=description,
            scraped_date=scraped_date,
            posted_date=self.calculate_posted_date(scraped_date),
            raw_data=self.raw_html,
        )

    def calculate_posted_date(self, scraped_date_str: str) -> Optional[str]:
        """Calculate posting date from 'Posted X time ago' text"""
        if match := re.search(r'Posted (\d+) (hours?|days?|weeks?|months?) ago', self.raw_html):
            number = int(match.group(1))
            unit = match.group(2)
            scraped_date = datetime.fromisoformat(scraped_date_str)
            delta = {
                'hour': timedelta(hours=number),
                'day': timedelta(days=number),
                'week': timedelta(weeks=number),
                'month': timedelta(days=number * 30),  # Approximate
            }.get(unit.rstrip('s'))
            if delta:
                return (scraped_date - delta).isoformat()
        return None


class AIJobScraper:
    """Main scraper orchestration class"""

    def __init__(self, output_dir: str = 'json_data'):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.session = requests.Session()
        self.config = ScrapeConfig()
        self.existing_jobs = self._load_existing_jobs()

    def _load_existing_jobs(self) -> Set[str]:
        """Load set of existing job URLs"""
        existing_jobs = set()
        for f in self.output_dir.glob('*.json'):
            try:
                data = json.loads(f.read_text())
                existing_jobs.add(data['url'])
            except Exception as e:
                logging.warning(f"Error reading {f}: {e}")
        return existing_jobs

    def _generate_search_urls(self) -> List[str]:
        """Generate all search URL combinations"""
        urls = []
        for exp in self.config.EXPERIENCE_LEVELS:
            for cat in self.config.CATEGORIES:
                for city in self.config.CITIES:
                    url = f"{self.config.BASE_URL}/?"
                    url += f"cat={cat}"
                    url += f"&{'reg' if city == '5' else 'cit'}={city}"
                    url += f"&typ=1&key=&exp={exp}&sal="
                    urls.append(url)
        return urls

    def get_job_urls(self, search_url: str) -> List[str]:
        """Get all job URLs from a search page"""
        try:
            response = self.session.get(search_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            if job_list := soup.find('ul', id='job-list'):
                return [
                    f"{self.config.BASE_URL}{link['href']}"
                    for link in job_list.find_all('a', href=lambda h: h and '/job/' in h)
                ]
        except Exception as e:
            logging.error(f"Error fetching job URLs: {e}")
        return []

    def scrape_job_details(self, url: str) -> Optional[JobData]:
        """Scrape details from a job page"""
        try:
            response = self.session.get(url)
            response.raise_for_status()
            parser = JobPageParser(response.text)
            return parser.parse_job_data(url)
        except Exception as e:
            logging.error(f"Error scraping job {url}: {e}")
            return None

    def save_job(self, job: JobData) -> None:
        """Save job data to JSON file"""
        try:
            safe_title = re.sub(r'[^\w\s-]', '', job.title)
            safe_company = re.sub(r'[^\w\s-]', '', job.company or 'unknown')
            filename = f"{safe_company}_{safe_title}_{hash(job.url)}.json"
            self.output_dir.joinpath(filename).write_text(
                json.dumps(vars(job), indent=2, ensure_ascii=False)
            )
            logging.info(f"Saved job: {job.title} at {job.company}")
        except Exception as e:
            logging.error(f"Error saving job: {e}")
            raise

    def scrape_jobs(self, test: bool = False) -> Dict[str, int]:
        """Main scraping method"""
        jobs_found = jobs_skipped = 0
        for search_url in self._generate_search_urls():
            logging.debug(f"Processing search URL: {search_url}")
            job_urls = self.get_job_urls(search_url)
            if test and job_urls:
                job_urls = job_urls[:1]
            for url in job_urls:
                if url in self.existing_jobs:
                    jobs_skipped += 1
                    continue
                if job_data := self.scrape_job_details(url):
                    self.save_job(job_data)
                    self.existing_jobs.add(url)
                    jobs_found += 1
                time.sleep(1)  # Rate limiting
        logging.info(f"Found {jobs_found} new jobs. Skipped {jobs_skipped} existing jobs.")
        return {'jobs_found': jobs_found, 'jobs_skipped': jobs_skipped}


def scrape_jobs(test: bool = False, verbose: bool = False) -> Dict[str, int]:
    """Convenience function to run the scraper"""
    if not verbose:
        logging.getLogger().setLevel(logging.WARNING)
    try:
        scraper = AIJobScraper(output_dir=Path.cwd() / 'json_data')
        return scraper.scrape_jobs(test=test)
    except Exception as e:
        logging.error(f"Scraper failed: {e}")
        raise
For the classification tasks, I’ll write a general-purpose LLM completion function. This function takes a prompt and a model as parameters, and returns the completion from the OpenAI API.
from openai import AsyncOpenAI
import os


async def get_completion(prompt: str, model: str = "gpt-4o-2024-08-06") -> str:
    """Get a completion from the OpenAI API."""
    client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY'))  # Initialize with API key from env vars
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
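Since Jupyter supports top-level await, a quick smoke test might look like this (the prompt here is just an illustrative placeholder):

# Minimal sanity check in a notebook cell; the prompt is hypothetical.
reply = await get_completion("Reply with the single word: ready")
print(reply)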
Extract Responsibilities, Skills, and Qualifications
The code below uses an LLM to extract key information from job postings that will be used to classify them into different role categories. It uses OpenAI’s GPT-4o model to analyze job descriptions and extract three key components: responsibilities, skills, and qualifications. The system processes multiple jobs concurrently with rate limiting, saves the extracted data as JSON files, and includes retry logic for handling API rate limits. The code uses asyncio for concurrent processing and includes error handling and logging. It also checks for previously processed jobs to avoid duplicate work.
Also included is code that implements data preprocessing steps. This code has functions to load JSON files into a pandas DataFrame, uses a list of keywords to filter out certain jobs that are not relevant to the analysis, and removes duplicate or highly similar job descriptions using TF-IDF vectorization and cosine similarity. The main function process_job_listings() combines these steps, taking a directory path, similarity threshold, and filter keywords as parameters. It returns a dictionary containing the processed DataFrame along with counts of the original and filtered entries.
Click to view the code for LLM extraction
import os
import json
import logging
import asyncio
import hashlib
import re
import nest_asyncio
from typing import Dict, List, Optional
import pandas as pd
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import AsyncOpenAI, RateLimitError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Enable nested event loops for Jupyter notebooks
nest_asyncio.apply()

# Set up logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

PROMPTS = {
    "prompt": """
You will be given a Job Description. Your task is to extract a list of Responsibilities, Skills, and Qualifications.
- Responsibilities are the tasks and activities that the job requires the employee to perform.
- Skills are the abilities and knowledge that the employee needs to have to perform the responsibilities.
- Qualifications are the requirements that the employee needs to meet to be considered for the job.

<JobDescription>
{job_description}
</JobDescription>

Return a list of Responsibilities, Skills, and Qualifications as follows:
<Responsibilities>
[Bullet point list of responsibilities]
</Responsibilities>
<Skills>
[Bullet point list of skills]
</Skills>
<Qualifications>
[Bullet point list of qualifications]
</Qualifications>
"""
}


class JobProcessor:
    def __init__(self, output_dir='json_extracted_data'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        self.semaphore = asyncio.Semaphore(3)
        self.processed = 0
        self.skipped = 0

    def _parse_llm_response(self, response: str) -> Dict[str, List[str]]:
        """Parse LLM response into structured data."""
        result = {
            "responsibilities": [],
            "skills": [],
            "qualifications": []
        }
        sections = {
            "responsibilities": r'<Responsibilities>\n(.*?)\n</Responsibilities>',
            "skills": r'<Skills>\n(.*?)\n</Skills>',
            "qualifications": r'<Qualifications>\n(.*?)\n</Qualifications>'
        }
        for section, pattern in sections.items():
            if match := re.search(pattern, response, re.DOTALL):
                result[section] = [
                    item.strip('- ') for item in match.group(1).strip().split('\n')
                    if item.strip('- ')  # Filter out empty items
                ]
        return result

    async def extract_job_details(self, job_id: str, job_description: str) -> Dict[str, List[str]]:
        """Extract structured information from job description using LLM."""
        try:
            prompt = PROMPTS["prompt"].format(job_description=job_description)
            response = await get_completion(prompt)
            result = self._parse_llm_response(response)
            # Save results
            json_path = os.path.join(self.output_dir, f'{job_id}.json')
            with open(json_path, 'w', encoding='utf-8') as f:
                json.dump(result, f, indent=2, ensure_ascii=False)
            return result
        except Exception as e:
            logging.error(f"Error extracting job details: {str(e)}")
            return {"responsibilities": [], "skills": [], "qualifications": []}

    async def process_single_job(self, job_id: str, description: str) -> Dict[str, List[str]]:
        """Process a single job with caching."""
        json_path = os.path.join(self.output_dir, f'{job_id}.json')
        if os.path.exists(json_path):
            self.skipped += 1
            logging.debug(f"Skipping job ID {job_id} - already processed")
            with open(json_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        async with self.semaphore:
            self.processed += 1
            logging.info(f"Processing job ID: {job_id}")
            return await self.extract_job_details(job_id, description)

    async def process_jobs(self, df: pd.DataFrame) -> List[Dict[str, List[str]]]:
        """Process multiple jobs concurrently."""
        tasks = [
            self.process_single_job(str(row['id']), row['description'])
            for _, row in df.iterrows()
        ]
        results = await asyncio.gather(*tasks)
        print(f"\nProcessing complete:")
        print(f"- New jobs processed: {self.processed}")
        print(f"- Skipped jobs (already processed): {self.skipped}")
        return results


class DataPreprocessor:
    def __init__(self, similarity_threshold=0.8):
        self.similarity_threshold = similarity_threshold
        self.default_filter_keywords = [
            'Dir ', 'Director', 'Intern', 'Data Scientist', 'Data Science',
            'Content Writer', 'Faculty', 'Product Owner', 'Manager',
            'Analyst', 'Postdoctoral', 'Postdoc', 'Summer'
        ]

    @staticmethod
    def load_json_files(directory='json_data') -> pd.DataFrame:
        """Load JSON files into DataFrame with error handling."""
        df = pd.DataFrame()
        if not os.path.exists(directory):
            logging.error(f"Directory {directory} does not exist")
            return df
        for file in [f for f in os.listdir(directory) if f.endswith('.json')]:
            try:
                with open(os.path.join(directory, file), 'r') as f:
                    data = json.load(f)
                df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)
            except Exception as e:
                logging.error(f"Error loading {file}: {e}")
        if not df.empty:
            df['id'] = df['url'].apply(
                lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
            )
        return df

    def filter_by_title(self, df: pd.DataFrame, filter_keywords: Optional[List[str]] = None) -> pd.DataFrame:
        """Filter DataFrame by job titles."""
        if df.empty:
            return df
        keywords = filter_keywords or self.default_filter_keywords
        return df[~df['title'].str.contains('|'.join(keywords), case=False, na=False)]

    def remove_similar_descriptions(self, df: pd.DataFrame) -> pd.DataFrame:
        """Remove similar job descriptions using TF-IDF and cosine similarity."""
        if df.empty:
            return df
        tfidf = TfidfVectorizer(stop_words='english')
        tfidf_matrix = tfidf.fit_transform(df['description'].fillna(''))
        cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
        # Positions (not index labels) of rows that duplicate an earlier row
        indices_to_remove = {
            j for i in range(len(df))
            for j in range(i + 1, len(df))
            if cosine_sim[i][j] > self.similarity_threshold
        }
        # Filter positionally, since the similarity matrix is position-based
        keep_positions = [i for i in range(len(df)) if i not in indices_to_remove]
        return df.iloc[keep_positions]


def extract_jobs_from_dataframe(df: pd.DataFrame) -> List[Dict[str, List[str]]]:
    """Wrapper function to process jobs from a DataFrame."""
    processor = JobProcessor()
    loop = asyncio.get_event_loop()
    if loop.is_running():
        return loop.create_task(processor.process_jobs(df))
    else:
        return asyncio.run(processor.process_jobs(df))


def process_job_listings(directory='json_data', similarity_threshold=0.8, filter_keywords=None) -> Dict:
    """Main function to process job listings."""
    preprocessor = DataPreprocessor(similarity_threshold)

    # Load and preprocess data
    df = preprocessor.load_json_files(directory)
    initial_count = len(df)
    if df.empty:
        return {
            'processed_data': df,
            'original_count': 0,
            'filtered_count': 0
        }
    df = preprocessor.filter_by_title(df, filter_keywords)
    df_filtered = preprocessor.remove_similar_descriptions(df)
    return {
        'processed_data': df_filtered,
        'original_count': initial_count,
        'filtered_count': len(df_filtered)
    }
results = process_job_listings()
df = results['processed_data']
print(f'Original count of listings: {results["original_count"]}')
print(f'Filtered count of listings: {results["filtered_count"]}')
extract_jobs_from_dataframe(df)
Original count of listings: 999
Filtered count of listings: 680
<Task pending name='Task-1' coro=<JobProcessor.process_jobs() running at /tmp/ipykernel_1343524/2751672584.py:118>>
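The pending-task output above is expected in a notebook: because the event loop is already running, extract_jobs_from_dataframe() returns a scheduled asyncio.Task rather than blocking. To make the cell wait until extraction finishes, the returned task can be awaited; re-running is safe, since already-processed jobs are cached:

# Await the scheduled task so extraction completes before moving on
# (top-level await works in Jupyter).
task = extract_jobs_from_dataframe(df)
extracted = await task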
Classification
Classify jobs as relevant to GenAI/LLM work or not
Click to view the code for job classification as GenAI/LLM relevant or not
import os
import re
import json
import asyncio
import logging
import nest_asyncio
from typing import List, Dict, Set
from bs4 import BeautifulSoup
import pandas as pd
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import AsyncOpenAI, RateLimitError


class JobClassifierGenAI:
    """Handles classification of jobs for GenAI/LLM work"""

    PROMPT = """
You will be given a list of Responsibilities and Skills listed for a job.
Your task is to determine if the job involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)).

<Job>
<Responsibilities>
{responsibilities}
</Responsibilities>
<Skills>
{skills}
</Skills>
</Job>

Start by thinking step-by-step about the Job and its Responsibilities and Skills, and whether it involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)).

Return your response in the following format:
<Analysis>
[Your analysis of the job and its Responsibilities and Skills]
</Analysis>
<FinalAnswer>
true|false
</FinalAnswer>
"""

    def __init__(self, output_dir='role_genai_classifications', batch_size=3):
        self.output_dir = output_dir
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(batch_size)
        os.makedirs(output_dir, exist_ok=True)

    async def classify_job(self, job: Dict) -> Dict:
        """Classify a single job listing"""
        logging.info(f"Classifying job '{job['filename']}'")
        prompt = self.PROMPT.format(
            responsibilities=job['responsibilities'],
            skills=job.get('skills', '')
        )
        try:
            response = await get_completion(prompt)
            await asyncio.sleep(2)  # Rate limiting
        except Exception as e:
            logging.error(f"Failed to classify job '{job['filename']}': {str(e)}")
            return {
                **job,
                'analysis': f'Failed to process: {str(e)}',
                'is_genai_role': None
            }
        return self._parse_response(job, response)

    def _parse_response(self, job: Dict, response: str) -> Dict:
        """Parse LLM response into structured format"""
        soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml')
        analysis = soup.find('Analysis')
        final_answer = soup.find('FinalAnswer')
        is_genai_role = None
        if final_answer:
            answer_text = final_answer.text.strip().lower()
            is_genai_role = True if answer_text == 'true' else False if answer_text == 'false' else None
        return {
            **job,
            'analysis': analysis.text.strip() if analysis else '',
            'is_genai_role': is_genai_role
        }

    def save_classification(self, job_id: str, result: Dict) -> None:
        """Save classification results to file"""
        filename = os.path.join(self.output_dir, f"{job_id}.json")
        with open(filename, 'w') as f:
            json.dump(result, f, indent=2)
        logging.info(f"Saved classification for job {job_id}")

    def get_classified_jobs(self) -> Set[str]:
        """Get set of already classified job IDs"""
        if not os.path.exists(self.output_dir):
            return set()
        return {f[:-5] for f in os.listdir(self.output_dir) if f.endswith('.json')}

    async def process_jobs_batch(self, jobs: List[Dict]) -> None:
        """Process a batch of jobs concurrently"""
        async def process_with_semaphore(job: Dict) -> None:
            async with self.semaphore:
                result = await self.classify_job(job)
                job_id = str(result['filename'])
                self.save_classification(job_id, result)
        await asyncio.gather(*[process_with_semaphore(job) for job in jobs])

    async def classify_jobs_async(self, df: pd.DataFrame) -> None:
        """Process all unclassified jobs in the DataFrame"""
        total_jobs = len(df)
        logging.info(f"Starting classification of {total_jobs} jobs")
        classified_jobs = self.get_classified_jobs()
        jobs_to_process = [
            job.to_dict() for idx, job in df.iterrows()
            if str(job['filename']) not in classified_jobs
        ]
        await self.process_jobs_batch(jobs_to_process)
        logging.info(f"Completed classification of all {total_jobs} jobs")

    def classify_jobs(self, df: pd.DataFrame) -> None:
        """Main entry point for job classification"""
        if df.empty:
            logging.warning("Empty DataFrame provided")
            return
        if 'filename' not in df.columns:
            logging.error("DataFrame missing required 'filename' column")
            return
        classified_jobs = self.get_classified_jobs()
        logging.info(f"Found {len(classified_jobs)} previously classified jobs")
        new_jobs = df[~df['filename'].isin(classified_jobs)]
        if new_jobs.empty:
            logging.info("No new jobs to classify")
            return
        logging.info(f"Processing {len(new_jobs)} new jobs")
        logging.info(f"Skipping {len(df) - len(new_jobs)} existing jobs")
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.classify_jobs_async(new_jobs))


class JobDataLoader:
    """Handles loading and preprocessing of job data"""

    @staticmethod
    def read_json_files(json_dir='json_extracted_data') -> List[Dict]:
        """Read job data from JSON files"""
        result = []
        for filename in os.listdir(json_dir):
            if filename.endswith('.json'):
                file_path = os.path.join(json_dir, filename)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                    name = filename[:-5]
                    if 'responsibilities' in data and 'skills' in data:
                        result.append({
                            'filename': name,
                            'responsibilities': data['responsibilities'],
                            'skills': data['skills']
                        })
                except json.JSONDecodeError:
                    logging.error(f"Invalid JSON in {filename}")
                except Exception as e:
                    logging.error(f"Error processing {filename}: {str(e)}")
        return result

    @staticmethod
    def load_classifications(input_dir='role_genai_classifications') -> pd.DataFrame:
        """Load classification results into DataFrame"""
        if not os.path.exists(input_dir):
            return pd.DataFrame()
        all_results = []
        for filename in os.listdir(input_dir):
            if filename.endswith('.json'):
                with open(os.path.join(input_dir, filename), 'r') as f:
                    classification = json.load(f)
                all_results.append(classification)
        return pd.DataFrame(all_results)
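A minimal usage sketch, assuming the extracted files from the previous step are in json_extracted_data/:

# Load the extracted responsibilities/skills and classify anything not yet
# processed; results are cached as JSON in role_genai_classifications/.
jobs_df = pd.DataFrame(JobDataLoader.read_json_files('json_extracted_data'))
JobClassifierGenAI().classify_jobs(jobs_df)

# Inspect the resulting labels
classifications = JobDataLoader.load_classifications()
print(classifications['is_genai_role'].value_counts(dropna=False))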
Classify GenAI/LLM jobs into pre-defined categories
For this next classification task, I’ll make the assumption that there are four types of AI engineering and science roles that are relevant to work with GenAI systems. These are:
AI Research Scientist
AI/ML Engineer
MLOps / AI Infrastructure Engineer
AI Solution Architect
I’ll also include other categories that are not themselves of interest, but that may improve classification accuracy by giving the model realistic alternatives to choose from. These are:
Data Scientist
Data Engineer
Product Manager
Software Engineer
The code below implements an automated job classification system that uses an LLM to categorize job postings into the eight predefined roles listed above (four GenAI-focused and four related roles). It consists of several classes that work together: JobClassifier handles the core classification logic by comparing job descriptions against detailed role templates, JobData and ClassificationResult provide structured data containers, and JobProcessor manages the overall pipeline from loading jobs to saving results. The system processes jobs concurrently using asyncio, includes error handling and rate limiting, and outputs both an analysis explaining the classification and a final numerical category for each job (0 when no role fits, 1-8 otherwise), with all results saved as JSON files for further analysis.
Definitions of the roles can be found in the JOB_DESCRIPTIONS variable.
Click to view the code for job classification into pre-defined roles
from dataclasses import dataclass
from typing import List, Dict, Set, Optional
import logging
import json
import asyncio
import os
from bs4 import BeautifulSoup
import pandas as pd
import nest_asyncio

JOB_DESCRIPTIONS = """
<Option title="AI Research Scientist" number="1">
  <PrimaryFocus>
    Investigate and adapt cutting-edge AI methodologies (e.g., generative models, advanced prompt engineering) for applications.
  </PrimaryFocus>
  <KeyResponsibilities>
    Conduct experiments to evaluate the performance (e.g., quality, accuracy) of new AI approaches and refine existing models.
    Collaborate with AI/ML Engineers to transition successful prototypes into production.
    Stay current with the latest AI research and emerging trends in generative AI.
    Develop human-annotated datasets for training and evaluation of AI models.
  </KeyResponsibilities>
  <SkillsAndTools>
    Deep understanding of LLMs and prompt engineering.
    Strong background in statistics, optimization, or related fields.
    Knowledge of experimental methods (e.g., A/B testing) and hypothesis testing.
    Knowledge of LLM evaluation methods, including algorithmic evals, human evals, or LLM-as-a-judge evals.
  </SkillsAndTools>
</Option>
<Option title="AI/ML Engineer" number="2">
  <PrimaryFocus>
    Transform research output into robust, scalable AI solutions for the product or internal use.
  </PrimaryFocus>
  <KeyResponsibilities>
    Productionize AI models, ensuring they meet performance and reliability requirements.
    Develop and maintain data pipelines for model training, inference, and monitoring.
    Collaborate closely with Research Scientists to optimize and refine model implementations.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, Go, or similar languages.
    Experience with API development and integration (REST, GraphQL).
    Working knowledge of software engineering best practices (version control, testing, CI/CD).
  </SkillsAndTools>
</Option>
<Option title="MLOps / AI Infrastructure Engineer" number="3">
  <PrimaryFocus>
    Ensure reliable deployment, scaling, and monitoring of AI systems in production.
  </PrimaryFocus>
  <KeyResponsibilities>
    Set up CI/CD pipelines tailored for AI workflows, including model versioning and data governance.
    Monitor production models for performance, latency, and data drift, implementing necessary updates.
    Manage infrastructure for scalable AI deployments (Docker, Kubernetes, cloud services).
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong DevOps background, with tools like Docker, Kubernetes, and Terraform.
    Familiarity with ML orchestration/monitoring tools (MLflow, Airflow, Prometheus).
    Experience optimizing compute usage (GPU/TPU) for cost-effective scaling.
  </SkillsAndTools>
</Option>
<Option title="AI Solution Architect" number="4">
  <PrimaryFocus>
    Design and orchestrate AI solutions leveraging generative models and LLM technologies to create impactful experiences and solutions that align with business objectives.
  </PrimaryFocus>
  <KeyResponsibilities>
    Collaborate with subject matter experts (SMEs) to identify and refine opportunities for generative AI/LLM-based use cases.
    Assess feasibility and define high-level solution architectures, ensuring they address core business and user requirements.
    Develop technical proposals and roadmaps, translating complex requirements into actionable plans.
    Provide thought leadership on conversational design, user experience flow, and model interaction strategies.
    Ensure solutions comply with relevant data governance, privacy, and security considerations.
    Facilitate cross-functional collaboration, guiding teams through solution conceptualization and implementation phases.
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong understanding of LLM capabilities and prompt engineering principles.
    Experience with conversational experience design (e.g., chatbots, voice interfaces) and user journey mapping.
    Ability to analyze business needs and translate them into feasible AI solution proposals.
    Familiarity with data privacy and security best practices, especially as they pertain to AI solutions.
    Excellent communication and stakeholder management skills to align technical and non-technical teams.
  </SkillsAndTools>
</Option>
<Option title="Data Scientist" number="5">
  <PrimaryFocus>
    Leverage statistical analysis, machine learning, and data visualization to derive actionable insights and guide data-informed decisions.
  </PrimaryFocus>
  <KeyResponsibilities>
    Perform exploratory data analysis (EDA) to identify trends and patterns in large, complex datasets.
    Develop and validate predictive and prescriptive models, collaborating with cross-functional teams to implement these solutions.
    Design and execute experiments to test hypotheses, measure impact, and inform business strategies.
    Present findings and recommendations to stakeholders in a clear, concise manner using visualizations and dashboards.
    Work with data engineers to ensure data quality, governance, and availability.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, R, or SQL for data manipulation and analysis.
    Experience with common ML libraries (e.g., scikit-learn, XGBoost) and deep learning frameworks (e.g., PyTorch, TensorFlow).
    Solid grounding in statistics, probability, and experimental design.
    Familiarity with data visualization tools (e.g., Tableau, Power BI) for communicating insights.
    Strong analytical thinking and ability to translate complex data problems into business solutions.
  </SkillsAndTools>
</Option>
<Option title="Data Engineer" number="6">
  <PrimaryFocus>
    Design, build, and maintain scalable data pipelines and architectures that enable efficient data collection, storage, and analysis.
  </PrimaryFocus>
  <KeyResponsibilities>
    Develop and optimize data ingestion and transformation processes (ETL/ELT), ensuring high performance and reliability.
    Implement and manage data workflows, integrating internal and external data sources.
    Collaborate with Data Scientists, AI/ML Engineers, and other stakeholders to ensure data readiness for analytics and model training.
    Monitor data pipelines for performance, reliability, and cost-effectiveness, taking corrective actions when needed.
    Maintain data quality and governance standards, including metadata management and data cataloging.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, SQL, and distributed data processing frameworks (e.g., Spark, Kafka).
    Experience with cloud-based data ecosystems (AWS, GCP, or Azure), and related storage/processing services (e.g., S3, BigQuery, Dataflow).
    Familiarity with infrastructure-as-code and DevOps tools (Terraform, Docker, Kubernetes) for automating data platform deployments.
    Strong understanding of database systems (relational, NoSQL) and data modeling principles.
    Knowledge of data orchestration and workflow management tools (Airflow, Luigi, Dagster).
  </SkillsAndTools>
</Option>
<Option title="Product Manager" number="7">
  <PrimaryFocus>
    Drive the product vision and strategy, ensuring alignment with business goals and user needs while delivering impactful AI-driven solutions.
  </PrimaryFocus>
  <KeyResponsibilities>
    Conduct user and market research to identify opportunities, define product requirements, and set success metrics.
    Collaborate with cross-functional teams (Engineering, Data Science, Design) to prioritize features and plan releases.
    Develop and communicate product roadmaps, ensuring stakeholders are aligned on goals and timelines.
    Monitor product performance through data analysis and user feedback, iterating on improvements and new feature ideas.
    Facilitate agile development practices, writing clear user stories and acceptance criteria.
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong understanding of product lifecycle management and agile methodologies (Scrum/Kanban).
    Excellent communication, negotiation, and stakeholder management skills.
    Experience with product management and collaboration tools (e.g., Jira, Confluence, Trello).
    Analytical mindset for leveraging metrics, A/B testing, and user feedback in decision-making.
    Familiarity with AI/ML concepts and the ability to translate technical possibilities into viable product features.
  </SkillsAndTools>
</Option>
<Option title="Software Engineer" number="8">
  <PrimaryFocus>
    Design, develop, and maintain high-quality software applications and services that address user needs and align with overall business objectives.
  </PrimaryFocus>
  <KeyResponsibilities>
    Collaborate with cross-functional teams (Product, Design, QA) to interpret requirements and deliver robust solutions.
    Write clean, efficient, and testable code following best practices and coding standards.
    Participate in system architecture and design discussions, contributing to the evolution of technical roadmaps.
    Perform code reviews and provide constructive feedback to peers, maintaining a high bar for code quality.
    Implement and maintain CI/CD pipelines to streamline deployment and reduce manual interventions.
    Continuously improve system performance and scalability through profiling and optimization.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in one or more programming languages (e.g., Java, Python, JavaScript, C++).
    Experience with modern frameworks/libraries (e.g., Spring Boot, Node.js, React, Django).
    Solid understanding of software design principles (e.g., SOLID, DRY) and architectural patterns (e.g., microservices).
    Familiarity with version control systems (Git), testing frameworks, and agile methodologies.
    Working knowledge of containerization (Docker), orchestration (Kubernetes), and cloud platforms (AWS, Azure, GCP).
  </SkillsAndTools>
</Option>
"""

PROMPTS = {
    "prompt": """
You will be given a list of Responsibilities and Skills listed for a job.
Your task is to determine if the job is a good fit with any of the Options, and if so, which one.

<Job>
<Responsibilities>
{responsibilities}
</Responsibilities>
<Skills>
{skills}
</Skills>
</Job>

<Options>
{Options}
</Options>

Start by thinking step-by-step about the Job and its Responsibilities and Skills, in relation to each of the Options.
Decide if the Job is a good fit with ANY of the Options.
If NONE of the Options are relevant to the Job, say so and return a 0 as your FinalAnswer.
Otherwise, decide which of the Options is the most similar to the Job and return its number as your FinalAnswer.

Return your response in the following format:
<Analysis>
[Your analysis of the job and its Responsibilities and Skills, in relation to each of the Options]
</Analysis>
<FinalAnswer>
0|1|2|3|4|5|6|7|8
</FinalAnswer>
"""
}

# Enable nested event loops
nest_asyncio.apply()


@dataclass
class JobData:
    """Represents a job posting with extracted information."""
    filename: str
    responsibilities: List[str]
    skills: List[str]


@dataclass
class ClassificationResult:
    """Represents the result of a job classification."""
    filename: str
    responsibilities: List[str]
    skills: List[str]
    analysis: str
    role_classification: Optional[int]
    role_title: Optional[str]


class JobClassifier:
    """Handles classification of jobs into predefined roles."""

    def __init__(self, output_dir: str = 'role_classifications', batch_size: int = 3):
        self.output_dir = output_dir
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(batch_size)
        os.makedirs(output_dir, exist_ok=True)

    async def classify_job(self, job: JobData) -> ClassificationResult:
        """Classify a single job listing."""
        logging.info(f"Classifying job '{job.filename}'")
        prompt = PROMPTS["prompt"].format(
            responsibilities=job.responsibilities,  # Access as attribute instead of dict
            skills=job.skills,
            Options=JOB_DESCRIPTIONS
        )
        try:
            response = await get_completion(prompt)
            await asyncio.sleep(5)  # Rate limiting
            return self._parse_response(job, response)
        except Exception as e:
            logging.error(f"Failed to classify job '{job.filename}': {str(e)}")
            return ClassificationResult(
                filename=job.filename,
                responsibilities=job.responsibilities,
                skills=job.skills,
                analysis=f'Failed to process: {str(e)}',
                role_classification=None,
                role_title=None
            )

    def _parse_response(self, job: JobData, response: str) -> ClassificationResult:
        """Parse LLM response into structured format."""
        soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml')
        analysis = soup.find('Analysis')
        role_choice = soup.find('FinalAnswer')
        role_number = int(role_choice.text.strip()) if role_choice else None
        role_title = self._get_role_title(role_number)
        return ClassificationResult(
            filename=job.filename,
            responsibilities=job.responsibilities,
            skills=job.skills,
            analysis=analysis.text.strip() if analysis else '',
            role_classification=role_number,
            role_title=role_title
        )

    def _get_role_title(self, role_number: Optional[int]) -> Optional[str]:
        """Get the title for a role number."""
        if role_number is None:
            return None
        if role_number == 0:
            return "Other"
        wrapped_xml = f"<root>{JOB_DESCRIPTIONS}</root>"
        job_descriptions_soup = BeautifulSoup(wrapped_xml, 'lxml-xml')
        matching_job = job_descriptions_soup.find('Option', {'number': str(role_number)})
        if matching_job:
            return matching_job['title']
        logging.error(f"No matching job found for role number {role_number}")
        return None


class JobProcessor:
    """Handles the processing of job data files."""

    def __init__(self, input_dir: str = 'json_extracted_data'):
        self.input_dir = input_dir
        self.classifier = JobClassifier()

    def load_jobs(self) -> List[JobData]:
        """Load jobs from JSON files."""
        jobs = []
        # Get list of all JSON files
        total_files = len([f for f in os.listdir(self.input_dir) if f.endswith('.json')])
        # Get files that were classified as GenAI-relevant
        classified_files = self._get_classified_files()
        # Get files that have already been processed
        processed_files = {
            f[:-5] for f in os.listdir(self.classifier.output_dir) if f.endswith('.json')
        }
        # Get relevant files that haven't been processed yet
        files_to_process = classified_files - processed_files

        logging.info(f"Found {total_files} total files")
        logging.info(f"Found {len(classified_files)} relevant GenAI files")
        logging.info(f"Already processed: {len(processed_files)} files")
        logging.info(f"Remaining to process: {len(files_to_process)} files")

        # Only process files that are both relevant and unprocessed
        for filename in os.listdir(self.input_dir):
            if not filename.endswith('.json'):
                continue
            name = filename[:-5]
            if name not in files_to_process:
                continue
            try:
                job = self._load_job_file(filename)
                if job:
                    jobs.append(job)
            except Exception as e:
                logging.error(f"Error processing {filename}: {str(e)}")
        logging.info(f"Processing {len(jobs)} remaining jobs")
        return jobs

    def _load_job_file(self, filename: str) -> Optional[JobData]:
        """Load and parse a single job file."""
        file_path = os.path.join(self.input_dir, filename)
        try:
            with open(file_path, 'r') as f:
                data = json.load(f)
            if 'responsibilities' in data and 'skills' in data:
                return JobData(
                    filename=filename[:-5],
                    responsibilities=data['responsibilities'],
                    skills=data['skills']
                )
        except json.JSONDecodeError:
            logging.error(f"Invalid JSON in {filename}")
        return None

    def _get_classified_files(self) -> Set[str]:
        """Get set of files that have been previously classified as GenAI-related."""
        genai_dir = 'role_genai_classifications'
        if not os.path.exists(genai_dir):
            return set()
        genai_files = set()
        for f in os.listdir(genai_dir):
            if f.endswith('.json'):
                try:
                    with open(os.path.join(genai_dir, f), 'r') as file:
                        data = json.load(file)
                    if data.get('is_genai_role') is True:  # Explicitly check for True
                        genai_files.add(f[:-5])
                except Exception as e:
                    logging.error(f"Error reading {f}: {e}")
        return genai_files

    async def process_jobs(self) -> None:
        """Process all jobs."""
        jobs = self.load_jobs()
        if not jobs:
            return
        logging.info(f"Processing {len(jobs)} relevant jobs")
        await self._process_jobs_batch(jobs)
        logging.info("Job classification complete")

    async def _process_jobs_batch(self, jobs: List[JobData]) -> None:
        """Process a batch of jobs concurrently."""
        async def process_with_semaphore(job: JobData) -> None:
            async with self.classifier.semaphore:
                result = await self.classifier.classify_job(job)
                self._save_result(result)
        await asyncio.gather(*[process_with_semaphore(job) for job in jobs])

    def _save_result(self, result: ClassificationResult) -> None:
        """Save classification result to file."""
        filename = os.path.join(self.classifier.output_dir, f"{result.filename}.json")
        with open(filename, 'w') as f:
            json.dump(vars(result), f, indent=2)
        logging.info(f"Saved classification for job {result.filename}")


def classify_job_roles(df: pd.DataFrame) -> None:
    """Main entry point for job classification."""
    if df.empty:
        logging.warning("Empty DataFrame provided")
        return
    processor = JobProcessor()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(processor.process_jobs())
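The log output below presumably came from running the entry point over the filtered listings DataFrame from earlier:

# Kick off role classification; progress is written to the logs and
# results are cached under role_classifications/.
classify_job_roles(df)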
2025-01-24 13:50:53,982 - INFO - Found 739 total files
2025-01-24 13:50:53,982 - INFO - Found 247 relevant GenAI files
2025-01-24 13:50:53,983 - INFO - Already processed: 247 files
2025-01-24 13:50:53,983 - INFO - Remaining to process: 0 files
2025-01-24 13:50:53,984 - INFO - Processing 0 remaining jobs
Load final data
Now I’ll load and combine all of the different data outputs into a single dataframe.
The code block below uses a DataLoader class with helper classes DataPaths and SalaryProcessor to handle the data processing pipeline. The code merges job posting data with role classifications, processes salary information by converting CAD to USD and extracting salary ranges, and filters the results to only include the four specific AI-related roles of interest: AI Research Scientist, AI Solution Architect, AI/ML Engineer, and MLOps/AI Infrastructure Engineer. The code includes error handling for JSON file loading and salary processing. The final output is a pandas DataFrame containing the merged and processed data, which includes job details, role classifications, and standardized salary information.
Click to view the code for loading and merging all data sources
from typing import List, Tuple, Optional, Dict
import pandas as pd
import json
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataPaths:
    """Configuration for data file paths"""
    ROLE_CLASSIFICATIONS_DIR: Path = Path('role_classifications')
    JSON_DATA_DIR: Path = Path('json_data')


class DataLoader:
    def __init__(self, paths: DataPaths = DataPaths()):
        self.paths = paths
        self.salary_processor = SalaryProcessor()

    def load_and_merge_data(self) -> pd.DataFrame:
        """Load and merge all data sources"""
        # Load and merge base data
        merged_df = self._load_base_data()

        # Process salaries
        salary_data = self.salary_processor.extract_salary_ranges(merged_df['salary'])
        merged_df = pd.concat([merged_df, salary_data], axis=1)

        # Filter out any jobs not in this list
        merged_df = merged_df[merged_df['role_title'].isin([
            'AI Research Scientist',
            'AI Solution Architect',
            'AI/ML Engineer',
            'MLOps / AI Infrastructure Engineer'
        ])]
        print(f"Shape of merged DataFrame: {merged_df.shape}")
        return merged_df

    def _load_base_data(self) -> pd.DataFrame:
        """Load and merge base data sources"""
        # Load role classifications
        role_data = pd.DataFrame(self._load_json_files(self.paths.ROLE_CLASSIFICATIONS_DIR))

        # Load and process jobs data
        jobs_data = pd.DataFrame(self._load_json_files(self.paths.JSON_DATA_DIR))
        jobs_data['filename'] = jobs_data['url'].apply(
            lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
        )

        # Merge dataframes
        merged_df = pd.merge(jobs_data, role_data, on='filename', how='left')
        return merged_df

    def _load_json_files(self, directory: Path) -> List[Dict]:
        """Load JSON files from directory"""
        json_files = []
        for file_path in directory.glob('*.json'):
            try:
                with open(file_path, 'r') as f:
                    json_files.append(json.load(f))
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
        return json_files


class SalaryProcessor:
    """Handles salary-related processing"""

    # Approximate rate at the time of scraping: 1 USD = 1.44 CAD, so
    # multiply CAD amounts by 1/1.44 to express them in USD.
    CAD_TO_USD_RATE = 1 / 1.44

    def extract_salary_ranges(self, salary_series: pd.Series) -> pd.DataFrame:
        """Extract salary ranges from a series of salary strings"""
        salary_data = salary_series.apply(self._extract_salary_range)
        return pd.DataFrame({
            'min_salary': salary_data.apply(lambda x: x[0]),
            'max_salary': salary_data.apply(lambda x: x[1]),
            'mid_salary': salary_data.apply(
                lambda x: (x[0] + x[1]) / 2 if x[0] and x[1] else None
            )
        })

    def _extract_salary_range(self, salary_str: str) -> Tuple[Optional[float], Optional[float]]:
        """Extract minimum and maximum salary from salary string"""
        try:
            if not isinstance(salary_str, str) or '0+' in salary_str:
                return None, None

            # Determine currency and convert if needed
            is_cad = 'CAD' in salary_str
            nums = (salary_str.replace('CAD ', '')
                    .replace('USD ', '')
                    .replace('K', '')
                    .split(' - '))
            if len(nums) != 2:
                return None, None

            # Convert to float and apply CAD conversion if needed
            min_salary = float(nums[0])
            max_salary = float(nums[1])
            if is_cad:
                min_salary *= self.CAD_TO_USD_RATE
                max_salary *= self.CAD_TO_USD_RATE
            return min_salary, max_salary
        except Exception:
            return None, None
# Initialize the loader and load the data
# Check if the csv file exists
if not os.path.exists('genai_job_data.csv'):
    loader = DataLoader()
    df = loader.load_and_merge_data()
    df.to_csv('genai_job_data.csv', index=False)
else:
    df = pd.read_csv('genai_job_data.csv')

# Now df contains all the merged data with processed salaries
print(f"Total jobs loaded: {len(df)}")

# Example operations you can do with the DataFrame:
# View available columns
print("\nAvailable columns:")
print(df.columns.tolist())

# View distribution of roles
print("\nRole distribution:")
print(df['role_title'].value_counts())
Total jobs loaded: 187
Available columns:
['title', 'company', 'location', 'level', 'salary', 'url', 'description', 'scraped_date', 'posted_date', 'raw_data', 'filename', 'responsibilities', 'skills', 'analysis', 'role_classification', 'role_title', 'min_salary', 'max_salary', 'mid_salary']
Role distribution:
role_title
AI/ML Engineer 60
AI Research Scientist 52
AI Solution Architect 48
MLOps / AI Infrastructure Engineer 27
Name: count, dtype: int64
Salary analysis
First I’ll analyze the salary data for each role and experience level. I’ll use a RandomForest model to adjust for location differences, and convert any CAD-denominated salaries to USD. I’m interested to see what compensation looks like for each role, and whether there are any significant differences in compensation between the roles.
The code below implements a salary analysis system for the job posting data. It processes salary information across different roles and experience levels, handling currency conversions (CAD to USD) and adjusting for location differences using a RandomForest model. For each role and level combination, it calculates salary ranges, removes statistical outliers, and provides sample sizes. The system includes configuration settings for analysis parameters and comprehensive error handling. The code uses pandas for data manipulation and scikit-learn for the location adjustment model, with results formatted as salary ranges in USD.
Click to view the code for salary analysis
from dataclasses import dataclass
from typing import Dict, List, Optional, TypedDict, Set
import logging
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


@dataclass
class SalaryConfig:
    """Configuration settings for salary analysis"""
    outlier_threshold: float = 1.5
    n_estimators: int = 100
    min_group_size: int = 3
    currency_symbol: str = '$'
    cv_folds: int = 5


class SalaryColumns(TypedDict):
    """Type definitions for salary DataFrame columns"""
    min_salary_usd_k: float
    max_salary_usd_k: float
    median_salary_usd_k: float
    sample_size: int
    location_adjusted_zscore: float


class SalaryAnalyzer:
    """Analyzes salary distributions across role titles.

    Column Naming Conventions:
    - *_usd_k: Values in thousands of USD
    - *_zscore: Standardized scores (mean=0, std=1)
    - sample_size: Number of data points in group
    - *_median: Median values
    - location_adjusted_*: Values adjusted for location differences

    Attributes:
        config: Configuration settings for analysis
        logger: Logger instance for the class
    """

    # Input column names
    COL_MIN_SALARY = 'min_salary'
    COL_MAX_SALARY = 'max_salary'
    COL_ROLE_TITLE = 'role_title'
    COL_LEVEL = 'level'
    COL_LOCATION = 'location'

    # Processed column names
    COL_MEDIAN_SALARY = 'median_salary_usd_k'
    COL_MIN_SALARY_MEDIAN = 'min_salary_median_usd_k'
    COL_MAX_SALARY_MEDIAN = 'max_salary_median_usd_k'
    COL_SAMPLE_SIZE = 'sample_size'

    # Z-score columns
    COL_MIN_ZSCORE = 'min_salary_location_adjusted_zscore'
    COL_MID_ZSCORE = 'median_salary_location_adjusted_zscore'
    COL_MAX_ZSCORE = 'max_salary_location_adjusted_zscore'

    def __init__(self, config: Optional[SalaryConfig] = None):
        """Initialize SalaryAnalyzer with optional configuration."""
        self.config = config or SalaryConfig()
        self.logger = logging.getLogger(__name__)
        self._model = None  # Cache for trained model

    def analyze_role_salaries(self, df: pd.DataFrame) -> pd.DataFrame:
        """Analyze salary distributions across role titles with location adjustment.

        Args:
            df: DataFrame containing columns:
                - min_salary: Minimum salary (float)
                - max_salary: Maximum salary (float)
                - role_title: Job role (str)
                - level: Experience level (str)
                - location: Job location (str)

        Returns:
            DataFrame with aggregated salary statistics by role and level

        Raises:
            ValueError: If required columns are missing or DataFrame is empty
        """
        self._validate_input(df)
        self.logger.info(f"Starting salary analysis with {len(df)} records")

        # Create working copy of DataFrame
        analysis_df = self._prepare_data(df)
        # Process and clean data
        analysis_df = self._process_data(analysis_df)
        # Generate final results
        results = self._aggregate_results(analysis_df)

        self.logger.info("Salary analysis completed successfully")
        return results

    def _validate_input(self, df: pd.DataFrame) -> None:
        """Validate input DataFrame."""
        if df is None or df.empty:
            raise ValueError("Input DataFrame cannot be None or empty")
        required_cols = {
            self.COL_MIN_SALARY,
            self.COL_MAX_SALARY,
            self.COL_ROLE_TITLE,
            self.COL_LEVEL,
            self.COL_LOCATION
        }
        missing_cols = required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"DataFrame missing required columns: {missing_cols}")

    def _prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Prepare data for analysis by filtering valid salaries."""
        valid_salary_mask = (
            df[self.COL_MIN_SALARY].notna() & df[self.COL_MAX_SALARY].notna()
        )
        if not valid_salary_mask.any():
            self.logger.warning("No valid salary data found")
            return pd.DataFrame()

        # Create working copy and calculate median salary
        analysis_df = df[valid_salary_mask].copy()
        analysis_df[self.COL_MEDIAN_SALARY] = (
            analysis_df[[self.COL_MIN_SALARY, self.COL_MAX_SALARY]].mean(axis=1)
        )
        return analysis_df

    def _remove_outliers(self, group: pd.DataFrame) -> pd.DataFrame:
        """Remove statistical outliers from salary data."""
        if len(group) <= self.config.min_group_size:
            return group

        # Only check min and max salary for outliers
        salary_cols = [self.COL_MIN_SALARY, self.COL_MAX_SALARY]
        for col in salary_cols:
            Q1 = group[col].quantile(0.25)
            Q3 = group[col].quantile(0.75)
            IQR = Q3 - Q1
            outlier_mask = ~(
                (group[col] < (Q1 - self.config.outlier_threshold * IQR)) |
                (group[col] > (Q3 + self.config.outlier_threshold * IQR))
            )
            group = group[outlier_mask]

        # Recalculate median after removing outliers
        group[self.COL_MEDIAN_SALARY] = (
            group[self.COL_MIN_SALARY] + group[self.COL_MAX_SALARY]
        ) / 2
        return group

    def _aggregate_results(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate and format the final results."""
        # First calculate group statistics
        grouped = df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL]).agg({
            self.COL_MIN_SALARY: ['count', 'median'],
            self.COL_MAX_SALARY: 'median',
        }).round(0)

        # Calculate median salary from min and max medians
        grouped['median_salary'] = (
            grouped[(self.COL_MIN_SALARY, 'median')] +
            grouped[(self.COL_MAX_SALARY, 'median')]
        ) / 2

        # Flatten and rename columns
        grouped.columns = [
            'sample_size' if col == (self.COL_MIN_SALARY, 'count')
            else 'min_salary_usd_k' if col == (self.COL_MIN_SALARY, 'median')
            else 'max_salary_usd_k' if col == (self.COL_MAX_SALARY, 'median')
            else 'median_salary_usd_k' if col[0] == 'median_salary'
            else col
            for col in grouped.columns
        ]

        # Order columns
        ordered_cols = [
            'sample_size',
            'min_salary_usd_k',
            'max_salary_usd_k',
            'median_salary_usd_k'
        ]
        result = grouped.reindex(columns=ordered_cols)

        # Format salary values
        salary_cols = [col for col in ordered_cols if col.endswith('_usd_k')]
        for col in salary_cols:
            result[col] = result[col].apply(
                lambda x: f"{self.config.currency_symbol}{x:,.0f}K" if pd.notna(x) else "N/A"
            )
        return result

    def _process_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process and clean the salary data."""
        # Remove outliers by group
        df = (df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL])
              .apply(self._remove_outliers)
              .reset_index(drop=True))
        # Add location-adjusted scores
        df = self._add_location_adjusted_scores(df)
        # Clean level names
        df[self.COL_LEVEL] = (df[self.COL_LEVEL]
                              .str.replace('-Level / Level', '', regex=False)
                              .str.replace('-level', '', regex=False))
        return df

    def _train_model(self) -> RandomForestRegressor:
        """Train and cache RandomForest model."""
        if self._model is None:
            self._model = RandomForestRegressor(
                n_estimators=self.config.n_estimators,
                random_state=42
            )
        return self._model

    def _add_location_adjusted_scores(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add location-adjusted z-scores to the dataframe."""
        location_dummies = pd.get_dummies(df[self.COL_LOCATION], prefix='loc')
        model = self._train_model()
        score_columns = {
            self.COL_MIN_SALARY: self.COL_MIN_ZSCORE,
            self.COL_MEDIAN_SALARY: self.COL_MID_ZSCORE,
            self.COL_MAX_SALARY: self.COL_MAX_ZSCORE
        }
        for source_col, target_col in score_columns.items():
            df[target_col] = self._adjust_salaries(
                df[source_col], location_dummies, model
            )
        return df

    def _adjust_salaries(self, salary_series: pd.Series, X: pd.DataFrame,
                         model: RandomForestRegressor) -> pd.Series:
        """Adjust salaries using RandomForest model."""
        # Evaluate model performance
        scores = cross_val_score(model, X, salary_series, cv=self.config.cv_folds)
        self.logger.debug(f"Cross-validation scores: {scores.mean():.3f} ± {scores.std():.3f}")

        # Fit model and calculate residuals
        model.fit(X, salary_series)
        expected = model.predict(X)
        residuals = salary_series - expected

        # Return standardized residuals
        return (residuals - residuals.mean()) / residuals.std()
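The grouped results below can presumably be produced along these lines (the exact invocation isn't shown in the original notebook):

# Run the analysis on the merged DataFrame from the loading step
# and display the (role_title, level)-indexed summary table.
analyzer = SalaryAnalyzer()
salary_summary = analyzer.analyze_role_salaries(df)
salary_summary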
2025-01-24 13:50:54,315 - INFO - Starting salary analysis with 187 records
2025-01-24 13:50:56,541 - INFO - Salary analysis completed successfully
| role_title | level | sample_size | min_salary_usd_k | max_salary_usd_k | median_salary_usd_k |
|---|---|---|---|---|---|
| AI Research Scientist | Entry | 3 | $140K | $170K | $155K |
| AI Research Scientist | Mid | 8 | $122K | $209K | $166K |
| AI Research Scientist | Senior | 24 | $159K | $251K | $205K |
| AI Solution Architect | Entry | 2 | $133K | $266K | $200K |
| AI Solution Architect | Mid | 7 | $120K | $206K | $163K |
| AI Solution Architect | Senior | 33 | $146K | $247K | $196K |
| AI/ML Engineer | Entry | 1 | $135K | $251K | $193K |
| AI/ML Engineer | Mid | 11 | $129K | $224K | $176K |
| AI/ML Engineer | Senior | 28 | $158K | $268K | $213K |
| MLOps / AI Infrastructure Engineer | Entry | 2 | $152K | $260K | $206K |
| MLOps / AI Infrastructure Engineer | Mid | 5 | $143K | $214K | $178K |
| MLOps / AI Infrastructure Engineer | Senior | 16 | $164K | $258K | $211K |
Looking across all four GenAI roles (AI Research, Solution Architecture, ML Engineering, and MLOps), compensation bands are broadly similar. At senior levels (where I have the largest samples), median salaries cluster tightly between ~$195K and $210K USD. Mid-level positions show medians between ~$165K and $180K USD, and entry-level positions generally start between ~$155K and $205K USD, although the entry-level data is limited by small sample sizes.
This consistency suggests these roles are valued similarly in the market, despite their different focuses.
Job responsibilities
Next, I’m interested in what the job responsibilities are for these roles, and how they differ. I’ll use an LLM to identify the most common responsibilities and skills for a given role. This task is quite complex and involves a lot of input data, so a more capable model like o1 is appropriate. I don’t have access to o1 via API, so I’ll use the web interface to generate completions, preparing XML files for the inputs to this task so that I can copy-paste the XML into the chat.
As we’ll see, Research Scientists lean heavily into discovery, theoretical advances, publication, and cutting-edge experimentation, whereas ML Engineers center on production-grade systems, robust architecture, MLOps, and alignment with business needs. In many cases, though, the lines between these roles are blurred.
The code for XML generation:
import logging
import os
from typing import Dict

import pandas as pd

INSTRUCTION_PREFIX = """Identify the most common Responsibilities and Skills listed in the jobs below, returning a bulleted list in the format

# Responsibilities
- [responsibility]: [responsibility description]
- [responsibility]: [responsibility description]
...

# Skills
- [skill]: [skill description]
- [skill]: [skill description]
...
"""


class XMLGenerator:
    """Handles generation of XML files from job data"""

    def __init__(self, output_dir: str = 'xml_roles'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )

    def compile_xml_for_role(self, level: str, role_title: str, df: pd.DataFrame) -> Dict:
        """Generate XML-formatted string of job details for the given level and role title."""
        # Filter DataFrame for matching level and role
        filtered_df = df[
            (df['level'] == level) &
            (df['role_title'] == role_title)
        ]
        if len(filtered_df) == 0:
            return {
                'level': level,
                'role_title': role_title,
                'xml_output': '',
                'error': f"No jobs found for {level} {role_title}"
            }

        # Build XML string
        xml_output = ""
        for _, job in filtered_df.iterrows():
            xml_output += "<Job>\n"
            xml_output += f"<Responsibilities>{job['responsibilities']}</Responsibilities>\n"
            xml_output += f"<Skills>{job['skills']}</Skills>\n"
            xml_output += "</Job>\n"

        return {
            'level': level,
            'role_title': role_title,
            'xml_output': INSTRUCTION_PREFIX + xml_output
        }

    def save_xml(self, level: str, role_title: str, xml_data: Dict) -> None:
        """Save XML data to a file"""
        if not xml_data['xml_output']:
            # logging.warning(xml_data.get('error', 'Empty XML output'))
            return

        # Clean and sanitize the filename components
        cleaned_level = level.replace('-Level / Level', '').strip()
        safe_level = cleaned_level.replace('/', '_').replace(' ', '_')
        safe_role = role_title.replace('/', '_').replace(' ', '_')
        filename = os.path.join(self.output_dir, f"{safe_level}_{safe_role}.xml")

        with open(filename, 'w') as f:
            f.write(xml_data['xml_output'])
        logging.debug(f"Saved XML for {cleaned_level} {role_title} to {filename}")

    def generate_all_xml(self, df: pd.DataFrame) -> None:
        """Generate XML files for all role/level combinations"""
        if df.empty:
            logging.warning("No jobs found in DataFrame - check if data was loaded correctly")
            return

        required_cols = ['level', 'role_title', 'responsibilities', 'skills']
        if not all(col in df.columns for col in required_cols):
            logging.error(f"DataFrame missing required columns: {required_cols}")
            return

        # Process each unique role/level combination
        role_combinations = df[['level', 'role_title']].drop_duplicates()
        for _, combo in role_combinations.iterrows():
            xml_data = self.compile_xml_for_role(
                combo['level'], combo['role_title'], df
            )
            self.save_xml(combo['level'], combo['role_title'], xml_data)
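Generating the files is then a single call against the merged DataFrame; each role/level combination is written to its own XML file, which I can open and paste into the chat. A quick sketch, assuming df is the merged DataFrame from earlier:

# Write one XML file per role/level combination into ./xml_roles
# (e.g. Senior_AI_Research_Scientist.xml), ready to paste into the
# o1 web interface.
generator = XMLGenerator()
generator.generate_all_xml(df)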
Here’s what o1 said about the Responsibilities and Skills observed in Senior AI Research Scientist roles:
Responsibilities
Conduct advanced AI research
Many roles require pushing the state-of-the-art in Generative AI, LLMs, and related areas (e.g., video/multimodal models, diffusion models) through novel algorithms, architectures, and experimental studies.
Train and fine-tune large-scale models
Commonly involves working with massive datasets and distributed training setups (thousands of GPUs, HPC environments) to develop foundation models and advanced AI systems.
Develop and implement new algorithms or architectures
Spans designing novel model architectures (e.g., diffusion, transformer-based, multimodal fusion) and creating robust data processing or simulation pipelines to support AI solutions.
Collaborate with cross-functional teams
Emphasizes close work with engineering, product management, research, and external stakeholders to integrate AI breakthroughs into real-world applications and products.
Evaluate and measure AI performance
Entails building rigorous evaluation frameworks, designing new metrics, and systematically analyzing model behavior to ensure quality and reliability.
Publish and communicate research findings
Many positions highlight writing influential papers, presenting at conferences, and sharing innovative results both internally and with the broader AI community.
Build and maintain data pipelines
Involves constructing high-quality, scalable data pipelines or tooling to support training, fine-tuning, and inference of large models.
Ensure production-grade implementation
Requires writing clean, efficient, and maintainable code as well as optimizing models and pipelines to meet performance, reliability, and quality standards.
Skills
Proficiency in Python and deep learning frameworks
Strong coding skills in Python and hands-on experience with libraries such as PyTorch, TensorFlow, or JAX appear in nearly every role.
Expertise with LLMs and Generative AI
Deep understanding of transformer architectures, diffusion models, multimodal systems, prompt engineering, and other advanced AI techniques is frequently mentioned.
Experience with large-scale/distributed training
Many roles emphasize knowledge of HPC, GPU optimization, model parallelism (e.g., FSDP, DeepSpeed, Megatron-LM), and handling massive datasets.
Strong software engineering practices
Testing, code review, debugging, version control, and producing clean, modular research or production code are consistently important.
Collaboration and communication skills
Clear written and verbal communication, along with cross-functional teamwork, is vital for integrating AI solutions into products and relaying complex ideas.
Research acumen and adaptability
Ability to read, interpret, and prototype cutting-edge AI literature, publish findings, and rapidly iterate on experiments.
Machine Learning fundamentals
Solid grounding in ML theory (e.g., optimization, statistics, data structures) and experience with model evaluation, data manipulation, and pipeline design.
Familiarity with prompt engineering and advanced NLP concepts
Many roles highlight crafting effective prompts, aligning model outputs with user needs, and leveraging text-generation or conversational AI techniques.
And here’s what o1 said about the Responsibilities and Skills observed in Senior AI/ML Engineer roles:
Responsibilities
Design, develop, and deploy AI/ML solutions
End-to-end creation of machine learning systems, from initial prototypes to production-ready deployments.
Collaborate with cross-functional teams
Work closely with product managers, data scientists, engineers, and other stakeholders to align technical solutions with business goals.
Monitor and optimize model performance
Track key metrics, fine-tune models, and iterate to ensure continuous improvement and reliability in production.
Stay current with AI research and emerging technologies
Keep up-to-date with the latest breakthroughs in areas like LLMs, generative AI, and deep learning.
Mentor and coach team members
Provide guidance on best practices, design patterns, code quality, and career development for junior or peer engineers.
Develop scalable data/ML pipelines
Build robust infrastructure for data collection, preprocessing, model training, and deployment at scale.
Implement and maintain CI/CD and coding best practices
Ensure code quality, streamline release processes, and enforce testing discipline for AI/ML components.
Integrate and leverage LLMs/generative AI
Incorporate large language models or generative methods into products and workflows.
Prototype and experiment
Conduct R&D, proof-of-concepts, and pilot programs to explore emerging AI techniques and validate new product ideas.
Document and communicate findings
Produce clear technical documentation, share results with stakeholders, and provide actionable insights for decision-making.
Skills
Proficiency in Python
Commonly required for AI/ML development, data manipulation, and scripting.
Experience with ML/DL frameworks
Hands-on expertise in tools like PyTorch, TensorFlow, or JAX for building and training models.
Familiarity with cloud platforms
Working knowledge of AWS, GCP, or Azure for deploying and scaling AI solutions.
Expertise in LLMs/generative AI
Understanding of transformer architectures, prompt engineering, retrieval-augmented generation (RAG), and related libraries.
Strong software engineering fundamentals
Solid grasp of algorithms, data structures, design patterns, and best practices for production code.
Knowledge of MLOps and CI/CD
Experience with containerization (Docker, Kubernetes), version control (Git), and automated testing/monitoring.
Data processing and SQL
Skills in handling large datasets, working with Spark or similar frameworks, and writing performant SQL queries.
Effective communication and collaboration
Ability to translate complex technical concepts for non-technical stakeholders and work well in diverse teams.
Problem-solving and debugging
Track record of diagnosing issues in production environments and implementing reliable fixes.
Continuous learning mindset
Eagerness to stay on top of new AI research, frameworks, and technologies to innovate and improve solutions.
Job titles
What are the most common job titles for these roles?
The code for aggregating titles:
import json
import os

import pandas as pd


def aggregate_titles_by_role_level(df: pd.DataFrame, output_dir: str = 'role_titles') -> None:
    """
    Aggregate job titles for each role_title and level combination and save to JSON.

    Args:
        df: DataFrame containing 'role_title', 'level', and 'title' columns
        output_dir: Directory to save the JSON output
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Group by role_title and level
    grouped = df.groupby(['role_title', 'level'])

    for (role, level), group in grouped:
        # Get unique titles for this combination
        result = {
            'titles': group['title'].unique().tolist()
        }

        # Create safe filename from role and level
        safe_role = role.replace('/', '_').replace(' ', '_')
        safe_level = level.replace('/', '_').replace(' ', '_')
        filename = f"{safe_level}_{safe_role}.json"

        # Save to JSON file
        with open(os.path.join(output_dir, filename), 'w') as f:
            json.dump(result, f, indent=2)

        print(f"Saved titles for {level} {role}")
aggregate_titles_by_role_level(df)
Saved titles for Entry-level AI Research Scientist
Saved titles for Mid-Level / Level AI Research Scientist
Saved titles for Senior-Level / Level AI Research Scientist
Saved titles for Entry-level AI Solution Architect
Saved titles for Mid-Level / Level AI Solution Architect
Saved titles for Senior-Level / Level AI Solution Architect
Saved titles for Entry-level AI/ML Engineer
Saved titles for Mid-Level / Level AI/ML Engineer
Saved titles for Senior-Level / Level AI/ML Engineer
Saved titles for Entry-level MLOps / AI Infrastructure Engineer
Saved titles for Mid-Level / Level MLOps / AI Infrastructure Engineer
Saved titles for Senior-Level / Level MLOps / AI Infrastructure Engineer
I’ll use o1 again to generate a summary of the most common titles for each role/level combination. In the prompt, I will omit my own role classification so as not to bias the results. I’ll ask it to “identify the most common titles listed below (ignoring slight variations), and then for those titles identify how often they occurred.”
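For illustration, here is roughly how one of the saved JSON files becomes a prompt for the web interface. The exact filename follows the pattern produced by the aggregation code above, and the instruction text is the one quoted in the previous paragraph:

import json

# Build the title-frequency prompt from one saved file; the filename follows
# the {safe_level}_{safe_role}.json pattern from aggregate_titles_by_role_level().
with open('role_titles/Senior-Level___Level_AI_Research_Scientist.json') as f:
    titles = json.load(f)['titles']

prompt = (
    "Identify the most common titles listed below (ignoring slight "
    "variations), and then for those titles identify how often they "
    "occurred.\n\n" + "\n".join(titles)
)
print(prompt)  # copy-paste into the o1 web interface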
Here’s a sample of what it said.
For Senior AI Research Scientist roles, o1 observed the following titles to be common to this role:
Research Scientist/Researcher (14 occurrences)
Research Engineer (7 occurrences)
Machine Learning Engineer (3 occurrences)
Generative AI Engineer (2 occurrences)
Software Engineer (Generative AI) (2 occurrences)
For Senior AI/ML Engineer roles, o1 observed the following titles to be common to this role:
Senior Software Engineer (6 occurrences)
Senior Machine Learning Engineer (5 occurrences)
Staff Software Engineer (5 occurrences)
Senior AI Engineer (4 occurrences)
Machine Learning Researcher (3 occurrences)
Staff Machine Learning Engineer (2 occurrences)
Principal Software Engineer (2 occurrences)
Source Code
---title: "Analysis of the GenAI/LLM job market in January 2025"date: 2025-01-24description: "A data-driven exploration of the GenAI/LLM job market for science and engineering roles in January 2025. I scrape ~1000 job postings from ai-jobs.net, perform data extraction and classification using LLMs, and then analyze the data to identify patterns and insights about the GenAI/LLM job market, including salary ranges, skill requirements, and role distribution."categories: - prompt-engineering - python - job-market - GenAI - LLMfreeze: true---# IntroductionAs someone currently working in a "Prompt Engineering" role, I've been thinking a lot about how this title basically doesn't exist outside of a handful of companies, how the title communicates a narrow range of skills and responsibilities,and how the work that I do day-to-day is much larger in scope than just writing prompts. I identify more as an AI Engineer or AI Research Scientist, and so I was interested to see what I could learn about other similar roles that work with GenAI and LLMs.So with that motivation, I set about to collect some data and look at the job market for these sorts of roles. What responsibilities and skills are being advertised most often, what kinds of titles are being used for these roles, and what kind of compensation is being offered at different levels of seniority?To accomplish this, I built a custom web scraper to collect ~1000 job postings from ai-jobs.net. My code gathers job details like the title, company, location, salary, and posting date -- from a range of U.S. and Canadian cities, covering entry-level, mid-level, and senior positions. I then use a Large Language Model (LLM) to extract each job's key responsibilities, required skills, and qualifications. Afterwards, I classify each position to see whether it involves working with Generative AI or large language models, and if so, categorize it further into four major AI roles: (1) AI Research Scientist, (2) AI/ML Engineer, (3) MLOps/AI Infrastructure Engineer, and (4) AI Solution Architect. I believe this set of roles is a good representation of different areas of focus.Finally, I integrate the various metadata and classifications into a comprehensive dataset. I observe that GenAI/LLM positions command consistently high salary ranges across the four different roles, particularly at more senior levels. Senior-level roles tended to offer median salaries in the $195K–$210K range, while mid-level roles generally clustered around $165K–$180K. Entry-level salaries showed greater variation (likely due to the small sample size) but still landed in competitive ranges of roughly $155K–$205K in many postings. These roles often share common technical demands—like proficiency with large-scale model training, distributed computing, and LLM-specific knowledge—though each role emphasizes distinct priorities (research vs. production, for example).Of course, this analysis is not without limitations. For example, I am relying on a single job board, and I scraped jobs during a limited window of time. I also have not rigorously validated the LLM classifications -- although I have implemented many prompt engineering best practices, and have used some of the more powerful LLMs (4o and o1). To some extent, the responsibilities and skills that were recovered from the original job postings classified into the four pre-defined roles do speak to the relative accuracy of the role classifications. 
The salary analysis also does not distinguish between base salary and total compensation, remote vs. in-person opportunities, large vs. small companies, and so on. But overall, I think this gives some sense of the job market for these roles.# Scraping job postingsI'll scrape job postings from ai-jobs.net. This is an aggregator that specializes in AI jobs of all kinds, sourcing jobs from over 60 countries, and in my experience, it does a pretty good job.The code below implements a web scraper for job postings from ai-jobs.net. It collects job postings from major cities in the US and Canada, searching across entry-level, mid-level, and senior positions. For each job, it gets the title, company, location, salary, description, and posting date. The scraper saves each job as a JSON file and keeps track of what it has already scraped to avoid duplicates. It includes error handling and logging to track any problems that occur during scraping.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the Job Posting Scraper code"from dataclasses import dataclassfrom typing import List, Dict, Optional, Setfrom datetime import datetime, timedeltaimport loggingimport jsonimport requestsfrom bs4 import BeautifulSoupfrom pathlib import Pathimport reimport time@dataclassclass JobData:"""Structured container for job posting data""" title: str company: str location: str level: Optional[str] salary: Optional[str] url: str description: str scraped_date: str posted_date: Optional[str] raw_data: strclass ScrapeConfig:"""Configuration settings for the scraper""" EXPERIENCE_LEVELS = ['EN', 'MI', 'SE'] # Entry, Mid, Senior CITIES = {'5391959': 'San Francisco','5128581': 'New York City','6167865': 'Toronto','6173331': 'Vancouver','5809844': 'Seattle','4671654': 'Austin','4930956': 'Boston','5': 'Region' } CATEGORIES = {'1': 'Research','2': 'Engineering','18': 'GenerativeAI' } BASE_URL ="https://ai-jobs.net"class SalaryExtractor:"""Handles salary extraction logic"""@staticmethoddef extract_from_schema(schema_data: Dict) -> Optional[str]:"""Extract salary from schema.org data"""try:if schema_data and'baseSalary'in schema_data: base_salary = schema_data.get('baseSalary', {}).get('value', {})if base_salary: min_value = base_salary.get('minValue') max_value = base_salary.get('maxValue') currency = base_salary.get('currency', 'USD') # Default to USD if not specifiedif min_value and max_value:returnf"{currency}{float(min_value)/1000:.0f}K - {float(max_value)/1000:.0f}K"elif min_value:returnf"{currency}{float(min_value)/1000:.0f}K+"exceptExceptionas e: logging.debug(f"Error extracting schema salary: {e}")returnNone@staticmethoddef extract_from_text(description: str) -> Optional[str]:"""Extract salary from description text""" patterns = [r'(?:salary range.*?)(?:CAD|\$)?([\d,]+)\s*-\s*(?:CAD|\$)?([\d,]+)',r'(?:salary.*?)(?:CAD|\$)?([\d,]+)(?:\s*-\s*(?:CAD|\$)?([\d,]+))?' 
]for pattern in patterns:if match := re.search(pattern, description, re.IGNORECASE):try: groups = match.groups()# Check if salary is in CAD currency ='CAD'if'CAD'in description.upper() else'USD'iflen(groups) ==2and groups[1]: # Range format min_sal =float(groups[0].replace(',', '')) max_sal =float(groups[1].replace(',', ''))returnf"{currency}{min_sal/1000:.0f}K - {max_sal/1000:.0f}K"elif groups[0]: # Single value format base_sal =float(groups[0].replace(',', ''))returnf"{currency}{base_sal/1000:.0f}K+"exceptValueError:continuereturnNone@classmethoddef extract(cls, soup: BeautifulSoup, description: str, schema_data: Dict) -> Optional[str]:"""Main salary extraction method"""# Try schema data firstif salary := cls.extract_from_schema(schema_data):return salary# Try description textif salary := cls.extract_from_text(description):return salary# Try salary badgeif salary_badge := soup.find('span', class_='badge rounded-pill text-bg-success'): salary_text = salary_badge.text.strip()if re.search(r'(USD|\$|\d)', salary_text):return salary_textreturnNoneclass JobPageParser:"""Handles parsing of individual job pages"""def__init__(self, html: str):self.soup = BeautifulSoup(html, 'html.parser')self.raw_html = htmldef parse_schema_data(self) -> Dict:"""Parse schema.org JSON-LD data"""if script :=self.soup.find('script', type='application/ld+json'):try: cleaned_script = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', script.string)return json.loads(cleaned_script)exceptExceptionas e: logging.warning(f"Could not parse JSON-LD data: {e}")return {}def parse_job_data(self, url: str) -> JobData:"""Extract all job data from the page""" schema_data =self.parse_schema_data() scraped_date = datetime.now().isoformat()# Get company name from schema or fallback to page element company = schema_data.get('hiringOrganization', {}).get('name')ifnot company:if company_elem :=self.soup.find('a', class_='company-name'): company = company_elem.text.strip()# Get job level from meta description level =Noneif meta_desc :=self.soup.find('meta', {'name': 'description'}):if level_match := re.search(r'a ([\w-]+level)', meta_desc['content']): level = level_match.group(1).replace('-', '-level / ').title()return JobData( title=self.soup.find('h1', class_='display-5').text.strip(), company=company, location=self.soup.find('h3', class_='lead').text.strip(), level=level, salary=SalaryExtractor.extract(self.soup,self.soup.find('div', class_='job-description-text').text.strip(), schema_data ), url=url, description=self.soup.find('div', class_='job-description-text').text.strip(), scraped_date=scraped_date, posted_date=self.calculate_posted_date(scraped_date), raw_data=self.raw_html )def calculate_posted_date(self, scraped_date_str: str) -> Optional[str]:"""Calculate posting date from 'posted X time ago' text"""if match := re.search(r'Posted (\d+) (hours?|days?|weeks?|months?) 
ago', self.raw_html): number =int(match.group(1)) unit = match.group(2) scraped_date = datetime.fromisoformat(scraped_date_str) delta = {'hour': timedelta(hours=number),'day': timedelta(days=number),'week': timedelta(weeks=number),'month': timedelta(days=number *30) # Approximate }.get(unit.rstrip('s'))if delta:return (scraped_date - delta).isoformat()returnNoneclass AIJobScraper:"""Main scraper orchestration class"""def__init__(self, output_dir: str='json_data'):self.output_dir = Path(output_dir)self.output_dir.mkdir(exist_ok=True)self.session = requests.Session()self.config = ScrapeConfig()self.existing_jobs =self._load_existing_jobs()def _load_existing_jobs(self) -> Set[str]:"""Load set of existing job URLs""" existing_jobs =set()for f inself.output_dir.glob('*.json'):try: data = json.loads(f.read_text()) existing_jobs.add(data['url'])exceptExceptionas e: logging.warning(f"Error reading {f}: {e}")return existing_jobsdef _generate_search_urls(self) -> List[str]:"""Generate all search URL combinations""" urls = []for exp inself.config.EXPERIENCE_LEVELS:for cat inself.config.CATEGORIES:for city inself.config.CITIES: url =f"{self.config.BASE_URL}/?" url +=f"cat={cat}" url +=f"&{'reg'if city =='5'else'cit'}={city}" url +=f"&typ=1&key=&exp={exp}&sal=" urls.append(url)return urlsdef get_job_urls(self, search_url: str) -> List[str]:"""Get all job URLs from a search page"""try: response =self.session.get(search_url) response.raise_for_status() soup = BeautifulSoup(response.text, 'html.parser')if job_list := soup.find('ul', id='job-list'):return [f"{self.config.BASE_URL}{link['href']}"for link in job_list.find_all('a', href=lambda h: h and'/job/'in h) ]exceptExceptionas e: logging.error(f"Error fetching job URLs: {e}")return []def scrape_job_details(self, url: str) -> Optional[JobData]:"""Scrape details from a job page"""try: response =self.session.get(url) response.raise_for_status() parser = JobPageParser(response.text)return parser.parse_job_data(url)exceptExceptionas e: logging.error(f"Error scraping job {url}: {e}")returnNonedef save_job(self, job: JobData) ->None:"""Save job data to JSON file"""try: safe_title = re.sub(r'[^\w\s-]', '', job.title) safe_company = re.sub(r'[^\w\s-]', '', job.company or'unknown') filename =f"{safe_company}_{safe_title}_{hash(job.url)}.json"self.output_dir.joinpath(filename).write_text( json.dumps(vars(job), indent=2, ensure_ascii=False) ) logging.info(f"Saved job: {job.title} at {job.company}")exceptExceptionas e: logging.error(f"Error saving job: {e}")raisedef scrape_jobs(self, test: bool=False) ->int:"""Main scraping method""" jobs_found = jobs_skipped =0for search_url inself._generate_search_urls(): logging.debug(f"Processing search URL: {search_url}") job_urls =self.get_job_urls(search_url)if test and job_urls: job_urls = job_urls[:1]for url in job_urls:if url inself.existing_jobs: jobs_skipped +=1continueif job_data :=self.scrape_job_details(url):self.save_job(job_data)self.existing_jobs.add(url) jobs_found +=1 time.sleep(1) # Rate limiting logging.info(f"Found {jobs_found} new jobs. 
Skipped {jobs_skipped} existing jobs.")return {'jobs_found': jobs_found, 'jobs_skipped': jobs_skipped}def scrape_jobs(test: bool=False, verbose: bool=False) ->int:"""Convenience function to run the scraper"""ifnot verbose: logging.getLogger().setLevel(logging.WARNING)try: scraper = AIJobScraper(output_dir=Path.cwd() /'json_data')return scraper.scrape_jobs(test=test)exceptExceptionas e: logging.error(f"Scraper failed: {e}")raise``````{python}#logging.basicConfig(level=logging.INFO)#scraper = AIJobScraper()#scrape_results = scraper.scrape_jobs(test=False)#print(f"Found {scrape_results['jobs_found']} new jobs, skipped {scrape_results['jobs_skipped']} existing jobs")```# General-purpose LLM completion functionFor the classification tasks, I'll write a general-purpose LLM completion function. This function takes a prompt and a model as parameters, and returns the completion from the OpenAI API.```{python}from openai import AsyncOpenAIimport osasyncdef get_completion(prompt: str, model: str="gpt-4o-2024-08-06") ->str:"""Get a completion from the OpenAI API.""" client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY')) # Initialize with API key from env vars response =await client.chat.completions.create( model=model, messages=[{"role": "system", "content": prompt}], temperature=0 )return response.choices[0].message.content```# Extract Responsibilities, Skills, and QualificationsThe code below uses an LLM to extract key information from job postings that will be used to classify them into different role categories. It uses OpenAI's GPT-4o model to analyze job descriptions and extract three key components: responsibilities, skills, and qualifications. The system processes multiple jobs concurrently with rate limiting, saves the extracted data as JSON files, and includes retry logic for handling API rate limits. The code uses `asyncio` for concurrent processing and includes error handling and logging. It also checks for previously processed jobs to avoid duplicate work.Also included is code that implements data preprocessing steps. This code has functions to load JSON files into a pandas DataFrame, uses a list of keywords to filter out certain jobs that are not relevant to the analysis, and removes duplicate or highly similar job descriptions using TF-IDF vectorization and cosine similarity. The main function `process_job_listings()` combines these steps, taking a directory path, similarity threshold, and filter keywords as parameters. It returns a dictionary containing the processed DataFrame along with counts of the original and filtered entries.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for LLM extraction"import osimport jsonimport loggingimport asyncioimport hashlibimport nest_asynciofrom typing import Dict, List, Optionalimport pandas as pdfrom bs4 import BeautifulSoupfrom tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_typefrom openai import AsyncOpenAI, RateLimitErrorfrom sklearn.feature_extraction.text import TfidfVectorizerfrom sklearn.metrics.pairwise import cosine_similarityimport re# Enable nested event loops for Jupyter notebooksnest_asyncio.apply()# Set up logging configurationlogging.basicConfig( level=logging.INFO,format='%(asctime)s - %(levelname)s - %(message)s')PROMPTS = {"prompt": """ You will be given a Job Description. Your task is to extract a list of Responsibilities, Skills, and Qualifications. - Responsibilities are the tasks and activities that the job requires the employee to perform. 
- Skills are the abilities and knowledge that the employee needs to have to perform the responsibilities. - Qualifications are the requirements that the employee needs to meet to be considered for the job. <JobDescription>{job_description} </JobDescription> Return a list of Responsibilities, Skills, and Qualifications as follows: <Responsibilities> [Bullet point list of responsibilities] </Responsibilities> <Skills> [Bullet point list of skills] </Skills> <Qualifications> [Bullet point list of qualifications] </Qualifications> """}class JobProcessor:def__init__(self, output_dir='json_extracted_data'):self.output_dir = output_dir os.makedirs(output_dir, exist_ok=True)self.semaphore = asyncio.Semaphore(3)self.processed =0self.skipped =0def _parse_llm_response(self, response: str) -> Dict[str, List[str]]:"""Parse LLM response into structured data.""" result = {"responsibilities": [],"skills": [],"qualifications": [] } sections = {"responsibilities": r'<Responsibilities>\n(.*?)\n</Responsibilities>',"skills": r'<Skills>\n(.*?)\n</Skills>',"qualifications": r'<Qualifications>\n(.*?)\n</Qualifications>' }for section, pattern in sections.items():if match := re.search(pattern, response, re.DOTALL): result[section] = [ item.strip('- ') for item in match.group(1).strip().split('\n')if item.strip('- ') # Filter out empty items ]return resultasyncdef extract_job_details(self, job_id: str, job_description: str) -> Dict[str, List[str]]:"""Extract structured information from job description using LLM."""try: prompt = PROMPTS["prompt"].format(job_description=job_description) response =await get_completion(prompt) result =self._parse_llm_response(response)# Save results json_path = os.path.join(self.output_dir, f'{job_id}.json')withopen(json_path, 'w', encoding='utf-8') as f: json.dump(result, f, indent=2, ensure_ascii=False)return resultexceptExceptionas e: logging.error(f"Error extracting job details: {str(e)}")return {"responsibilities": [], "skills": [], "qualifications": []}asyncdef process_single_job(self, job_id: str, description: str) -> Dict[str, List[str]]:"""Process a single job with caching.""" json_path = os.path.join(self.output_dir, f'{job_id}.json')if os.path.exists(json_path):self.skipped +=1 logging.debug(f"Skipping job ID {job_id} - already processed")withopen(json_path, 'r', encoding='utf-8') as f:return json.load(f)asyncwithself.semaphore:self.processed +=1 logging.info(f"Processing job ID: {job_id}")returnawaitself.extract_job_details(job_id, description)asyncdef process_jobs(self, df: pd.DataFrame) -> List[Dict[str, List[str]]]:"""Process multiple jobs concurrently.""" tasks = [self.process_single_job(str(row['id']), row['description'])for _, row in df.iterrows() ] results =await asyncio.gather(*tasks)print(f"\nProcessing complete:")print(f"- New jobs processed: {self.processed}")print(f"- Skipped jobs (already processed): {self.skipped}")return resultsclass DataPreprocessor:def__init__(self, similarity_threshold=0.8):self.similarity_threshold = similarity_thresholdself.default_filter_keywords = ['Dir ', 'Director', 'Intern', 'Data Scientist', 'Data Science','Content Writer', 'Faculty', 'Product Owner', 'Manager','Analyst', 'Postdoctoral', 'Postdoc', 'Summer' ]@staticmethoddef load_json_files(directory='json_data') -> pd.DataFrame:"""Load JSON files into DataFrame with error handling.""" df = pd.DataFrame()ifnot os.path.exists(directory): logging.error(f"Directory {directory} does not exist")return dfforfilein [f for f in os.listdir(directory) if 
f.endswith('.json')]:try:withopen(os.path.join(directory, file), 'r') as f: data = json.load(f) df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)exceptExceptionas e: logging.error(f"Error loading {file}: {e}")ifnot df.empty: df['id'] = df['url'].apply(lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}" )return dfdef filter_by_title(self, df: pd.DataFrame, filter_keywords: Optional[List[str]] =None) -> pd.DataFrame:"""Filter DataFrame by job titles."""if df.empty:return df keywords = filter_keywords orself.default_filter_keywordsreturn df[~df['title'].str.contains('|'.join(keywords), case=False, na=False )]def remove_similar_descriptions(self, df: pd.DataFrame) -> pd.DataFrame:"""Remove similar job descriptions using TF-IDF and cosine similarity."""if df.empty:return df tfidf = TfidfVectorizer(stop_words='english') tfidf_matrix = tfidf.fit_transform(df['description'].fillna('')) cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix) indices_to_remove = { j for i inrange(len(df))for j inrange(i +1, len(df))if cosine_sim[i][j] >self.similarity_threshold }return df.iloc[~df.index.isin(list(indices_to_remove))]def extract_jobs_from_dataframe(df: pd.DataFrame) -> List[Dict[str, List[str]]]:"""Wrapper function to process jobs from a DataFrame.""" processor = JobProcessor() loop = asyncio.get_event_loop()if loop.is_running():return loop.create_task(processor.process_jobs(df))else:return asyncio.run(processor.process_jobs(df))def process_job_listings(directory='json_data', similarity_threshold=0.8, filter_keywords=None) -> Dict:"""Main function to process job listings.""" preprocessor = DataPreprocessor(similarity_threshold)# Load and preprocess data df = preprocessor.load_json_files(directory) initial_count =len(df)if df.empty:return {'processed_data': df,'original_count': 0,'filtered_count': 0 } df = preprocessor.filter_by_title(df, filter_keywords) df_filtered = preprocessor.remove_similar_descriptions(df)return {'processed_data': df_filtered,'original_count': initial_count,'filtered_count': len(df_filtered) }``````{python}results = process_job_listings()df = results['processed_data']print(f'Original count of listings: {results["original_count"]}')print(f'Filtered count of listings: {results["filtered_count"]}')extract_jobs_from_dataframe(df)```# Classification## Classify jobs as relevant to GenAI/LLM work or not```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for job classification as GenAI/LLM relevant or not"import nest_asyncioimport reimport asyncioimport loggingfrom typing import List, Dict, Setfrom bs4 import BeautifulSoupimport pandas as pdfrom tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_typefrom openai import AsyncOpenAIimport osfrom openai import RateLimitErrorimport jsonclass JobClassifierGenAI:"""Handles classification of jobs for GenAI/LLM work""" PROMPT =""" You will be given a list of Responsibilities and Skills listed for a job. Your task is to determine if the job involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)). <Job> <Responsibilities>{responsibilities} </Responsibilities> <Skills>{skills} </Skills> </Job> Start by thinking step-by-step about the Job and its Responsibilities and Skills, and whether it involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)). 
Return your response in the following format: <Analysis> [Your analysis of the job and its Responsibilities and Skills] </Analysis> <FinalAnswer> true|false </FinalAnswer> """def__init__(self, output_dir='role_genai_classifications', batch_size=3):self.output_dir = output_dirself.batch_size = batch_sizeself.semaphore = asyncio.Semaphore(batch_size) os.makedirs(output_dir, exist_ok=True)asyncdef classify_job(self, job: Dict) -> Dict:"""Classify a single job listing""" logging.info(f"Classifying job '{job['filename']}'") prompt =self.PROMPT.format( responsibilities=job['responsibilities'], skills=job.get('skills', '') )try: response =await get_completion(prompt)await asyncio.sleep(2) # Rate limitingexceptExceptionas e: logging.error(f"Failed to classify job '{job['filename']}': {str(e)}")return {**job,'analysis': f'Failed to process: {str(e)}','is_genai_role': None }returnself._parse_response(job, response)def _parse_response(self, job: Dict, response: str) -> Dict:"""Parse LLM response into structured format""" soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml') analysis = soup.find('Analysis') final_answer = soup.find('FinalAnswer') is_genai_role =Noneif final_answer: answer_text = final_answer.text.strip().lower() is_genai_role =Trueif answer_text =='true'elseFalseif answer_text =='false'elseNonereturn {**job,'analysis': analysis.text.strip() if analysis else'','is_genai_role': is_genai_role }def save_classification(self, job_id: str, result: Dict) ->None:"""Save classification results to file""" filename = os.path.join(self.output_dir, f"{job_id}.json")withopen(filename, 'w') as f: json.dump(result, f, indent=2) logging.info(f"Saved classification for job {job_id}")def get_classified_jobs(self) -> Set[str]:"""Get set of already classified job IDs"""ifnot os.path.exists(self.output_dir):returnset()return {f[:-5] for f in os.listdir(self.output_dir) if f.endswith('.json')}asyncdef process_jobs_batch(self, jobs: List[Dict]) ->None:"""Process a batch of jobs concurrently"""asyncdef process_with_semaphore(job: Dict) ->None:asyncwithself.semaphore: result =awaitself.classify_job(job) job_id =str(result['filename'])self.save_classification(job_id, result)await asyncio.gather(*[process_with_semaphore(job) for job in jobs])asyncdef classify_jobs_async(self, df: pd.DataFrame) ->None:"""Process all unclassified jobs in the DataFrame""" total_jobs =len(df) logging.info(f"Starting classification of {total_jobs} jobs") classified_jobs =self.get_classified_jobs() jobs_to_process = [ job.to_dict() for idx, job in df.iterrows() ifstr(job['filename']) notin classified_jobs ]awaitself.process_jobs_batch(jobs_to_process) logging.info(f"Completed classification of all {total_jobs} jobs")def classify_jobs(self, df: pd.DataFrame) ->None:"""Main entry point for job classification"""if df.empty: logging.warning("Empty DataFrame provided")returnif'filename'notin df.columns: logging.error("DataFrame missing required 'filename' column")return classified_jobs =self.get_classified_jobs() logging.info(f"Found {len(classified_jobs)} previously classified jobs") new_jobs = df[~df['filename'].isin(classified_jobs)]if new_jobs.empty: logging.info("No new jobs to classify")return logging.info(f"Processing {len(new_jobs)} new jobs") logging.info(f"Skipping {len(df) -len(new_jobs)} existing jobs") loop = asyncio.get_event_loop() loop.run_until_complete(self.classify_jobs_async(new_jobs))class JobDataLoader:"""Handles loading and preprocessing of job data"""@staticmethoddef 
read_json_files(json_dir='json_extracted_data') -> List[Dict]:"""Read job data from JSON files""" result = []for filename in os.listdir(json_dir):if filename.endswith('.json'): file_path = os.path.join(json_dir, filename)try:withopen(file_path, 'r') as f: data = json.load(f) name = filename[:-5]if'responsibilities'in data and'skills'in data: result.append({'filename': name,'responsibilities': data['responsibilities'],'skills': data['skills'] })except json.JSONDecodeError: logging.error(f"Invalid JSON in {filename}")exceptExceptionas e: logging.error(f"Error processing {filename}: {str(e)}")return result@staticmethoddef load_classifications(input_dir='role_genai_classifications') -> pd.DataFrame:"""Load classification results into DataFrame"""ifnot os.path.exists(input_dir):return pd.DataFrame() all_results = []for filename in os.listdir(input_dir):if filename.endswith('.json'):withopen(os.path.join(input_dir, filename), 'r') as f: classification = json.load(f) all_results.append(classification)return pd.DataFrame(all_results)``````{python}loader = JobDataLoader()jobs = loader.read_json_files()df = pd.DataFrame(jobs)#classifier = JobClassifierGenAI()#classifier.classify_jobs(df)# Load results#results_df = loader.load_classifications()```## Classify GenAI/LLM jobs into pre-defined categoriesFor this next classification task, I'll make the assumption that there are four types of AI engineering and science roles that are relevant to work with GenAI systems. These are:1. AI Research Scientist2. AI/ML Engineer3. MLOps / AI Infrastructure Engineer4. AI Solution ArchitectI'll also include other categories that are not of interest, but may improve classification accuracy. These are:5. Data Scientist6. Data Engineer7. Product Manager8. Software EngineerThe code below implements an automated job classification system that uses an LLM to categorize job postings into the eight predefined roles listed above (four GenAI-focused and four related roles). It consists of several classes that work together: JobClassifier handles the core classification logic by comparing job descriptions against detailed role templates, JobData and ClassificationResult provide structured data containers, and JobProcessor manages the overall pipeline from loading jobs to saving results. The system processes jobs concurrently using asyncio, includes error handling and rate limiting, and outputs both an analysis explaining the classification and a final numerical category (0-8) for each job, with all results saved as JSON files for further analysis.Definitions of the roles can be found in the `JOB_DESCRIPTIONS` variable.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for job classification into pre-defined roles"from dataclasses import dataclassfrom typing import List, Dict, Set, Optionalimport loggingimport jsonimport asyncioimport osfrom bs4 import BeautifulSoupimport pandas as pdimport nest_asyncioJOB_DESCRIPTIONS ="""<Option title="AI Research Scientist" number="1"> <PrimaryFocus> Investigate and adapt cutting-edge AI methodologies (e.g., generative models, advanced prompt engineering) for applications. </PrimaryFocus> <KeyResponsibilities> Conduct experiments to evaluate the performance (e.g., quality, accuracy) of new AI approaches and refine existing models. Collaborate with AI/ML Engineers to transition successful prototypes into production. Stay current with the latest AI research and emerging trends in generative AI. 
Develop human-annotated datasets for training and evaluation of AI models. </KeyResponsibilities> <SkillsAndTools> Deep understanding of LLMs and prompt engineering. Strong background in statistics, optimization, or related fields. Knowledge of experimental methods (e.g., A/B testing) and hypothesis testing. Knowledge of LLM evaluation methods, including algorithmic evals, human evals, or LLM-as-a-judge evals. </SkillsAndTools></Option><Option title="AI/ML Engineer" number="2"> <PrimaryFocus> Transform research output into robust, scalable AI solutions for the product or internal use. </PrimaryFocus> <KeyResponsibilities> Productionize AI models, ensuring they meet performance and reliability requirements. Develop and maintain data pipelines for model training, inference, and monitoring. Collaborate closely with Research Scientists to optimize and refine model implementations. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, Go, or similar languages. Experience with API development and integration (REST, GraphQL). Working knowledge of software engineering best practices (version control, testing, CI/CD). </SkillsAndTools></Option><Option title="MLOps / AI Infrastructure Engineer" number="3"> <PrimaryFocus> Ensure reliable deployment, scaling, and monitoring of AI systems in production. </PrimaryFocus> <KeyResponsibilities> Set up CI/CD pipelines tailored for AI workflows, including model versioning and data governance. Monitor production models for performance, latency, and data drift, implementing necessary updates. Manage infrastructure for scalable AI deployments (Docker, Kubernetes, cloud services). </KeyResponsibilities> <SkillsAndTools> Strong DevOps background, with tools like Docker, Kubernetes, and Terraform. Familiarity with ML orchestration/monitoring tools (MLflow, Airflow, Prometheus). Experience optimizing compute usage (GPU/TPU) for cost-effective scaling. </SkillsAndTools></Option><Option title="AI Solution Architect" number="4"> <PrimaryFocus> Design and orchestrate AI solutions leveraging generative models and LLM technologies to create impactful experiences and solutions that align with business objectives. </PrimaryFocus> <KeyResponsibilities> Collaborate with subject matter experts (SMEs) to identify and refine opportunities for generative AI/LLM-based use cases. Assess feasibility and define high-level solution architectures, ensuring they address core business and user requirements. Develop technical proposals and roadmaps, translating complex requirements into actionable plans. Provide thought leadership on conversational design, user experience flow, and model interaction strategies. Ensure solutions comply with relevant data governance, privacy, and security considerations. Facilitate cross-functional collaboration, guiding teams through solution conceptualization and implementation phases. </KeyResponsibilities> <SkillsAndTools> Strong understanding of LLM capabilities and prompt engineering principles. Experience with conversational experience design (e.g., chatbots, voice interfaces) and user journey mapping. Ability to analyze business needs and translate them into feasible AI solution proposals. Familiarity with data privacy and security best practices, especially as they pertain to AI solutions. Excellent communication and stakeholder management skills to align technical and non-technical teams. 
</SkillsAndTools></Option><Option title="Data Scientist" number="5"> <PrimaryFocus> Leverage statistical analysis, machine learning, and data visualization to derive actionable insights and guide data-informed decisions. </PrimaryFocus> <KeyResponsibilities> Perform exploratory data analysis (EDA) to identify trends and patterns in large, complex datasets. Develop and validate predictive and prescriptive models, collaborating with cross-functional teams to implement these solutions. Design and execute experiments to test hypotheses, measure impact, and inform business strategies. Present findings and recommendations to stakeholders in a clear, concise manner using visualizations and dashboards. Work with data engineers to ensure data quality, governance, and availability. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, R, or SQL for data manipulation and analysis. Experience with common ML libraries (e.g., scikit-learn, XGBoost) and deep learning frameworks (e.g., PyTorch, TensorFlow). Solid grounding in statistics, probability, and experimental design. Familiarity with data visualization tools (e.g., Tableau, Power BI) for communicating insights. Strong analytical thinking and ability to translate complex data problems into business solutions. </SkillsAndTools> </Option><Option title="Data Engineer" number="6"> <PrimaryFocus> Design, build, and maintain scalable data pipelines and architectures that enable efficient data collection, storage, and analysis. </PrimaryFocus> <KeyResponsibilities> Develop and optimize data ingestion and transformation processes (ETL/ELT), ensuring high performance and reliability. Implement and manage data workflows, integrating internal and external data sources. Collaborate with Data Scientists, AI/ML Engineers, and other stakeholders to ensure data readiness for analytics and model training. Monitor data pipelines for performance, reliability, and cost-effectiveness, taking corrective actions when needed. Maintain data quality and governance standards, including metadata management and data cataloging. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, SQL, and distributed data processing frameworks (e.g., Spark, Kafka). Experience with cloud-based data ecosystems (AWS, GCP, or Azure), and related storage/processing services (e.g., S3, BigQuery, Dataflow). Familiarity with infrastructure-as-code and DevOps tools (Terraform, Docker, Kubernetes) for automating data platform deployments. Strong understanding of database systems (relational, NoSQL) and data modeling principles. Knowledge of data orchestration and workflow management tools (Airflow, Luigi, Dagster). </SkillsAndTools></Option><Option title="Product Manager" number="7"> <PrimaryFocus> Drive the product vision and strategy, ensuring alignment with business goals and user needs while delivering impactful AI-driven solutions. </PrimaryFocus> <KeyResponsibilities> Conduct user and market research to identify opportunities, define product requirements, and set success metrics. Collaborate with cross-functional teams (Engineering, Data Science, Design) to prioritize features and plan releases. Develop and communicate product roadmaps, ensuring stakeholders are aligned on goals and timelines. Monitor product performance through data analysis and user feedback, iterating on improvements and new feature ideas. Facilitate agile development practices, writing clear user stories and acceptance criteria. 
</KeyResponsibilities> <SkillsAndTools> Strong understanding of product lifecycle management and agile methodologies (Scrum/Kanban). Excellent communication, negotiation, and stakeholder management skills. Experience with product management and collaboration tools (e.g., Jira, Confluence, Trello). Analytical mindset for leveraging metrics, A/B testing, and user feedback in decision-making. Familiarity with AI/ML concepts and the ability to translate technical possibilities into viable product features. </SkillsAndTools></Option><Option title="Software Engineer" number="8"> <PrimaryFocus> Design, develop, and maintain high-quality software applications and services that address user needs and align with overall business objectives. </PrimaryFocus> <KeyResponsibilities> Collaborate with cross-functional teams (Product, Design, QA) to interpret requirements and deliver robust solutions. Write clean, efficient, and testable code following best practices and coding standards. Participate in system architecture and design discussions, contributing to the evolution of technical roadmaps. Perform code reviews and provide constructive feedback to peers, maintaining a high bar for code quality. Implement and maintain CI/CD pipelines to streamline deployment and reduce manual interventions. Continuously improve system performance and scalability through profiling and optimization. </KeyResponsibilities> <SkillsAndTools> Proficiency in one or more programming languages (e.g., Java, Python, JavaScript, C++). Experience with modern frameworks/libraries (e.g., Spring Boot, Node.js, React, Django). Solid understanding of software design principles (e.g., SOLID, DRY) and architectural patterns (e.g., microservices). Familiarity with version control systems (Git), testing frameworks, and agile methodologies. Working knowledge of containerization (Docker), orchestration (Kubernetes), and cloud platforms (AWS, Azure, GCP). </SkillsAndTools></Option>"""PROMPTS = {"prompt": """You will be given a list of Responsibilities andSkills listed for a job. 
Your task is to determineif the job is a good fit with any of the Options,and if so, which one.<Job><Responsibilities>{responsibilities}</Responsibilities><Skills>{skills}</Skills></Job><Options>{Options}</Options>Start by thinking step-by-step about the Job and its Responsibilities and Skills, in relation to each of the Options.Decide if the Job is a good fit with ANY of the Options.If NONE of the Options are relevant to the Job, say so and return a 0 as your FinalAnswer.Otherwise, decide which of the Options is the most similarto the Job and return its number as your FinalAnswer.Return your response in the following format:<Analysis>[Your analysis of the job and its Responsibilities and Skills, in relation each of the Options]</Analysis><FinalAnswer>0|1|2|3|4|5|6|7|8</FinalAnswer>"""}# Enable nested event loopsnest_asyncio.apply()@dataclassclass JobData:"""Represents a job posting with extracted information.""" filename: str responsibilities: List[str] skills: List[str]@dataclassclass ClassificationResult:"""Represents the result of a job classification.""" filename: str responsibilities: List[str] skills: List[str] analysis: str role_classification: Optional[int] role_title: Optional[str]class JobClassifier:"""Handles classification of jobs into predefined roles."""def__init__(self, output_dir: str='role_classifications', batch_size: int=3):self.output_dir = output_dirself.batch_size = batch_sizeself.semaphore = asyncio.Semaphore(batch_size) os.makedirs(output_dir, exist_ok=True)asyncdef classify_job(self, job: JobData) -> ClassificationResult:"""Classify a single job listing.""" logging.info(f"Classifying job '{job.filename}'") prompt = PROMPTS["prompt"].format( responsibilities=job.responsibilities, # Access as attribute instead of dict skills=job.skills, # Access as attribute instead of dict Options=JOB_DESCRIPTIONS )try: response =await get_completion(prompt)await asyncio.sleep(5) # Rate limitingreturnself._parse_response(job, response)exceptExceptionas e: logging.error(f"Failed to classify job '{job.filename}': {str(e)}")return ClassificationResult( filename=job.filename, responsibilities=job.responsibilities, skills=job.skills, analysis=f'Failed to process: {str(e)}', role_classification=None, role_title=None )def _parse_response(self, job: JobData, response: str) -> ClassificationResult:"""Parse LLM response into structured format.""" soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml') analysis = soup.find('Analysis') role_choice = soup.find('FinalAnswer') role_number =int(role_choice.text.strip()) if role_choice elseNone role_title =self._get_role_title(role_number)return ClassificationResult( filename=job.filename, responsibilities=job.responsibilities, skills=job.skills, analysis=analysis.text.strip() if analysis else'', role_classification=role_number, role_title=role_title )def _get_role_title(self, role_number: Optional[int]) -> Optional[str]:"""Get the title for a role number."""if role_number isNone:returnNoneif role_number ==0:return"Other" wrapped_xml =f"<root>{JOB_DESCRIPTIONS}</root>" job_descriptions_soup = BeautifulSoup(wrapped_xml, 'lxml-xml') matching_job = job_descriptions_soup.find('Option', {'number': str(role_number)})if matching_job:return matching_job['title'] logging.error(f"No matching job found for role number {role_number}")returnNoneclass JobProcessor:"""Handles the processing of job data files."""def__init__(self, input_dir: str='json_extracted_data'):self.input_dir = input_dirself.classifier = JobClassifier()def load_jobs(self) -> 
List[JobData]:"""Load jobs from JSON files.""" jobs = []# Get list of all JSON files total_files =len([f for f in os.listdir(self.input_dir) if f.endswith('.json')])# Get files that were classified as GenAI-relevant classified_files =self._get_classified_files()# Get files that have already been processed processed_files = { f[:-5] for f in os.listdir(self.classifier.output_dir) if f.endswith('.json') }# Get relevant files that haven't been processed yet files_to_process = classified_files - processed_files logging.info(f"Found {total_files} total files") logging.info(f"Found {len(classified_files)} relevant GenAI files") logging.info(f"Already processed: {len(processed_files)} files") logging.info(f"Remaining to process: {len(files_to_process)} files")# Only process files that are both relevant and unprocessedfor filename in os.listdir(self.input_dir):ifnot filename.endswith('.json'):continue name = filename[:-5]if name notin files_to_process:continuetry: job =self._load_job_file(filename)if job: jobs.append(job)exceptExceptionas e: logging.error(f"Error processing {filename}: {str(e)}") logging.info(f"Processing {len(jobs)} remaining jobs")return jobsdef _load_job_file(self, filename: str) -> Optional[JobData]:"""Load and parse a single job file.""" file_path = os.path.join(self.input_dir, filename)try:withopen(file_path, 'r') as f: data = json.load(f)if'responsibilities'in data and'skills'in data:return JobData( filename=filename[:-5], responsibilities=data['responsibilities'], skills=data['skills'] )except json.JSONDecodeError: logging.error(f"Invalid JSON in {filename}")returnNonedef _get_classified_files(self) -> Set[str]:"""Get set of files that have been previously classified as GenAI-related.""" genai_dir ='role_genai_classifications'ifnot os.path.exists(genai_dir):returnset() genai_files =set()for f in os.listdir(genai_dir):if f.endswith('.json'):try:withopen(os.path.join(genai_dir, f), 'r') asfile: data = json.load(file)if data.get('is_genai_role') isTrue: # Explicitly check for True genai_files.add(f[:-5])exceptExceptionas e: logging.error(f"Error reading {f}: {e}")return genai_filesasyncdef process_jobs(self) ->None:"""Process all jobs.""" jobs =self.load_jobs()ifnot jobs:return logging.info(f"Processing {len(jobs)} relevant jobs")awaitself._process_jobs_batch(jobs) logging.info("Job classification complete")asyncdef _process_jobs_batch(self, jobs: List[JobData]) ->None:"""Process a batch of jobs concurrently."""asyncdef process_with_semaphore(job: JobData) ->None:asyncwithself.classifier.semaphore: result =awaitself.classifier.classify_job(job)self._save_result(result)await asyncio.gather(*[process_with_semaphore(job) for job in jobs])def _save_result(self, result: ClassificationResult) ->None:"""Save classification result to file.""" filename = os.path.join(self.classifier.output_dir, f"{result.filename}.json")withopen(filename, 'w') as f: json.dump(vars(result), f, indent=2) logging.info(f"Saved classification for job {result.filename}")def classify_job_roles(df: pd.DataFrame) ->None:"""Main entry point for job classification."""if df.empty: logging.warning("Empty DataFrame provided")return processor = JobProcessor() loop = asyncio.get_event_loop() loop.run_until_complete(processor.process_jobs())``````{python}import logginglogging.basicConfig(level=logging.INFO)loader = JobDataLoader()jobs = loader.read_json_files()df = pd.DataFrame(jobs)classify_job_roles(df)```# Load final dataNow I'll load and combine all of the different data outputs into a single dataframe.The code 
# Load final data

Now I'll load and combine all of the different data outputs into a single DataFrame. The code block below uses a `DataLoader` class, with helper classes `DataPaths` and `SalaryProcessor`, to handle this step. It merges the job posting data with the role classifications, extracts salary ranges from the raw salary strings (converting CAD to USD), and filters the results down to the four AI-related roles of interest: AI Research Scientist, AI Solution Architect, AI/ML Engineer, and MLOps/AI Infrastructure Engineer. It includes error handling for JSON file loading and salary parsing, and the final output is a pandas DataFrame containing job details, role classifications, and standardized salary information.

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for loading and merging all data sources"
from typing import List, Tuple, Optional, Dict
import pandas as pd
import json
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataPaths:
    """Configuration for data file paths"""
    ROLE_CLASSIFICATIONS_DIR: Path = Path('role_classifications')
    JSON_DATA_DIR: Path = Path('json_data')


class DataLoader:
    def __init__(self, paths: DataPaths = DataPaths()):
        self.paths = paths
        self.salary_processor = SalaryProcessor()

    def load_and_merge_data(self) -> pd.DataFrame:
        """Load and merge all data sources"""
        # Load and merge base data
        merged_df = self._load_base_data()
        # Process salaries
        salary_data = self.salary_processor.extract_salary_ranges(merged_df['salary'])
        merged_df = pd.concat([merged_df, salary_data], axis=1)
        # Filter out any jobs not in this list
        merged_df = merged_df[merged_df['role_title'].isin(
            ['AI Research Scientist',
             'AI Solution Architect',
             'AI/ML Engineer',
             'MLOps / AI Infrastructure Engineer']
        )]
        print(f"Shape of merged DataFrame: {merged_df.shape}")
        return merged_df

    def _load_base_data(self) -> pd.DataFrame:
        """Load and merge base data sources"""
        # Load role classifications
        role_data = pd.DataFrame(self._load_json_files(self.paths.ROLE_CLASSIFICATIONS_DIR))
        # Load and process jobs data
        jobs_data = pd.DataFrame(self._load_json_files(self.paths.JSON_DATA_DIR))
        jobs_data['filename'] = jobs_data['url'].apply(
            lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
        )
        # Merge dataframes
        merged_df = pd.merge(jobs_data, role_data, on='filename', how='left')
        return merged_df

    def _load_json_files(self, directory: Path) -> List[Dict]:
        """Load JSON files from directory"""
        json_files = []
        for file_path in directory.glob('*.json'):
            try:
                with open(file_path, 'r') as f:
                    json_files.append(json.load(f))
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
        return json_files


class SalaryProcessor:
    """Handles salary-related processing"""
    CAD_PER_USD = 1.44  # 1 USD = 1.44 CAD

    def extract_salary_ranges(self, salary_series: pd.Series) -> pd.DataFrame:
        """Extract salary ranges from a series of salary strings"""
        salary_data = salary_series.apply(self._extract_salary_range)
        return pd.DataFrame({
            'min_salary': salary_data.apply(lambda x: x[0]),
            'max_salary': salary_data.apply(lambda x: x[1]),
            'mid_salary': salary_data.apply(
                lambda x: (x[0] + x[1]) / 2 if x[0] and x[1] else None
            )
        })

    def _extract_salary_range(self, salary_str: str) -> Tuple[Optional[float], Optional[float]]:
        """Extract minimum and maximum salary from salary string"""
        try:
            if not isinstance(salary_str, str) or '0+' in salary_str:
                return None, None
            # Determine currency and strip currency/unit markers
            is_cad = 'CAD' in salary_str
            nums = (salary_str.replace('CAD ', '')
                              .replace('USD ', '')
                              .replace('K', '')
                              .split(' - '))
            if len(nums) != 2:
                return None, None
            # Convert to float; CAD figures are divided by the
            # CAD-per-USD rate to express them in USD
            min_salary = float(nums[0])
            max_salary = float(nums[1])
            if is_cad:
                min_salary /= self.CAD_PER_USD
                max_salary /= self.CAD_PER_USD
            return min_salary, max_salary
        except Exception:
            return None, None
```
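Before loading everything, it's worth spot-checking the salary parser. The strings below are illustrative, assuming the `'USD 150K - 200K'` / `'CAD 144K - 288K'` patterns the parser expects; anything it can't split into two numbers should come back as `(None, None)`:

```{python}
# Spot-check the salary parser on illustrative strings.
# (Formats are assumed to match the scraped data; examples are made up.)
sp = SalaryProcessor()
print(sp._extract_salary_range('USD 150K - 200K'))  # (150.0, 200.0)
print(sp._extract_salary_range('CAD 144K - 288K'))  # (100.0, 200.0) after CAD -> USD
print(sp._extract_salary_range('Competitive'))      # (None, None)
```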
```{python}
# Initialize the loader and load the data, caching the result to CSV
if not os.path.exists('genai_job_data.csv'):
    loader = DataLoader()
    df = loader.load_and_merge_data()
    df.to_csv('genai_job_data.csv', index=False)
else:
    df = pd.read_csv('genai_job_data.csv')

# Now df contains all the merged data with processed salaries
print(f"Total jobs loaded: {len(df)}")

# View available columns
print("\nAvailable columns:")
print(df.columns.tolist())

# View distribution of roles
print("\nRole distribution:")
print(df['role_title'].value_counts())
```

# Salary analysis

First I'll analyze the salary data for each role and experience level, using a RandomForest model to adjust for location differences (CAD-denominated salaries were already converted to USD during loading). I'm interested to see what compensation is like for each role, and whether there are any significant differences in compensation between the roles.

The code below implements the salary analysis. For each role and level combination, it removes statistical outliers with an IQR filter, computes location-adjusted z-scores by standardizing the residuals of a RandomForest model fit on one-hot location indicators, and reports median salary ranges and sample sizes. Analysis parameters live in a `SalaryConfig` dataclass, and results are formatted as salary ranges in thousands of USD.
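The location adjustment is worth unpacking: the RandomForest regresses salary on one-hot location dummies, so its prediction is the location-expected salary, and standardizing the residuals tells us whether a posting pays above or below the norm for its market. Here are the mechanics in miniature, with made-up numbers:

```{python}
# The core of the location adjustment, in miniature with toy data:
# regress salary on one-hot location dummies, then standardize the
# residuals so salaries are comparable across markets.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

toy = pd.DataFrame({
    'location': ['SF', 'SF', 'Austin', 'Austin', 'NYC', 'NYC'],
    'salary':   [220, 240, 160, 170, 200, 210],
})
X = pd.get_dummies(toy['location'], prefix='loc')
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, toy['salary'])
residuals = toy['salary'] - model.predict(X)
zscores = (residuals - residuals.mean()) / residuals.std()
print(zscores.round(2))  # which postings pay above/below their local norm
```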
```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for salary analysis"
from dataclasses import dataclass
from typing import Dict, List, Optional, TypedDict, Set
import logging

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


@dataclass
class SalaryConfig:
    """Configuration settings for salary analysis"""
    outlier_threshold: float = 1.5
    n_estimators: int = 100
    min_group_size: int = 3
    currency_symbol: str = '$'
    cv_folds: int = 5


class SalaryColumns(TypedDict):
    """Type definitions for salary DataFrame columns"""
    min_salary_usd_k: float
    max_salary_usd_k: float
    median_salary_usd_k: float
    sample_size: int
    location_adjusted_zscore: float


class SalaryAnalyzer:
    """Analyzes salary distributions across role titles.

    Column Naming Conventions:
    - *_usd_k: Values in thousands of USD
    - *_zscore: Standardized scores (mean=0, std=1)
    - sample_size: Number of data points in group
    - *_median: Median values
    - location_adjusted_*: Values adjusted for location differences

    Attributes:
        config: Configuration settings for analysis
        logger: Logger instance for the class
    """
    # Input column names
    COL_MIN_SALARY = 'min_salary'
    COL_MAX_SALARY = 'max_salary'
    COL_ROLE_TITLE = 'role_title'
    COL_LEVEL = 'level'
    COL_LOCATION = 'location'

    # Processed column names
    COL_MEDIAN_SALARY = 'median_salary_usd_k'
    COL_MIN_SALARY_MEDIAN = 'min_salary_median_usd_k'
    COL_MAX_SALARY_MEDIAN = 'max_salary_median_usd_k'
    COL_SAMPLE_SIZE = 'sample_size'

    # Z-score columns
    COL_MIN_ZSCORE = 'min_salary_location_adjusted_zscore'
    COL_MID_ZSCORE = 'median_salary_location_adjusted_zscore'
    COL_MAX_ZSCORE = 'max_salary_location_adjusted_zscore'

    def __init__(self, config: Optional[SalaryConfig] = None):
        """Initialize SalaryAnalyzer with optional configuration."""
        self.config = config or SalaryConfig()
        self.logger = logging.getLogger(__name__)
        self._model = None  # Cache for trained model

    def analyze_role_salaries(self, df: pd.DataFrame) -> pd.DataFrame:
        """Analyze salary distributions across role titles with location adjustment.

        Args:
            df: DataFrame containing columns:
                - min_salary: Minimum salary (float)
                - max_salary: Maximum salary (float)
                - role_title: Job role (str)
                - level: Experience level (str)
                - location: Job location (str)

        Returns:
            DataFrame with aggregated salary statistics by role and level

        Raises:
            ValueError: If required columns are missing or DataFrame is empty
        """
        self._validate_input(df)
        self.logger.info(f"Starting salary analysis with {len(df)} records")
        # Create working copy of DataFrame
        analysis_df = self._prepare_data(df)
        # Process and clean data
        analysis_df = self._process_data(analysis_df)
        # Generate final results
        results = self._aggregate_results(analysis_df)
        self.logger.info("Salary analysis completed successfully")
        return results

    def _validate_input(self, df: pd.DataFrame) -> None:
        """Validate input DataFrame."""
        if df is None or df.empty:
            raise ValueError("Input DataFrame cannot be None or empty")
        required_cols = {
            self.COL_MIN_SALARY,
            self.COL_MAX_SALARY,
            self.COL_ROLE_TITLE,
            self.COL_LEVEL,
            self.COL_LOCATION
        }
        missing_cols = required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"DataFrame missing required columns: {missing_cols}")

    def _prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Prepare data for analysis by filtering valid salaries."""
        valid_salary_mask = (
            df[self.COL_MIN_SALARY].notna() &
            df[self.COL_MAX_SALARY].notna()
        )
        if not valid_salary_mask.any():
            self.logger.warning("No valid salary data found")
            return pd.DataFrame()
        # Create working copy and calculate median salary
        analysis_df = df[valid_salary_mask].copy()
        analysis_df[self.COL_MEDIAN_SALARY] = (
            analysis_df[[self.COL_MIN_SALARY, self.COL_MAX_SALARY]].mean(axis=1)
        )
        return analysis_df

    def _remove_outliers(self, group: pd.DataFrame) -> pd.DataFrame:
        """Remove statistical outliers from salary data."""
        if len(group) <= self.config.min_group_size:
            return group
        # Only check min and max salary for outliers
        salary_cols = [self.COL_MIN_SALARY, self.COL_MAX_SALARY]
        for col in salary_cols:
            Q1 = group[col].quantile(0.25)
            Q3 = group[col].quantile(0.75)
            IQR = Q3 - Q1
            outlier_mask = ~(
                (group[col] < (Q1 - self.config.outlier_threshold * IQR)) |
                (group[col] > (Q3 + self.config.outlier_threshold * IQR))
            )
            group = group[outlier_mask]
        # Recalculate median after removing outliers
        group[self.COL_MEDIAN_SALARY] = (
            group[self.COL_MIN_SALARY] + group[self.COL_MAX_SALARY]
        ) / 2
        return group
    def _aggregate_results(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate and format the final results."""
        # First calculate group statistics
        grouped = df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL]).agg({
            self.COL_MIN_SALARY: ['count', 'median'],
            self.COL_MAX_SALARY: 'median',
        }).round(0)
        # Calculate median salary from min and max medians
        grouped['median_salary'] = (
            grouped[(self.COL_MIN_SALARY, 'median')] +
            grouped[(self.COL_MAX_SALARY, 'median')]
        ) / 2
        # Flatten and rename columns
        grouped.columns = [
            'sample_size' if col == (self.COL_MIN_SALARY, 'count')
            else 'min_salary_usd_k' if col == (self.COL_MIN_SALARY, 'median')
            else 'max_salary_usd_k' if col == (self.COL_MAX_SALARY, 'median')
            else 'median_salary_usd_k' if col[0] == 'median_salary'
            else col
            for col in grouped.columns
        ]
        # Order columns
        ordered_cols = [
            'sample_size',
            'min_salary_usd_k',
            'max_salary_usd_k',
            'median_salary_usd_k'
        ]
        result = grouped.reindex(columns=ordered_cols)
        # Format salary values
        salary_cols = [col for col in ordered_cols if col.endswith('_usd_k')]
        for col in salary_cols:
            result[col] = result[col].apply(
                lambda x: f"{self.config.currency_symbol}{x:,.0f}K"
                if pd.notna(x) else "N/A"
            )
        return result

    def _process_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process and clean the salary data."""
        # Remove outliers by group
        df = (df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL])
                .apply(self._remove_outliers)
                .reset_index(drop=True))
        # Add location-adjusted scores
        df = self._add_location_adjusted_scores(df)
        # Clean level names
        df[self.COL_LEVEL] = (df[self.COL_LEVEL]
                              .str.replace('-Level / Level', '', regex=False)
                              .str.replace('-level', '', regex=False))
        return df

    def _train_model(self) -> RandomForestRegressor:
        """Create and cache the RandomForest model."""
        if self._model is None:
            self._model = RandomForestRegressor(
                n_estimators=self.config.n_estimators,
                random_state=42
            )
        return self._model

    def _add_location_adjusted_scores(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add location-adjusted z-scores to the dataframe."""
        location_dummies = pd.get_dummies(df[self.COL_LOCATION], prefix='loc')
        model = self._train_model()
        score_columns = {
            self.COL_MIN_SALARY: self.COL_MIN_ZSCORE,
            self.COL_MEDIAN_SALARY: self.COL_MID_ZSCORE,
            self.COL_MAX_SALARY: self.COL_MAX_ZSCORE
        }
        for source_col, target_col in score_columns.items():
            df[target_col] = self._adjust_salaries(
                df[source_col], location_dummies, model
            )
        return df

    def _adjust_salaries(self, salary_series: pd.Series,
                         X: pd.DataFrame,
                         model: RandomForestRegressor) -> pd.Series:
        """Adjust salaries using the RandomForest model."""
        # Evaluate model performance
        scores = cross_val_score(
            model, X, salary_series, cv=self.config.cv_folds
        )
        self.logger.debug(
            f"Cross-validation scores: {scores.mean():.3f} ± {scores.std():.3f}"
        )
        # Fit model and calculate residuals
        model.fit(X, salary_series)
        expected = model.predict(X)
        residuals = salary_series - expected
        # Return standardized residuals
        return (residuals - residuals.mean()) / residuals.std()
```

```{python}
salary_analyzer = SalaryAnalyzer()
salary_df = salary_analyzer.analyze_role_salaries(df)
salary_df
```

Looking across all four GenAI roles (AI Research, Solutions Architecture, ML Engineering, and MLOps), compensation bands are similar. At senior levels (where I have the largest samples), median salaries cluster tightly between ~$195K and ~$210K USD.
Mid-level positions show medians between ~$165K and ~$180K USD, and entry-level positions generally start between ~$155K and ~$205K USD, although the entry-level data is limited by small samples. This consistency suggests these roles are valued similarly in the market, despite their different focuses.

# Job responsibilities

Next, I'm interested to see what the job responsibilities are for some of these roles, and how they differ. I'll use an LLM to identify the most common responsibilities and skills for a given role. This task is quite complex and involves a lot of input data, so a more capable model like `o1` is appropriate. I don't have API access to `o1`, so I'll use the web interface to generate completions, preparing XML files of the inputs so that I can copy-paste them into the chat.

As we'll see, Research Scientists lean heavily into discovery, theoretical advances, publication, and cutting-edge experimentation, whereas ML Engineers center on production-grade systems, robust architecture, MLOps, and alignment with business needs. In many cases, though, the lines between these roles are blurred.

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for XML generation"
import logging
import os
from typing import Dict

import pandas as pd

INSTRUCTION_PREFIX = """Identify the most common Responsibilities and Skills listed in the jobs below, returning a bulleted list in the format

# Responsibilities
- [responsibility]: [responsibility description]
- [responsibility]: [responsibility description]
...

# Skills
- [skill]: [skill description]
- [skill]: [skill description]
...

"""


class XMLGenerator:
    """Handles generation of XML files from job data"""

    def __init__(self, output_dir: str = 'xml_roles'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )

    def compile_xml_for_role(self, level: str, role_title: str, df: pd.DataFrame) -> Dict:
        """Generate XML-formatted string of job details for the given level and role title."""
        # Filter DataFrame for matching level and role
        filtered_df = df[
            (df['level'] == level) &
            (df['role_title'] == role_title)
        ]
        if len(filtered_df) == 0:
            return {
                'level': level,
                'role_title': role_title,
                'xml_output': '',
                'error': f"No jobs found for {level} {role_title}"
            }
        # Build XML string
        xml_output = ""
        for _, job in filtered_df.iterrows():
            xml_output += "<Job>\n"
            xml_output += f"<Responsibilities>{job['responsibilities']}</Responsibilities>\n"
            xml_output += f"<Skills>{job['skills']}</Skills>\n"
            xml_output += "</Job>\n"
        return {
            'level': level,
            'role_title': role_title,
            'xml_output': INSTRUCTION_PREFIX + xml_output
        }

    def save_xml(self, level: str, role_title: str, xml_data: Dict) -> None:
        """Save XML data to a file"""
        if not xml_data['xml_output']:
            # logging.warning(xml_data.get('error', 'Empty XML output'))
            return
        # Clean and sanitize the filename components
        cleaned_level = level.replace('-Level / Level', '').strip()
        safe_level = cleaned_level.replace('/', '_').replace(' ', '_')
        safe_role = role_title.replace('/', '_').replace(' ', '_')
        filename = os.path.join(self.output_dir, f"{safe_level}_{safe_role}.xml")
        with open(filename, 'w') as f:
            f.write(xml_data['xml_output'])
        logging.debug(f"Saved XML for {cleaned_level} {role_title} to {filename}")

    def generate_all_xml(self, df: pd.DataFrame) -> None:
        """Generate XML files for all role/level combinations"""
        if df.empty:
            logging.warning("No jobs found in DataFrame - check if data was loaded correctly")
            return
        required_cols = ['level', 'role_title', 'responsibilities', 'skills']
        if not all(col in df.columns for col in required_cols):
            logging.error(f"DataFrame missing required columns: {required_cols}")
            return
        # Process each unique role/level combination
        role_combinations = df[['level', 'role_title']].drop_duplicates()
        for _, combo in role_combinations.iterrows():
            xml_data = self.compile_xml_for_role(
                combo['level'], combo['role_title'], df
            )
            self.save_xml(combo['level'], combo['role_title'], xml_data)
```

```{python}
xml_generator = XMLGenerator()
xml_generator.generate_all_xml(df)
```
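Each generated file is just the instruction header followed by the filtered postings serialized as `<Job>` elements. As a quick check of that structure, the sketch below previews the start of whichever file the generator wrote first:

```{python}
# Peek at the start of one generated XML file to confirm its structure:
# the instruction header, then one <Job> element per posting.
from glob import glob

xml_files = sorted(glob('xml_roles/*.xml'))
if xml_files:
    with open(xml_files[0], 'r') as f:
        print(f.read()[:400])
```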
## o1-generated Senior AI Research Scientist

Here's what `o1` said about the Responsibilities and Skills observed in Senior AI Research Scientist roles:

### Responsibilities

- **Conduct advanced AI research**: Many roles require pushing the state of the art in Generative AI, LLMs, and related areas (e.g., video/multimodal models, diffusion models) through novel algorithms, architectures, and experimental studies.
- **Train and fine-tune large-scale models**: Commonly involves working with massive datasets and distributed training setups (thousands of GPUs, HPC environments) to develop foundation models and advanced AI systems.
- **Develop and implement new algorithms or architectures**: Spans designing novel model architectures (e.g., diffusion, transformer-based, multimodal fusion) and creating robust data processing or simulation pipelines to support AI solutions.
- **Collaborate with cross-functional teams**: Emphasizes close work with engineering, product management, research, and external stakeholders to integrate AI breakthroughs into real-world applications and products.
- **Evaluate and measure AI performance**: Entails building rigorous evaluation frameworks, designing new metrics, and systematically analyzing model behavior to ensure quality and reliability.
- **Publish and communicate research findings**: Many positions highlight writing influential papers, presenting at conferences, and sharing innovative results both internally and with the broader AI community.
- **Build and maintain data pipelines**: Involves constructing high-quality, scalable data pipelines or tooling to support training, fine-tuning, and inference of large models.
- **Ensure production-grade implementation**: Requires writing clean, efficient, and maintainable code as well as optimizing models and pipelines to meet performance, reliability, and quality standards.

### Skills

- **Proficiency in Python and deep learning frameworks**: Strong coding skills in Python and hands-on experience with libraries such as PyTorch, TensorFlow, or JAX appear in nearly every role.
- **Expertise with LLMs and Generative AI**: Deep understanding of transformer architectures, diffusion models, multimodal systems, prompt engineering, and other advanced AI techniques is frequently mentioned.
- **Experience with large-scale/distributed training**: Many roles emphasize knowledge of HPC, GPU optimization, model parallelism (e.g., FSDP, DeepSpeed, Megatron-LM), and handling massive datasets.
- **Strong software engineering practices**: Testing, code review, debugging, version control, and producing clean, modular research or production code are consistently important.
- **Collaboration and communication skills**: Clear written and verbal communication, along with cross-functional teamwork, is vital for integrating AI solutions into products and relaying complex ideas.
- **Research acumen and adaptability**: Ability to read, interpret, and prototype cutting-edge AI literature,
publish findings, and rapidly iterate on experiments.
- **Machine Learning fundamentals**: Solid grounding in ML theory (e.g., optimization, statistics, data structures) and experience with model evaluation, data manipulation, and pipeline design.
- **Familiarity with prompt engineering and advanced NLP concepts**: Many roles highlight crafting effective prompts, aligning model outputs with user needs, and leveraging text-generation or conversational AI techniques.

## o1-generated Senior AI ML Engineer

And here's what `o1` said about the Responsibilities and Skills observed in Senior AI ML Engineer roles:

### Responsibilities

- **Design, develop, and deploy AI/ML solutions**: End-to-end creation of machine learning systems, from initial prototypes to production-ready deployments.
- **Collaborate with cross-functional teams**: Work closely with product managers, data scientists, engineers, and other stakeholders to align technical solutions with business goals.
- **Monitor and optimize model performance**: Track key metrics, fine-tune models, and iterate to ensure continuous improvement and reliability in production.
- **Stay current with AI research and emerging technologies**: Keep up to date with the latest breakthroughs in areas like LLMs, generative AI, and deep learning.
- **Mentor and coach team members**: Provide guidance on best practices, design patterns, code quality, and career development for junior or peer engineers.
- **Develop scalable data/ML pipelines**: Build robust infrastructure for data collection, preprocessing, model training, and deployment at scale.
- **Implement and maintain CI/CD and coding best practices**: Ensure code quality, streamline release processes, and enforce testing discipline for AI/ML components.
- **Integrate and leverage LLMs/generative AI**: Incorporate large language models or generative methods into products and workflows.
- **Prototype and experiment**: Conduct R&D, proof-of-concepts, and pilot programs to explore emerging AI techniques and validate new product ideas.
- **Document and communicate findings**: Produce clear technical documentation, share results with stakeholders, and provide actionable insights for decision-making.

### Skills

- **Proficiency in Python**: Commonly required for AI/ML development, data manipulation, and scripting.
- **Experience with ML/DL frameworks**: Hands-on expertise in tools like PyTorch, TensorFlow, or JAX for building and training models.
- **Familiarity with cloud platforms**: Working knowledge of AWS, GCP, or Azure for deploying and scaling AI solutions.
- **Expertise in LLMs/generative AI**: Understanding of transformer architectures, prompt engineering, retrieval-augmented generation (RAG), and related libraries.
- **Strong software engineering fundamentals**: Solid grasp of algorithms, data structures, design patterns, and best practices for production code.
- **Knowledge of MLOps and CI/CD**: Experience with containerization (Docker, Kubernetes), version control (Git), and automated testing/monitoring.
- **Data processing and SQL**: Skills in handling large datasets, working with Spark or similar frameworks, and writing performant SQL queries.
- **Effective communication and collaboration**: Ability to translate complex technical concepts for non-technical stakeholders and work well in diverse teams.
- **Problem-solving and debugging**: Track record of diagnosing issues in production environments and implementing reliable fixes.
- **Continuous learning mindset**: Eagerness to stay on top of new AI research, frameworks, and technologies to innovate and improve solutions.

# Job titles

What are the most common job titles for these roles?

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for aggregating titles"
def aggregate_titles_by_role_level(df: pd.DataFrame, output_dir: str = 'role_titles') -> None:
    """
    Aggregate job titles for each role_title and level combination and save to JSON.

    Args:
        df: DataFrame containing 'role_title', 'level', and 'title' columns
        output_dir: Directory to save the JSON output
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    # Group by role_title and level
    grouped = df.groupby(['role_title', 'level'])
    for (role, level), group in grouped:
        # Get unique titles for this combination
        result = {
            'titles': group['title'].unique().tolist()
        }
        # Create safe filename from role and level
        safe_role = role.replace('/', '_').replace(' ', '_')
        safe_level = level.replace('/', '_').replace(' ', '_')
        filename = f"{safe_level}_{safe_role}.json"
        # Save to JSON file
        with open(os.path.join(output_dir, filename), 'w') as f:
            json.dump(result, f, indent=2)
        print(f"Saved titles for {level} {role}")
```

```{python}
aggregate_titles_by_role_level(df)
```
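Each output file holds the unique raw titles for one role/level group, as a JSON object of the form `{"titles": [...]}`. A quick peek at one of them (whichever exists first; a sketch, not part of the pipeline):

```{python}
# Preview the first few raw titles from one aggregated JSON file.
import json
from glob import glob

title_files = sorted(glob('role_titles/*.json'))
if title_files:
    with open(title_files[0], 'r') as f:
        print(json.load(f)['titles'][:5])
```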
I'll use `o1` again to generate a summary of the most common titles for each role/level combination. In the prompt, I omit my own role classification so as not to bias the results, and I ask it to "identify the most common titles listed below (ignoring slight variations), and then for those titles identify how often they occurred."

Here's a sample of what it said.

For Senior AI Research Scientist roles, `o1` observed the following titles to be common to this role:

- Research Scientist/Researcher (14 occurrences)
- Research Engineer (7 occurrences)
- Machine Learning Engineer (3 occurrences)
- Generative AI Engineer (2 occurrences)
- Software Engineer (Generative AI) (2 occurrences)

For Senior AI ML Engineer roles, `o1` observed the following titles to be common to this role:

- Senior Software Engineer (6 occurrences)
- Senior Machine Learning Engineer (5 occurrences)
- Staff Software Engineer (5 occurrences)
- Senior AI Engineer (4 occurrences)
- Machine Learning Researcher (3 occurrences)
- Staff Machine Learning Engineer (2 occurrences)
- Principal Software Engineer (2 occurrences)