A data-driven exploration of the GenAI/LLM job market for science and engineering roles in January 2025. I scrape ~1000 job postings from ai-jobs.net, perform data extraction and classification using LLMs, and then analyze the data to identify patterns and insights about the GenAI/LLM job market, including salary ranges, skill requirements, and role distribution.
As someone currently working in a “Prompt Engineering” role, I’ve been thinking a lot about how this title basically doesn’t exist outside of a handful of companies, how the title communicates a narrow range of skills and responsibilities, and how the work that I do day-to-day is much larger in scope than just writing prompts. I identify more as an AI Engineer or AI Research Scientist, so I was interested to see what I could learn about other similar roles that work with GenAI and LLMs.
So with that motivation, I set out to collect some data and look at the job market for these sorts of roles: what responsibilities and skills are being advertised most often, what kinds of titles are being used, and what kind of compensation is being offered at different levels of seniority?
To accomplish this, I built a custom web scraper to collect ~1000 job postings from ai-jobs.net. My code gathers job details (title, company, location, salary, and posting date) from a range of U.S. and Canadian cities, covering entry-level, mid-level, and senior positions. I then use a Large Language Model (LLM) to extract each job’s key responsibilities, required skills, and qualifications. Afterwards, I classify each position by whether it involves working with Generative AI or large language models, and if so, categorize it further into four major AI roles: (1) AI Research Scientist, (2) AI/ML Engineer, (3) MLOps/AI Infrastructure Engineer, and (4) AI Solution Architect. I believe this set of roles is a good representation of the different areas of focus.
Finally, I integrate the various metadata and classifications into a comprehensive dataset. I observe that GenAI/LLM positions command consistently high salary ranges across the four roles, particularly at more senior levels. Senior-level roles tend to offer median salaries in the $195K–$210K range, while mid-level roles generally cluster around $165K–$180K. Entry-level salaries show greater variation (likely due to the small sample size) but still land in competitive ranges of roughly $155K–$205K. These roles share many technical demands, like proficiency with large-scale model training, distributed computing, and LLM-specific knowledge, though each emphasizes distinct priorities (research vs. production, for example).
Of course, this analysis is not without limitations. I am relying on a single job board, and I scraped jobs during a limited window of time. I also have not rigorously validated the LLM classifications, although I have implemented many prompt engineering best practices and used some of the more powerful LLMs available (GPT-4o and o1). To some extent, the fact that the responsibilities and skills recovered from the postings fit naturally into the four pre-defined roles does speak to the relative accuracy of the classifications. The salary analysis also does not distinguish between base salary and total compensation, remote vs. in-person opportunities, large vs. small companies, and so on. But overall, I think this gives some sense of the job market for these roles.
Scraping job postings
I’ll scrape job postings from ai-jobs.net. This is an aggregator that specializes in AI jobs of all kinds, sourcing jobs from over 60 countries, and in my experience, it does a pretty good job.
The code below implements a web scraper for job postings from ai-jobs.net. It collects job postings from major cities in the US and Canada, searching across entry-level, mid-level, and senior positions. For each job, it gets the title, company, location, salary, description, and posting date. The scraper saves each job as a JSON file and keeps track of what it has already scraped to avoid duplicates. It includes error handling and logging to track any problems that occur during scraping.
Click to view the Job Posting Scraper code
from dataclasses import dataclass
from typing import List, Dict, Optional, Set
from datetime import datetime, timedelta
import logging
import json
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import re
import time


@dataclass
class JobData:
    """Structured container for job posting data"""
    title: str
    company: str
    location: str
    level: Optional[str]
    salary: Optional[str]
    url: str
    description: str
    scraped_date: str
    posted_date: Optional[str]
    raw_data: str


class ScrapeConfig:
    """Configuration settings for the scraper"""
    EXPERIENCE_LEVELS = ['EN', 'MI', 'SE']  # Entry, Mid, Senior
    CITIES = {
        '5391959': 'San Francisco',
        '5128581': 'New York City',
        '6167865': 'Toronto',
        '6173331': 'Vancouver',
        '5809844': 'Seattle',
        '4671654': 'Austin',
        '4930956': 'Boston',
        '5': 'Region',
    }
    CATEGORIES = {
        '1': 'Research',
        '2': 'Engineering',
        '18': 'GenerativeAI',
    }
    BASE_URL = "https://ai-jobs.net"


class SalaryExtractor:
    """Handles salary extraction logic"""

    @staticmethod
    def extract_from_schema(schema_data: Dict) -> Optional[str]:
        """Extract salary from schema.org data"""
        try:
            if schema_data and 'baseSalary' in schema_data:
                base_salary = schema_data.get('baseSalary', {}).get('value', {})
                if base_salary:
                    min_value = base_salary.get('minValue')
                    max_value = base_salary.get('maxValue')
                    currency = base_salary.get('currency', 'USD')  # Default to USD if not specified
                    # Note the space after the currency code; downstream parsing
                    # strips "USD " / "CAD " prefixes.
                    if min_value and max_value:
                        return f"{currency} {float(min_value)/1000:.0f}K - {float(max_value)/1000:.0f}K"
                    elif min_value:
                        return f"{currency} {float(min_value)/1000:.0f}K+"
        except Exception as e:
            logging.debug(f"Error extracting schema salary: {e}")
        return None

    @staticmethod
    def extract_from_text(description: str) -> Optional[str]:
        """Extract salary from description text"""
        patterns = [
            r'(?:salary range.*?)(?:CAD|\$)?([\d,]+)\s*-\s*(?:CAD|\$)?([\d,]+)',
            r'(?:salary.*?)(?:CAD|\$)?([\d,]+)(?:\s*-\s*(?:CAD|\$)?([\d,]+))?',
        ]
        for pattern in patterns:
            if match := re.search(pattern, description, re.IGNORECASE):
                try:
                    groups = match.groups()
                    # Check if salary is in CAD
                    currency = 'CAD' if 'CAD' in description.upper() else 'USD'
                    if len(groups) == 2 and groups[1]:  # Range format
                        min_sal = float(groups[0].replace(',', ''))
                        max_sal = float(groups[1].replace(',', ''))
                        return f"{currency} {min_sal/1000:.0f}K - {max_sal/1000:.0f}K"
                    elif groups[0]:  # Single value format
                        base_sal = float(groups[0].replace(',', ''))
                        return f"{currency} {base_sal/1000:.0f}K+"
                except ValueError:
                    continue
        return None

    @classmethod
    def extract(cls, soup: BeautifulSoup, description: str, schema_data: Dict) -> Optional[str]:
        """Main salary extraction method"""
        # Try schema data first
        if salary := cls.extract_from_schema(schema_data):
            return salary
        # Try description text
        if salary := cls.extract_from_text(description):
            return salary
        # Try salary badge
        if salary_badge := soup.find('span', class_='badge rounded-pill text-bg-success'):
            salary_text = salary_badge.text.strip()
            if re.search(r'(USD|\$|\d)', salary_text):
                return salary_text
        return None


class JobPageParser:
    """Handles parsing of individual job pages"""

    def __init__(self, html: str):
        self.soup = BeautifulSoup(html, 'html.parser')
        self.raw_html = html

    def parse_schema_data(self) -> Dict:
        """Parse schema.org JSON-LD data"""
        if script := self.soup.find('script', type='application/ld+json'):
            try:
                cleaned_script = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', script.string)
                return json.loads(cleaned_script)
            except Exception as e:
                logging.warning(f"Could not parse JSON-LD data: {e}")
        return {}

    def parse_job_data(self, url: str) -> JobData:
        """Extract all job data from the page"""
        schema_data = self.parse_schema_data()
        scraped_date = datetime.now().isoformat()

        # Get company name from schema or fall back to page element
        company = schema_data.get('hiringOrganization', {}).get('name')
        if not company:
            if company_elem := self.soup.find('a', class_='company-name'):
                company = company_elem.text.strip()

        # Get job level from meta description
        level = None
        if meta_desc := self.soup.find('meta', {'name': 'description'}):
            if level_match := re.search(r'a ([\w-]+level)', meta_desc['content']):
                level = level_match.group(1).replace('-', '-level / ').title()

        description = self.soup.find('div', class_='job-description-text').text.strip()
        return JobData(
            title=self.soup.find('h1', class_='display-5').text.strip(),
            company=company,
            location=self.soup.find('h3', class_='lead').text.strip(),
            level=level,
            salary=SalaryExtractor.extract(self.soup, description, schema_data),
            url=url,
            description=description,
            scraped_date=scraped_date,
            posted_date=self.calculate_posted_date(scraped_date),
            raw_data=self.raw_html,
        )

    def calculate_posted_date(self, scraped_date_str: str) -> Optional[str]:
        """Calculate posting date from 'Posted X time ago' text"""
        if match := re.search(r'Posted (\d+) (hours?|days?|weeks?|months?) ago', self.raw_html):
            number = int(match.group(1))
            unit = match.group(2)
            scraped_date = datetime.fromisoformat(scraped_date_str)
            delta = {
                'hour': timedelta(hours=number),
                'day': timedelta(days=number),
                'week': timedelta(weeks=number),
                'month': timedelta(days=number * 30),  # Approximate
            }.get(unit.rstrip('s'))
            if delta:
                return (scraped_date - delta).isoformat()
        return None


class AIJobScraper:
    """Main scraper orchestration class"""

    def __init__(self, output_dir: str = 'json_data'):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.session = requests.Session()
        self.config = ScrapeConfig()
        self.existing_jobs = self._load_existing_jobs()

    def _load_existing_jobs(self) -> Set[str]:
        """Load set of existing job URLs"""
        existing_jobs = set()
        for f in self.output_dir.glob('*.json'):
            try:
                data = json.loads(f.read_text())
                existing_jobs.add(data['url'])
            except Exception as e:
                logging.warning(f"Error reading {f}: {e}")
        return existing_jobs

    def _generate_search_urls(self) -> List[str]:
        """Generate all search URL combinations"""
        urls = []
        for exp in self.config.EXPERIENCE_LEVELS:
            for cat in self.config.CATEGORIES:
                for city in self.config.CITIES:
                    url = f"{self.config.BASE_URL}/?"
                    url += f"cat={cat}"
                    url += f"&{'reg' if city == '5' else 'cit'}={city}"
                    url += f"&typ=1&key=&exp={exp}&sal="
                    urls.append(url)
        return urls

    def get_job_urls(self, search_url: str) -> List[str]:
        """Get all job URLs from a search page"""
        try:
            response = self.session.get(search_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            if job_list := soup.find('ul', id='job-list'):
                return [
                    f"{self.config.BASE_URL}{link['href']}"
                    for link in job_list.find_all('a', href=lambda h: h and '/job/' in h)
                ]
        except Exception as e:
            logging.error(f"Error fetching job URLs: {e}")
        return []

    def scrape_job_details(self, url: str) -> Optional[JobData]:
        """Scrape details from a job page"""
        try:
            response = self.session.get(url)
            response.raise_for_status()
            parser = JobPageParser(response.text)
            return parser.parse_job_data(url)
        except Exception as e:
            logging.error(f"Error scraping job {url}: {e}")
            return None

    def save_job(self, job: JobData) -> None:
        """Save job data to JSON file"""
        try:
            safe_title = re.sub(r'[^\w\s-]', '', job.title)
            safe_company = re.sub(r'[^\w\s-]', '', job.company or 'unknown')
            filename = f"{safe_company}_{safe_title}_{hash(job.url)}.json"
            self.output_dir.joinpath(filename).write_text(
                json.dumps(vars(job), indent=2, ensure_ascii=False)
            )
            logging.info(f"Saved job: {job.title} at {job.company}")
        except Exception as e:
            logging.error(f"Error saving job: {e}")
            raise

    def scrape_jobs(self, test: bool = False) -> Dict[str, int]:
        """Main scraping method"""
        jobs_found = jobs_skipped = 0
        for search_url in self._generate_search_urls():
            logging.debug(f"Processing search URL: {search_url}")
            job_urls = self.get_job_urls(search_url)
            if test and job_urls:
                job_urls = job_urls[:1]
            for url in job_urls:
                if url in self.existing_jobs:
                    jobs_skipped += 1
                    continue
                if job_data := self.scrape_job_details(url):
                    self.save_job(job_data)
                    self.existing_jobs.add(url)
                    jobs_found += 1
                time.sleep(1)  # Rate limiting
        logging.info(f"Found {jobs_found} new jobs. Skipped {jobs_skipped} existing jobs.")
        return {'jobs_found': jobs_found, 'jobs_skipped': jobs_skipped}


def scrape_jobs(test: bool = False, verbose: bool = False) -> Dict[str, int]:
    """Convenience function to run the scraper"""
    if not verbose:
        logging.getLogger().setLevel(logging.WARNING)
    try:
        scraper = AIJobScraper(output_dir=Path.cwd() / 'json_data')
        return scraper.scrape_jobs(test=test)
    except Exception as e:
        logging.error(f"Scraper failed: {e}")
        raise
For the classification tasks, I’ll write a general-purpose LLM completion function. This function takes a prompt and a model as parameters, and returns the completion from the OpenAI API.
from openai import AsyncOpenAI
import os


async def get_completion(prompt: str, model: str = "gpt-4o-2024-08-06") -> str:
    """Get a completion from the OpenAI API."""
    client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY'))  # Initialize with API key from env vars
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
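Since Jupyter supports top-level await, a quick smoke test might look like this (the prompt here is just an illustrative placeholder):

# Minimal sanity check in a notebook cell; the prompt is hypothetical.
reply = await get_completion("Reply with the single word: ready")
print(reply)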
Extract Responsibilities, Skills, and Qualifications
The code below uses an LLM to extract key information from job postings that will be used to classify them into different role categories. It uses OpenAI’s GPT-4o model to analyze job descriptions and extract three key components: responsibilities, skills, and qualifications. The system processes multiple jobs concurrently with rate limiting, saves the extracted data as JSON files, and includes retry logic for handling API rate limits. The code uses asyncio for concurrent processing and includes error handling and logging. It also checks for previously processed jobs to avoid duplicate work.
Also included is code that implements data preprocessing steps. This code has functions to load JSON files into a pandas DataFrame, uses a list of keywords to filter out certain jobs that are not relevant to the analysis, and removes duplicate or highly similar job descriptions using TF-IDF vectorization and cosine similarity. The main function process_job_listings() combines these steps, taking a directory path, similarity threshold, and filter keywords as parameters. It returns a dictionary containing the processed DataFrame along with counts of the original and filtered entries.
Click to view the code for LLM extraction
import os
import json
import logging
import asyncio
import hashlib
import re
import nest_asyncio
from typing import Dict, List, Optional
import pandas as pd
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import AsyncOpenAI, RateLimitError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Enable nested event loops for Jupyter notebooks
nest_asyncio.apply()

# Set up logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

PROMPTS = {
    "prompt": """
You will be given a Job Description. Your task is to extract a list of Responsibilities, Skills, and Qualifications.
- Responsibilities are the tasks and activities that the job requires the employee to perform.
- Skills are the abilities and knowledge that the employee needs to have to perform the responsibilities.
- Qualifications are the requirements that the employee needs to meet to be considered for the job.

<JobDescription>
{job_description}
</JobDescription>

Return a list of Responsibilities, Skills, and Qualifications as follows:
<Responsibilities>
[Bullet point list of responsibilities]
</Responsibilities>
<Skills>
[Bullet point list of skills]
</Skills>
<Qualifications>
[Bullet point list of qualifications]
</Qualifications>
"""
}


class JobProcessor:
    def __init__(self, output_dir='json_extracted_data'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        self.semaphore = asyncio.Semaphore(3)
        self.processed = 0
        self.skipped = 0

    def _parse_llm_response(self, response: str) -> Dict[str, List[str]]:
        """Parse LLM response into structured data."""
        result = {
            "responsibilities": [],
            "skills": [],
            "qualifications": []
        }
        sections = {
            "responsibilities": r'<Responsibilities>\n(.*?)\n</Responsibilities>',
            "skills": r'<Skills>\n(.*?)\n</Skills>',
            "qualifications": r'<Qualifications>\n(.*?)\n</Qualifications>'
        }
        for section, pattern in sections.items():
            if match := re.search(pattern, response, re.DOTALL):
                result[section] = [
                    item.strip('- ') for item in match.group(1).strip().split('\n')
                    if item.strip('- ')  # Filter out empty items
                ]
        return result

    async def extract_job_details(self, job_id: str, job_description: str) -> Dict[str, List[str]]:
        """Extract structured information from job description using LLM."""
        try:
            prompt = PROMPTS["prompt"].format(job_description=job_description)
            response = await get_completion(prompt)
            result = self._parse_llm_response(response)
            # Save results
            json_path = os.path.join(self.output_dir, f'{job_id}.json')
            with open(json_path, 'w', encoding='utf-8') as f:
                json.dump(result, f, indent=2, ensure_ascii=False)
            return result
        except Exception as e:
            logging.error(f"Error extracting job details: {str(e)}")
            return {"responsibilities": [], "skills": [], "qualifications": []}

    async def process_single_job(self, job_id: str, description: str) -> Dict[str, List[str]]:
        """Process a single job with caching."""
        json_path = os.path.join(self.output_dir, f'{job_id}.json')
        if os.path.exists(json_path):
            self.skipped += 1
            logging.debug(f"Skipping job ID {job_id} - already processed")
            with open(json_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        async with self.semaphore:
            self.processed += 1
            logging.info(f"Processing job ID: {job_id}")
            return await self.extract_job_details(job_id, description)

    async def process_jobs(self, df: pd.DataFrame) -> List[Dict[str, List[str]]]:
        """Process multiple jobs concurrently."""
        tasks = [
            self.process_single_job(str(row['id']), row['description'])
            for _, row in df.iterrows()
        ]
        results = await asyncio.gather(*tasks)
        print(f"\nProcessing complete:")
        print(f"- New jobs processed: {self.processed}")
        print(f"- Skipped jobs (already processed): {self.skipped}")
        return results


class DataPreprocessor:
    def __init__(self, similarity_threshold=0.8):
        self.similarity_threshold = similarity_threshold
        self.default_filter_keywords = [
            'Dir ', 'Director', 'Intern', 'Data Scientist', 'Data Science',
            'Content Writer', 'Faculty', 'Product Owner', 'Manager',
            'Analyst', 'Postdoctoral', 'Postdoc', 'Summer'
        ]

    @staticmethod
    def load_json_files(directory='json_data') -> pd.DataFrame:
        """Load JSON files into DataFrame with error handling."""
        df = pd.DataFrame()
        if not os.path.exists(directory):
            logging.error(f"Directory {directory} does not exist")
            return df
        for file in [f for f in os.listdir(directory) if f.endswith('.json')]:
            try:
                with open(os.path.join(directory, file), 'r') as f:
                    data = json.load(f)
                df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)
            except Exception as e:
                logging.error(f"Error loading {file}: {e}")
        if not df.empty:
            df['id'] = df['url'].apply(
                lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
            )
        return df

    def filter_by_title(self, df: pd.DataFrame, filter_keywords: Optional[List[str]] = None) -> pd.DataFrame:
        """Filter DataFrame by job titles."""
        if df.empty:
            return df
        keywords = filter_keywords or self.default_filter_keywords
        return df[~df['title'].str.contains('|'.join(keywords), case=False, na=False)]

    def remove_similar_descriptions(self, df: pd.DataFrame) -> pd.DataFrame:
        """Remove similar job descriptions using TF-IDF and cosine similarity."""
        if df.empty:
            return df
        tfidf = TfidfVectorizer(stop_words='english')
        tfidf_matrix = tfidf.fit_transform(df['description'].fillna(''))
        cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
        # Positions (not index labels) of rows that duplicate an earlier row
        indices_to_remove = {
            j for i in range(len(df))
            for j in range(i + 1, len(df))
            if cosine_sim[i][j] > self.similarity_threshold
        }
        # Filter positionally, since the similarity matrix is position-based
        keep_positions = [i for i in range(len(df)) if i not in indices_to_remove]
        return df.iloc[keep_positions]


def extract_jobs_from_dataframe(df: pd.DataFrame) -> List[Dict[str, List[str]]]:
    """Wrapper function to process jobs from a DataFrame."""
    processor = JobProcessor()
    loop = asyncio.get_event_loop()
    if loop.is_running():
        return loop.create_task(processor.process_jobs(df))
    else:
        return asyncio.run(processor.process_jobs(df))


def process_job_listings(directory='json_data', similarity_threshold=0.8, filter_keywords=None) -> Dict:
    """Main function to process job listings."""
    preprocessor = DataPreprocessor(similarity_threshold)

    # Load and preprocess data
    df = preprocessor.load_json_files(directory)
    initial_count = len(df)
    if df.empty:
        return {
            'processed_data': df,
            'original_count': 0,
            'filtered_count': 0
        }
    df = preprocessor.filter_by_title(df, filter_keywords)
    df_filtered = preprocessor.remove_similar_descriptions(df)
    return {
        'processed_data': df_filtered,
        'original_count': initial_count,
        'filtered_count': len(df_filtered)
    }
results = process_job_listings()
df = results['processed_data']
print(f'Original count of listings: {results["original_count"]}')
print(f'Filtered count of listings: {results["filtered_count"]}')
extract_jobs_from_dataframe(df)
Original count of listings: 999
Filtered count of listings: 680
<Task pending name='Task-1' coro=<JobProcessor.process_jobs() running at /tmp/ipykernel_1343524/2751672584.py:118>>
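The pending-task output above is expected in a notebook: because the event loop is already running, extract_jobs_from_dataframe() returns a scheduled asyncio.Task rather than blocking. To make the cell wait until extraction finishes, the returned task can be awaited; re-running is safe, since already-processed jobs are cached:

# Await the scheduled task so extraction completes before moving on
# (top-level await works in Jupyter).
task = extract_jobs_from_dataframe(df)
extracted = await task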
Classification
Classify jobs as relevant to GenAI/LLM work or not
Click to view the code for job classification as GenAI/LLM relevant or not
import os
import re
import json
import asyncio
import logging
import nest_asyncio
from typing import List, Dict, Set
from bs4 import BeautifulSoup
import pandas as pd
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import AsyncOpenAI, RateLimitError


class JobClassifierGenAI:
    """Handles classification of jobs for GenAI/LLM work"""

    PROMPT = """
You will be given a list of Responsibilities and Skills listed for a job.
Your task is to determine if the job involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)).

<Job>
<Responsibilities>
{responsibilities}
</Responsibilities>
<Skills>
{skills}
</Skills>
</Job>

Start by thinking step-by-step about the Job and its Responsibilities and Skills, and whether it involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)).

Return your response in the following format:
<Analysis>
[Your analysis of the job and its Responsibilities and Skills]
</Analysis>
<FinalAnswer>
true|false
</FinalAnswer>
"""

    def __init__(self, output_dir='role_genai_classifications', batch_size=3):
        self.output_dir = output_dir
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(batch_size)
        os.makedirs(output_dir, exist_ok=True)

    async def classify_job(self, job: Dict) -> Dict:
        """Classify a single job listing"""
        logging.info(f"Classifying job '{job['filename']}'")
        prompt = self.PROMPT.format(
            responsibilities=job['responsibilities'],
            skills=job.get('skills', '')
        )
        try:
            response = await get_completion(prompt)
            await asyncio.sleep(2)  # Rate limiting
        except Exception as e:
            logging.error(f"Failed to classify job '{job['filename']}': {str(e)}")
            return {
                **job,
                'analysis': f'Failed to process: {str(e)}',
                'is_genai_role': None
            }
        return self._parse_response(job, response)

    def _parse_response(self, job: Dict, response: str) -> Dict:
        """Parse LLM response into structured format"""
        soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml')
        analysis = soup.find('Analysis')
        final_answer = soup.find('FinalAnswer')
        is_genai_role = None
        if final_answer:
            answer_text = final_answer.text.strip().lower()
            is_genai_role = True if answer_text == 'true' else False if answer_text == 'false' else None
        return {
            **job,
            'analysis': analysis.text.strip() if analysis else '',
            'is_genai_role': is_genai_role
        }

    def save_classification(self, job_id: str, result: Dict) -> None:
        """Save classification results to file"""
        filename = os.path.join(self.output_dir, f"{job_id}.json")
        with open(filename, 'w') as f:
            json.dump(result, f, indent=2)
        logging.info(f"Saved classification for job {job_id}")

    def get_classified_jobs(self) -> Set[str]:
        """Get set of already classified job IDs"""
        if not os.path.exists(self.output_dir):
            return set()
        return {f[:-5] for f in os.listdir(self.output_dir) if f.endswith('.json')}

    async def process_jobs_batch(self, jobs: List[Dict]) -> None:
        """Process a batch of jobs concurrently"""
        async def process_with_semaphore(job: Dict) -> None:
            async with self.semaphore:
                result = await self.classify_job(job)
                job_id = str(result['filename'])
                self.save_classification(job_id, result)
        await asyncio.gather(*[process_with_semaphore(job) for job in jobs])

    async def classify_jobs_async(self, df: pd.DataFrame) -> None:
        """Process all unclassified jobs in the DataFrame"""
        total_jobs = len(df)
        logging.info(f"Starting classification of {total_jobs} jobs")
        classified_jobs = self.get_classified_jobs()
        jobs_to_process = [
            job.to_dict() for idx, job in df.iterrows()
            if str(job['filename']) not in classified_jobs
        ]
        await self.process_jobs_batch(jobs_to_process)
        logging.info(f"Completed classification of all {total_jobs} jobs")

    def classify_jobs(self, df: pd.DataFrame) -> None:
        """Main entry point for job classification"""
        if df.empty:
            logging.warning("Empty DataFrame provided")
            return
        if 'filename' not in df.columns:
            logging.error("DataFrame missing required 'filename' column")
            return
        classified_jobs = self.get_classified_jobs()
        logging.info(f"Found {len(classified_jobs)} previously classified jobs")
        new_jobs = df[~df['filename'].isin(classified_jobs)]
        if new_jobs.empty:
            logging.info("No new jobs to classify")
            return
        logging.info(f"Processing {len(new_jobs)} new jobs")
        logging.info(f"Skipping {len(df) - len(new_jobs)} existing jobs")
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.classify_jobs_async(new_jobs))


class JobDataLoader:
    """Handles loading and preprocessing of job data"""

    @staticmethod
    def read_json_files(json_dir='json_extracted_data') -> List[Dict]:
        """Read job data from JSON files"""
        result = []
        for filename in os.listdir(json_dir):
            if filename.endswith('.json'):
                file_path = os.path.join(json_dir, filename)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                    name = filename[:-5]
                    if 'responsibilities' in data and 'skills' in data:
                        result.append({
                            'filename': name,
                            'responsibilities': data['responsibilities'],
                            'skills': data['skills']
                        })
                except json.JSONDecodeError:
                    logging.error(f"Invalid JSON in {filename}")
                except Exception as e:
                    logging.error(f"Error processing {filename}: {str(e)}")
        return result

    @staticmethod
    def load_classifications(input_dir='role_genai_classifications') -> pd.DataFrame:
        """Load classification results into DataFrame"""
        if not os.path.exists(input_dir):
            return pd.DataFrame()
        all_results = []
        for filename in os.listdir(input_dir):
            if filename.endswith('.json'):
                with open(os.path.join(input_dir, filename), 'r') as f:
                    classification = json.load(f)
                all_results.append(classification)
        return pd.DataFrame(all_results)
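A minimal usage sketch, assuming the extracted files from the previous step are in json_extracted_data/:

# Load the extracted responsibilities/skills and classify anything not yet
# processed; results are cached as JSON in role_genai_classifications/.
jobs_df = pd.DataFrame(JobDataLoader.read_json_files('json_extracted_data'))
JobClassifierGenAI().classify_jobs(jobs_df)

# Inspect the resulting labels
classifications = JobDataLoader.load_classifications()
print(classifications['is_genai_role'].value_counts(dropna=False))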
Classify GenAI/LLM jobs into pre-defined categories
For this next classification task, I’ll make the assumption that there are four types of AI engineering and science roles that are relevant to work with GenAI systems. These are:
AI Research Scientist
AI/ML Engineer
MLOps / AI Infrastructure Engineer
AI Solution Architect
I’ll also include other categories that are not themselves of interest, but that may improve classification accuracy by giving the model realistic alternatives to choose from. These are:
Data Scientist
Data Engineer
Product Manager
Software Engineer
The code below implements an automated job classification system that uses an LLM to categorize job postings into the eight predefined roles listed above (four GenAI-focused and four related roles). It consists of several classes that work together: JobClassifier handles the core classification logic by comparing job descriptions against detailed role templates, JobData and ClassificationResult provide structured data containers, and JobProcessor manages the overall pipeline from loading jobs to saving results. The system processes jobs concurrently using asyncio, includes error handling and rate limiting, and outputs both an analysis explaining the classification and a final numerical category for each job (0 when no role fits, 1-8 otherwise), with all results saved as JSON files for further analysis.
Definitions of the roles can be found in the JOB_DESCRIPTIONS variable.
Click to view the code for job classification into pre-defined roles
from dataclasses import dataclass
from typing import List, Dict, Set, Optional
import logging
import json
import asyncio
import os
from bs4 import BeautifulSoup
import pandas as pd
import nest_asyncio

JOB_DESCRIPTIONS = """
<Option title="AI Research Scientist" number="1">
  <PrimaryFocus>
    Investigate and adapt cutting-edge AI methodologies (e.g., generative models, advanced prompt engineering) for applications.
  </PrimaryFocus>
  <KeyResponsibilities>
    Conduct experiments to evaluate the performance (e.g., quality, accuracy) of new AI approaches and refine existing models.
    Collaborate with AI/ML Engineers to transition successful prototypes into production.
    Stay current with the latest AI research and emerging trends in generative AI.
    Develop human-annotated datasets for training and evaluation of AI models.
  </KeyResponsibilities>
  <SkillsAndTools>
    Deep understanding of LLMs and prompt engineering.
    Strong background in statistics, optimization, or related fields.
    Knowledge of experimental methods (e.g., A/B testing) and hypothesis testing.
    Knowledge of LLM evaluation methods, including algorithmic evals, human evals, or LLM-as-a-judge evals.
  </SkillsAndTools>
</Option>
<Option title="AI/ML Engineer" number="2">
  <PrimaryFocus>
    Transform research output into robust, scalable AI solutions for the product or internal use.
  </PrimaryFocus>
  <KeyResponsibilities>
    Productionize AI models, ensuring they meet performance and reliability requirements.
    Develop and maintain data pipelines for model training, inference, and monitoring.
    Collaborate closely with Research Scientists to optimize and refine model implementations.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, Go, or similar languages.
    Experience with API development and integration (REST, GraphQL).
    Working knowledge of software engineering best practices (version control, testing, CI/CD).
  </SkillsAndTools>
</Option>
<Option title="MLOps / AI Infrastructure Engineer" number="3">
  <PrimaryFocus>
    Ensure reliable deployment, scaling, and monitoring of AI systems in production.
  </PrimaryFocus>
  <KeyResponsibilities>
    Set up CI/CD pipelines tailored for AI workflows, including model versioning and data governance.
    Monitor production models for performance, latency, and data drift, implementing necessary updates.
    Manage infrastructure for scalable AI deployments (Docker, Kubernetes, cloud services).
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong DevOps background, with tools like Docker, Kubernetes, and Terraform.
    Familiarity with ML orchestration/monitoring tools (MLflow, Airflow, Prometheus).
    Experience optimizing compute usage (GPU/TPU) for cost-effective scaling.
  </SkillsAndTools>
</Option>
<Option title="AI Solution Architect" number="4">
  <PrimaryFocus>
    Design and orchestrate AI solutions leveraging generative models and LLM technologies to create impactful experiences and solutions that align with business objectives.
  </PrimaryFocus>
  <KeyResponsibilities>
    Collaborate with subject matter experts (SMEs) to identify and refine opportunities for generative AI/LLM-based use cases.
    Assess feasibility and define high-level solution architectures, ensuring they address core business and user requirements.
    Develop technical proposals and roadmaps, translating complex requirements into actionable plans.
    Provide thought leadership on conversational design, user experience flow, and model interaction strategies.
    Ensure solutions comply with relevant data governance, privacy, and security considerations.
    Facilitate cross-functional collaboration, guiding teams through solution conceptualization and implementation phases.
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong understanding of LLM capabilities and prompt engineering principles.
    Experience with conversational experience design (e.g., chatbots, voice interfaces) and user journey mapping.
    Ability to analyze business needs and translate them into feasible AI solution proposals.
    Familiarity with data privacy and security best practices, especially as they pertain to AI solutions.
    Excellent communication and stakeholder management skills to align technical and non-technical teams.
  </SkillsAndTools>
</Option>
<Option title="Data Scientist" number="5">
  <PrimaryFocus>
    Leverage statistical analysis, machine learning, and data visualization to derive actionable insights and guide data-informed decisions.
  </PrimaryFocus>
  <KeyResponsibilities>
    Perform exploratory data analysis (EDA) to identify trends and patterns in large, complex datasets.
    Develop and validate predictive and prescriptive models, collaborating with cross-functional teams to implement these solutions.
    Design and execute experiments to test hypotheses, measure impact, and inform business strategies.
    Present findings and recommendations to stakeholders in a clear, concise manner using visualizations and dashboards.
    Work with data engineers to ensure data quality, governance, and availability.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, R, or SQL for data manipulation and analysis.
    Experience with common ML libraries (e.g., scikit-learn, XGBoost) and deep learning frameworks (e.g., PyTorch, TensorFlow).
    Solid grounding in statistics, probability, and experimental design.
    Familiarity with data visualization tools (e.g., Tableau, Power BI) for communicating insights.
    Strong analytical thinking and ability to translate complex data problems into business solutions.
  </SkillsAndTools>
</Option>
<Option title="Data Engineer" number="6">
  <PrimaryFocus>
    Design, build, and maintain scalable data pipelines and architectures that enable efficient data collection, storage, and analysis.
  </PrimaryFocus>
  <KeyResponsibilities>
    Develop and optimize data ingestion and transformation processes (ETL/ELT), ensuring high performance and reliability.
    Implement and manage data workflows, integrating internal and external data sources.
    Collaborate with Data Scientists, AI/ML Engineers, and other stakeholders to ensure data readiness for analytics and model training.
    Monitor data pipelines for performance, reliability, and cost-effectiveness, taking corrective actions when needed.
    Maintain data quality and governance standards, including metadata management and data cataloging.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in Python, SQL, and distributed data processing frameworks (e.g., Spark, Kafka).
    Experience with cloud-based data ecosystems (AWS, GCP, or Azure), and related storage/processing services (e.g., S3, BigQuery, Dataflow).
    Familiarity with infrastructure-as-code and DevOps tools (Terraform, Docker, Kubernetes) for automating data platform deployments.
    Strong understanding of database systems (relational, NoSQL) and data modeling principles.
    Knowledge of data orchestration and workflow management tools (Airflow, Luigi, Dagster).
  </SkillsAndTools>
</Option>
<Option title="Product Manager" number="7">
  <PrimaryFocus>
    Drive the product vision and strategy, ensuring alignment with business goals and user needs while delivering impactful AI-driven solutions.
  </PrimaryFocus>
  <KeyResponsibilities>
    Conduct user and market research to identify opportunities, define product requirements, and set success metrics.
    Collaborate with cross-functional teams (Engineering, Data Science, Design) to prioritize features and plan releases.
    Develop and communicate product roadmaps, ensuring stakeholders are aligned on goals and timelines.
    Monitor product performance through data analysis and user feedback, iterating on improvements and new feature ideas.
    Facilitate agile development practices, writing clear user stories and acceptance criteria.
  </KeyResponsibilities>
  <SkillsAndTools>
    Strong understanding of product lifecycle management and agile methodologies (Scrum/Kanban).
    Excellent communication, negotiation, and stakeholder management skills.
    Experience with product management and collaboration tools (e.g., Jira, Confluence, Trello).
    Analytical mindset for leveraging metrics, A/B testing, and user feedback in decision-making.
    Familiarity with AI/ML concepts and the ability to translate technical possibilities into viable product features.
  </SkillsAndTools>
</Option>
<Option title="Software Engineer" number="8">
  <PrimaryFocus>
    Design, develop, and maintain high-quality software applications and services that address user needs and align with overall business objectives.
  </PrimaryFocus>
  <KeyResponsibilities>
    Collaborate with cross-functional teams (Product, Design, QA) to interpret requirements and deliver robust solutions.
    Write clean, efficient, and testable code following best practices and coding standards.
    Participate in system architecture and design discussions, contributing to the evolution of technical roadmaps.
    Perform code reviews and provide constructive feedback to peers, maintaining a high bar for code quality.
    Implement and maintain CI/CD pipelines to streamline deployment and reduce manual interventions.
    Continuously improve system performance and scalability through profiling and optimization.
  </KeyResponsibilities>
  <SkillsAndTools>
    Proficiency in one or more programming languages (e.g., Java, Python, JavaScript, C++).
    Experience with modern frameworks/libraries (e.g., Spring Boot, Node.js, React, Django).
    Solid understanding of software design principles (e.g., SOLID, DRY) and architectural patterns (e.g., microservices).
    Familiarity with version control systems (Git), testing frameworks, and agile methodologies.
    Working knowledge of containerization (Docker), orchestration (Kubernetes), and cloud platforms (AWS, Azure, GCP).
  </SkillsAndTools>
</Option>
"""

PROMPTS = {
    "prompt": """
You will be given a list of Responsibilities and Skills listed for a job.
Your task is to determine if the job is a good fit with any of the Options, and if so, which one.

<Job>
<Responsibilities>
{responsibilities}
</Responsibilities>
<Skills>
{skills}
</Skills>
</Job>

<Options>
{Options}
</Options>

Start by thinking step-by-step about the Job and its Responsibilities and Skills, in relation to each of the Options.
Decide if the Job is a good fit with ANY of the Options.
If NONE of the Options are relevant to the Job, say so and return a 0 as your FinalAnswer.
Otherwise, decide which of the Options is the most similar to the Job and return its number as your FinalAnswer.

Return your response in the following format:
<Analysis>
[Your analysis of the job and its Responsibilities and Skills, in relation to each of the Options]
</Analysis>
<FinalAnswer>
0|1|2|3|4|5|6|7|8
</FinalAnswer>
"""
}

# Enable nested event loops
nest_asyncio.apply()


@dataclass
class JobData:
    """Represents a job posting with extracted information."""
    filename: str
    responsibilities: List[str]
    skills: List[str]


@dataclass
class ClassificationResult:
    """Represents the result of a job classification."""
    filename: str
    responsibilities: List[str]
    skills: List[str]
    analysis: str
    role_classification: Optional[int]
    role_title: Optional[str]


class JobClassifier:
    """Handles classification of jobs into predefined roles."""

    def __init__(self, output_dir: str = 'role_classifications', batch_size: int = 3):
        self.output_dir = output_dir
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(batch_size)
        os.makedirs(output_dir, exist_ok=True)

    async def classify_job(self, job: JobData) -> ClassificationResult:
        """Classify a single job listing."""
        logging.info(f"Classifying job '{job.filename}'")
        prompt = PROMPTS["prompt"].format(
            responsibilities=job.responsibilities,  # Access as attribute instead of dict
            skills=job.skills,
            Options=JOB_DESCRIPTIONS
        )
        try:
            response = await get_completion(prompt)
            await asyncio.sleep(5)  # Rate limiting
            return self._parse_response(job, response)
        except Exception as e:
            logging.error(f"Failed to classify job '{job.filename}': {str(e)}")
            return ClassificationResult(
                filename=job.filename,
                responsibilities=job.responsibilities,
                skills=job.skills,
                analysis=f'Failed to process: {str(e)}',
                role_classification=None,
                role_title=None
            )

    def _parse_response(self, job: JobData, response: str) -> ClassificationResult:
        """Parse LLM response into structured format."""
        soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml')
        analysis = soup.find('Analysis')
        role_choice = soup.find('FinalAnswer')
        role_number = int(role_choice.text.strip()) if role_choice else None
        role_title = self._get_role_title(role_number)
        return ClassificationResult(
            filename=job.filename,
            responsibilities=job.responsibilities,
            skills=job.skills,
            analysis=analysis.text.strip() if analysis else '',
            role_classification=role_number,
            role_title=role_title
        )

    def _get_role_title(self, role_number: Optional[int]) -> Optional[str]:
        """Get the title for a role number."""
        if role_number is None:
            return None
        if role_number == 0:
            return "Other"
        wrapped_xml = f"<root>{JOB_DESCRIPTIONS}</root>"
        job_descriptions_soup = BeautifulSoup(wrapped_xml, 'lxml-xml')
        matching_job = job_descriptions_soup.find('Option', {'number': str(role_number)})
        if matching_job:
            return matching_job['title']
        logging.error(f"No matching job found for role number {role_number}")
        return None


class JobProcessor:
    """Handles the processing of job data files."""

    def __init__(self, input_dir: str = 'json_extracted_data'):
        self.input_dir = input_dir
        self.classifier = JobClassifier()

    def load_jobs(self) -> List[JobData]:
        """Load jobs from JSON files."""
        jobs = []
        # Get list of all JSON files
        total_files = len([f for f in os.listdir(self.input_dir) if f.endswith('.json')])
        # Get files that were classified as GenAI-relevant
        classified_files = self._get_classified_files()
        # Get files that have already been processed
        processed_files = {
            f[:-5] for f in os.listdir(self.classifier.output_dir) if f.endswith('.json')
        }
        # Get relevant files that haven't been processed yet
        files_to_process = classified_files - processed_files

        logging.info(f"Found {total_files} total files")
        logging.info(f"Found {len(classified_files)} relevant GenAI files")
        logging.info(f"Already processed: {len(processed_files)} files")
        logging.info(f"Remaining to process: {len(files_to_process)} files")

        # Only process files that are both relevant and unprocessed
        for filename in os.listdir(self.input_dir):
            if not filename.endswith('.json'):
                continue
            name = filename[:-5]
            if name not in files_to_process:
                continue
            try:
                job = self._load_job_file(filename)
                if job:
                    jobs.append(job)
            except Exception as e:
                logging.error(f"Error processing {filename}: {str(e)}")
        logging.info(f"Processing {len(jobs)} remaining jobs")
        return jobs

    def _load_job_file(self, filename: str) -> Optional[JobData]:
        """Load and parse a single job file."""
        file_path = os.path.join(self.input_dir, filename)
        try:
            with open(file_path, 'r') as f:
                data = json.load(f)
            if 'responsibilities' in data and 'skills' in data:
                return JobData(
                    filename=filename[:-5],
                    responsibilities=data['responsibilities'],
                    skills=data['skills']
                )
        except json.JSONDecodeError:
            logging.error(f"Invalid JSON in {filename}")
        return None

    def _get_classified_files(self) -> Set[str]:
        """Get set of files that have been previously classified as GenAI-related."""
        genai_dir = 'role_genai_classifications'
        if not os.path.exists(genai_dir):
            return set()
        genai_files = set()
        for f in os.listdir(genai_dir):
            if f.endswith('.json'):
                try:
                    with open(os.path.join(genai_dir, f), 'r') as file:
                        data = json.load(file)
                    if data.get('is_genai_role') is True:  # Explicitly check for True
                        genai_files.add(f[:-5])
                except Exception as e:
                    logging.error(f"Error reading {f}: {e}")
        return genai_files

    async def process_jobs(self) -> None:
        """Process all jobs."""
        jobs = self.load_jobs()
        if not jobs:
            return
        logging.info(f"Processing {len(jobs)} relevant jobs")
        await self._process_jobs_batch(jobs)
        logging.info("Job classification complete")

    async def _process_jobs_batch(self, jobs: List[JobData]) -> None:
        """Process a batch of jobs concurrently."""
        async def process_with_semaphore(job: JobData) -> None:
            async with self.classifier.semaphore:
                result = await self.classifier.classify_job(job)
                self._save_result(result)
        await asyncio.gather(*[process_with_semaphore(job) for job in jobs])

    def _save_result(self, result: ClassificationResult) -> None:
        """Save classification result to file."""
        filename = os.path.join(self.classifier.output_dir, f"{result.filename}.json")
        with open(filename, 'w') as f:
            json.dump(vars(result), f, indent=2)
        logging.info(f"Saved classification for job {result.filename}")


def classify_job_roles(df: pd.DataFrame) -> None:
    """Main entry point for job classification."""
    if df.empty:
        logging.warning("Empty DataFrame provided")
        return
    processor = JobProcessor()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(processor.process_jobs())
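The log output below presumably came from running the entry point over the filtered listings DataFrame from earlier:

# Kick off role classification; progress is written to the logs and
# results are cached under role_classifications/.
classify_job_roles(df)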
2025-01-24 13:50:53,982 - INFO - Found 739 total files
2025-01-24 13:50:53,982 - INFO - Found 247 relevant GenAI files
2025-01-24 13:50:53,983 - INFO - Already processed: 247 files
2025-01-24 13:50:53,983 - INFO - Remaining to process: 0 files
2025-01-24 13:50:53,984 - INFO - Processing 0 remaining jobs
Load final data
Now I’ll load and combine all of the different data outputs into a single dataframe.
The code block below uses a DataLoader class with helper classes DataPaths and SalaryProcessor to handle the data processing pipeline. The code merges job posting data with role classifications, processes salary information by converting CAD to USD and extracting salary ranges, and filters the results to only include the four specific AI-related roles of interest: AI Research Scientist, AI Solution Architect, AI/ML Engineer, and MLOps/AI Infrastructure Engineer. The code includes error handling for JSON file loading and salary processing. The final output is a pandas DataFrame containing the merged and processed data, which includes job details, role classifications, and standardized salary information.
Click to view the code for loading and merging all data sources
from typing import List, Tuple, Optional, Dict
import pandas as pd
import json
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataPaths:
    """Configuration for data file paths"""
    ROLE_CLASSIFICATIONS_DIR: Path = Path('role_classifications')
    JSON_DATA_DIR: Path = Path('json_data')


class DataLoader:
    def __init__(self, paths: DataPaths = DataPaths()):
        self.paths = paths
        self.salary_processor = SalaryProcessor()

    def load_and_merge_data(self) -> pd.DataFrame:
        """Load and merge all data sources"""
        # Load and merge base data
        merged_df = self._load_base_data()

        # Process salaries
        salary_data = self.salary_processor.extract_salary_ranges(merged_df['salary'])
        merged_df = pd.concat([merged_df, salary_data], axis=1)

        # Filter out any jobs not in this list
        merged_df = merged_df[merged_df['role_title'].isin([
            'AI Research Scientist',
            'AI Solution Architect',
            'AI/ML Engineer',
            'MLOps / AI Infrastructure Engineer'
        ])]
        print(f"Shape of merged DataFrame: {merged_df.shape}")
        return merged_df

    def _load_base_data(self) -> pd.DataFrame:
        """Load and merge base data sources"""
        # Load role classifications
        role_data = pd.DataFrame(self._load_json_files(self.paths.ROLE_CLASSIFICATIONS_DIR))

        # Load and process jobs data
        jobs_data = pd.DataFrame(self._load_json_files(self.paths.JSON_DATA_DIR))
        jobs_data['filename'] = jobs_data['url'].apply(
            lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
        )

        # Merge dataframes
        merged_df = pd.merge(jobs_data, role_data, on='filename', how='left')
        return merged_df

    def _load_json_files(self, directory: Path) -> List[Dict]:
        """Load JSON files from directory"""
        json_files = []
        for file_path in directory.glob('*.json'):
            try:
                with open(file_path, 'r') as f:
                    json_files.append(json.load(f))
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
        return json_files


class SalaryProcessor:
    """Handles salary-related processing"""

    # Approximate rate at the time of scraping: 1 USD = 1.44 CAD, so
    # multiply CAD amounts by 1/1.44 to express them in USD.
    CAD_TO_USD_RATE = 1 / 1.44

    def extract_salary_ranges(self, salary_series: pd.Series) -> pd.DataFrame:
        """Extract salary ranges from a series of salary strings"""
        salary_data = salary_series.apply(self._extract_salary_range)
        return pd.DataFrame({
            'min_salary': salary_data.apply(lambda x: x[0]),
            'max_salary': salary_data.apply(lambda x: x[1]),
            'mid_salary': salary_data.apply(
                lambda x: (x[0] + x[1]) / 2 if x[0] and x[1] else None
            )
        })

    def _extract_salary_range(self, salary_str: str) -> Tuple[Optional[float], Optional[float]]:
        """Extract minimum and maximum salary from salary string"""
        try:
            if not isinstance(salary_str, str) or '0+' in salary_str:
                return None, None

            # Determine currency and convert if needed
            is_cad = 'CAD' in salary_str
            nums = (salary_str.replace('CAD ', '')
                    .replace('USD ', '')
                    .replace('K', '')
                    .split(' - '))
            if len(nums) != 2:
                return None, None

            # Convert to float and apply CAD conversion if needed
            min_salary = float(nums[0])
            max_salary = float(nums[1])
            if is_cad:
                min_salary *= self.CAD_TO_USD_RATE
                max_salary *= self.CAD_TO_USD_RATE
            return min_salary, max_salary
        except Exception:
            return None, None
# Initialize the loader and load the data
# Check if the csv file exists
if not os.path.exists('genai_job_data.csv'):
    loader = DataLoader()
    df = loader.load_and_merge_data()
    df.to_csv('genai_job_data.csv', index=False)
else:
    df = pd.read_csv('genai_job_data.csv')

# Now df contains all the merged data with processed salaries
print(f"Total jobs loaded: {len(df)}")

# Example operations you can do with the DataFrame:
# View available columns
print("\nAvailable columns:")
print(df.columns.tolist())

# View distribution of roles
print("\nRole distribution:")
print(df['role_title'].value_counts())
Total jobs loaded: 187
Available columns:
['title', 'company', 'location', 'level', 'salary', 'url', 'description', 'scraped_date', 'posted_date', 'raw_data', 'filename', 'responsibilities', 'skills', 'analysis', 'role_classification', 'role_title', 'min_salary', 'max_salary', 'mid_salary']
Role distribution:
role_title
AI/ML Engineer 60
AI Research Scientist 52
AI Solution Architect 48
MLOps / AI Infrastructure Engineer 27
Name: count, dtype: int64
Salary analysis
First I’ll analyze the salary data for each role and experience level. I’ll use a RandomForest model to adjust for location differences, and convert any CAD-denominated salaries to USD. I’m interested to see what compensation looks like for each role, and whether there are any significant differences in compensation between the roles.
The code below implements a salary analysis system for the job posting data. It processes salary information across different roles and experience levels, handling currency conversions (CAD to USD) and adjusting for location differences using a RandomForest model. For each role and level combination, it calculates salary ranges, removes statistical outliers, and provides sample sizes. The system includes configuration settings for analysis parameters and comprehensive error handling. The code uses pandas for data manipulation and scikit-learn for the location adjustment model, with results formatted as salary ranges in USD.
Click to view the code for salary analysis
from dataclasses import dataclass
from typing import Dict, List, Optional, TypedDict, Set
import logging
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


@dataclass
class SalaryConfig:
    """Configuration settings for salary analysis"""
    outlier_threshold: float = 1.5
    n_estimators: int = 100
    min_group_size: int = 3
    currency_symbol: str = '$'
    cv_folds: int = 5


class SalaryColumns(TypedDict):
    """Type definitions for salary DataFrame columns"""
    min_salary_usd_k: float
    max_salary_usd_k: float
    median_salary_usd_k: float
    sample_size: int
    location_adjusted_zscore: float


class SalaryAnalyzer:
    """Analyzes salary distributions across role titles.

    Column Naming Conventions:
    - *_usd_k: Values in thousands of USD
    - *_zscore: Standardized scores (mean=0, std=1)
    - sample_size: Number of data points in group
    - *_median: Median values
    - location_adjusted_*: Values adjusted for location differences

    Attributes:
        config: Configuration settings for analysis
        logger: Logger instance for the class
    """

    # Input column names
    COL_MIN_SALARY = 'min_salary'
    COL_MAX_SALARY = 'max_salary'
    COL_ROLE_TITLE = 'role_title'
    COL_LEVEL = 'level'
    COL_LOCATION = 'location'

    # Processed column names
    COL_MEDIAN_SALARY = 'median_salary_usd_k'
    COL_MIN_SALARY_MEDIAN = 'min_salary_median_usd_k'
    COL_MAX_SALARY_MEDIAN = 'max_salary_median_usd_k'
    COL_SAMPLE_SIZE = 'sample_size'

    # Z-score columns
    COL_MIN_ZSCORE = 'min_salary_location_adjusted_zscore'
    COL_MID_ZSCORE = 'median_salary_location_adjusted_zscore'
    COL_MAX_ZSCORE = 'max_salary_location_adjusted_zscore'

    def __init__(self, config: Optional[SalaryConfig] = None):
        """Initialize SalaryAnalyzer with optional configuration."""
        self.config = config or SalaryConfig()
        self.logger = logging.getLogger(__name__)
        self._model = None  # Cache for trained model

    def analyze_role_salaries(self, df: pd.DataFrame) -> pd.DataFrame:
        """Analyze salary distributions across role titles with location adjustment.

        Args:
            df: DataFrame containing columns:
                - min_salary: Minimum salary (float)
                - max_salary: Maximum salary (float)
                - role_title: Job role (str)
                - level: Experience level (str)
                - location: Job location (str)

        Returns:
            DataFrame with aggregated salary statistics by role and level

        Raises:
            ValueError: If required columns are missing or DataFrame is empty
        """
        self._validate_input(df)
        self.logger.info(f"Starting salary analysis with {len(df)} records")

        # Create working copy of DataFrame
        analysis_df = self._prepare_data(df)
        # Process and clean data
        analysis_df = self._process_data(analysis_df)
        # Generate final results
        results = self._aggregate_results(analysis_df)

        self.logger.info("Salary analysis completed successfully")
        return results

    def _validate_input(self, df: pd.DataFrame) -> None:
        """Validate input DataFrame."""
        if df is None or df.empty:
            raise ValueError("Input DataFrame cannot be None or empty")
        required_cols = {
            self.COL_MIN_SALARY,
            self.COL_MAX_SALARY,
            self.COL_ROLE_TITLE,
            self.COL_LEVEL,
            self.COL_LOCATION
        }
        missing_cols = required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"DataFrame missing required columns: {missing_cols}")

    def _prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Prepare data for analysis by filtering valid salaries."""
        valid_salary_mask = (
            df[self.COL_MIN_SALARY].notna() & df[self.COL_MAX_SALARY].notna()
        )
        if not valid_salary_mask.any():
            self.logger.warning("No valid salary data found")
            return pd.DataFrame()

        # Create working copy and calculate median salary
        analysis_df = df[valid_salary_mask].copy()
        analysis_df[self.COL_MEDIAN_SALARY] = (
            analysis_df[[self.COL_MIN_SALARY, self.COL_MAX_SALARY]].mean(axis=1)
        )
        return analysis_df

    def _remove_outliers(self, group: pd.DataFrame) -> pd.DataFrame:
        """Remove statistical outliers from salary data."""
        if len(group) <= self.config.min_group_size:
            return group

        # Only check min and max salary for outliers
        salary_cols = [self.COL_MIN_SALARY, self.COL_MAX_SALARY]
        for col in salary_cols:
            Q1 = group[col].quantile(0.25)
            Q3 = group[col].quantile(0.75)
            IQR = Q3 - Q1
            outlier_mask = ~(
                (group[col] < (Q1 - self.config.outlier_threshold * IQR)) |
                (group[col] > (Q3 + self.config.outlier_threshold * IQR))
            )
            group = group[outlier_mask]

        # Recalculate median after removing outliers
        group[self.COL_MEDIAN_SALARY] = (
            group[self.COL_MIN_SALARY] + group[self.COL_MAX_SALARY]
        ) / 2
        return group

    def _aggregate_results(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate and format the final results."""
        # First calculate group statistics
        grouped = df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL]).agg({
            self.COL_MIN_SALARY: ['count', 'median'],
            self.COL_MAX_SALARY: 'median',
        }).round(0)

        # Calculate median salary from min and max medians
        grouped['median_salary'] = (
            grouped[(self.COL_MIN_SALARY, 'median')] +
            grouped[(self.COL_MAX_SALARY, 'median')]
        ) / 2

        # Flatten and rename columns
        grouped.columns = [
            'sample_size' if col == (self.COL_MIN_SALARY, 'count')
            else 'min_salary_usd_k' if col == (self.COL_MIN_SALARY, 'median')
            else 'max_salary_usd_k' if col == (self.COL_MAX_SALARY, 'median')
            else 'median_salary_usd_k' if col[0] == 'median_salary'
            else col
            for col in grouped.columns
        ]

        # Order columns
        ordered_cols = [
            'sample_size',
            'min_salary_usd_k',
            'max_salary_usd_k',
            'median_salary_usd_k'
        ]
        result = grouped.reindex(columns=ordered_cols)

        # Format salary values
        salary_cols = [col for col in ordered_cols if col.endswith('_usd_k')]
        for col in salary_cols:
            result[col] = result[col].apply(
                lambda x: f"{self.config.currency_symbol}{x:,.0f}K" if pd.notna(x) else "N/A"
            )
        return result

    def _process_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process and clean the salary data."""
        # Remove outliers by group
        df = (df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL])
              .apply(self._remove_outliers)
              .reset_index(drop=True))
        # Add location-adjusted scores
        df = self._add_location_adjusted_scores(df)
        # Clean level names
        df[self.COL_LEVEL] = (df[self.COL_LEVEL]
                              .str.replace('-Level / Level', '', regex=False)
                              .str.replace('-level', '', regex=False))
        return df

    def _train_model(self) -> RandomForestRegressor:
        """Train and cache RandomForest model."""
        if self._model is None:
            self._model = RandomForestRegressor(
                n_estimators=self.config.n_estimators,
                random_state=42
            )
        return self._model

    def _add_location_adjusted_scores(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add location-adjusted z-scores to the dataframe."""
        location_dummies = pd.get_dummies(df[self.COL_LOCATION], prefix='loc')
        model = self._train_model()
        score_columns = {
            self.COL_MIN_SALARY: self.COL_MIN_ZSCORE,
            self.COL_MEDIAN_SALARY: self.COL_MID_ZSCORE,
            self.COL_MAX_SALARY: self.COL_MAX_ZSCORE
        }
        for source_col, target_col in score_columns.items():
            df[target_col] = self._adjust_salaries(
                df[source_col], location_dummies, model
            )
        return df

    def _adjust_salaries(self, salary_series: pd.Series, X: pd.DataFrame,
                         model: RandomForestRegressor) -> pd.Series:
        """Adjust salaries using RandomForest model."""
        # Evaluate model performance
        scores = cross_val_score(model, X, salary_series, cv=self.config.cv_folds)
        self.logger.debug(f"Cross-validation scores: {scores.mean():.3f} ± {scores.std():.3f}")

        # Fit model and calculate residuals
        model.fit(X, salary_series)
        expected = model.predict(X)
        residuals = salary_series - expected

        # Return standardized residuals
        return (residuals - residuals.mean()) / residuals.std()
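The grouped results below can presumably be produced along these lines (the exact invocation isn't shown in the original notebook):

# Run the analysis on the merged DataFrame from the loading step
# and display the (role_title, level)-indexed summary table.
analyzer = SalaryAnalyzer()
salary_summary = analyzer.analyze_role_salaries(df)
salary_summary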
2025-01-24 13:50:54,315 - INFO - Starting salary analysis with 187 records
2025-01-24 13:50:56,541 - INFO - Salary analysis completed successfully
| role_title | level | sample_size | min_salary_usd_k | max_salary_usd_k | median_salary_usd_k |
|---|---|---|---|---|---|
| AI Research Scientist | Entry | 3 | $140K | $170K | $155K |
| AI Research Scientist | Mid | 8 | $122K | $209K | $166K |
| AI Research Scientist | Senior | 24 | $159K | $251K | $205K |
| AI Solution Architect | Entry | 2 | $133K | $266K | $200K |
| AI Solution Architect | Mid | 7 | $120K | $206K | $163K |
| AI Solution Architect | Senior | 33 | $146K | $247K | $196K |
| AI/ML Engineer | Entry | 1 | $135K | $251K | $193K |
| AI/ML Engineer | Mid | 11 | $129K | $224K | $176K |
| AI/ML Engineer | Senior | 28 | $158K | $268K | $213K |
| MLOps / AI Infrastructure Engineer | Entry | 2 | $152K | $260K | $206K |
| MLOps / AI Infrastructure Engineer | Mid | 5 | $143K | $214K | $178K |
| MLOps / AI Infrastructure Engineer | Senior | 16 | $164K | $258K | $211K |
Looking across all four GenAI roles (AI Research, Solution Architecture, ML Engineering, and MLOps), compensation bands are broadly similar. At senior levels (where I have the largest samples), median salaries cluster tightly between ~$195K and $210K USD. Mid-level positions show medians between ~$165K and $180K USD, and entry-level positions generally start between ~$155K and $205K USD, although the entry-level data is limited by small sample sizes.
This consistency suggests these roles are valued similarly in the market, despite their different focuses.
Job responsibilities
Next, I’m interested in what the job responsibilities are for these roles, and how they differ. I’ll use an LLM to identify the most common responsibilities and skills for a given role. This task is quite complex and involves a lot of input data, so a more capable model like o1 is appropriate. I don’t have access to o1 via API, so I’ll use the web interface to generate completions, preparing XML files for the inputs to this task so that I can copy-paste the XML into the chat.
As we’ll see, Research Scientists lean heavily into discovery, theoretical advances, publication, and cutting-edge experimentation, whereas ML Engineers center on production-grade systems, robust architecture, MLOps, and alignment with business needs. In many cases, though, the lines between these roles are blurred.
The code for XML generation:
import logging
import os
from typing import Dict

import pandas as pd

INSTRUCTION_PREFIX = """Identify the most common Responsibilities and Skills listed in the jobs below, returning a bulleted list in the format

# Responsibilities
- [responsibility]: [responsibility description]
- [responsibility]: [responsibility description]
...

# Skills
- [skill]: [skill description]
- [skill]: [skill description]
...
"""


class XMLGenerator:
    """Handles generation of XML files from job data"""

    def __init__(self, output_dir: str = 'xml_roles'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )

    def compile_xml_for_role(self, level: str, role_title: str, df: pd.DataFrame) -> Dict:
        """Generate XML-formatted string of job details for the given level and role title."""
        # Filter DataFrame for matching level and role
        filtered_df = df[
            (df['level'] == level) &
            (df['role_title'] == role_title)
        ]
        if len(filtered_df) == 0:
            return {
                'level': level,
                'role_title': role_title,
                'xml_output': '',
                'error': f"No jobs found for {level} {role_title}"
            }

        # Build XML string
        xml_output = ""
        for _, job in filtered_df.iterrows():
            xml_output += "<Job>\n"
            xml_output += f"<Responsibilities>{job['responsibilities']}</Responsibilities>\n"
            xml_output += f"<Skills>{job['skills']}</Skills>\n"
            xml_output += "</Job>\n"

        return {
            'level': level,
            'role_title': role_title,
            'xml_output': INSTRUCTION_PREFIX + xml_output
        }

    def save_xml(self, level: str, role_title: str, xml_data: Dict) -> None:
        """Save XML data to a file"""
        if not xml_data['xml_output']:
            # logging.warning(xml_data.get('error', 'Empty XML output'))
            return

        # Clean and sanitize the filename components
        cleaned_level = level.replace('-Level / Level', '').strip()
        safe_level = cleaned_level.replace('/', '_').replace(' ', '_')
        safe_role = role_title.replace('/', '_').replace(' ', '_')
        filename = os.path.join(self.output_dir, f"{safe_level}_{safe_role}.xml")

        with open(filename, 'w') as f:
            f.write(xml_data['xml_output'])
        logging.debug(f"Saved XML for {cleaned_level} {role_title} to {filename}")

    def generate_all_xml(self, df: pd.DataFrame) -> None:
        """Generate XML files for all role/level combinations"""
        if df.empty:
            logging.warning("No jobs found in DataFrame - check if data was loaded correctly")
            return

        required_cols = ['level', 'role_title', 'responsibilities', 'skills']
        if not all(col in df.columns for col in required_cols):
            logging.error(f"DataFrame missing required columns: {required_cols}")
            return

        # Process each unique role/level combination
        role_combinations = df[['level', 'role_title']].drop_duplicates()
        for _, combo in role_combinations.iterrows():
            xml_data = self.compile_xml_for_role(
                combo['level'], combo['role_title'], df
            )
            self.save_xml(combo['level'], combo['role_title'], xml_data)
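Generating the files is then a single call against the merged DataFrame; each role/level combination is written to its own XML file, which I can open and paste into the chat. A quick sketch, assuming df is the merged DataFrame from earlier:

# Write one XML file per role/level combination into ./xml_roles
# (e.g. Senior_AI_Research_Scientist.xml), ready to paste into the
# o1 web interface.
generator = XMLGenerator()
generator.generate_all_xml(df)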
Here’s what o1 said about the Responsibilities and Skills observed in Senior AI Research Scientist roles:
Responsibilities
Conduct advanced AI research
Many roles require pushing the state-of-the-art in Generative AI, LLMs, and related areas (e.g., video/multimodal models, diffusion models) through novel algorithms, architectures, and experimental studies.
Train and fine-tune large-scale models
Commonly involves working with massive datasets and distributed training setups (thousands of GPUs, HPC environments) to develop foundation models and advanced AI systems.
Develop and implement new algorithms or architectures
Spans designing novel model architectures (e.g., diffusion, transformer-based, multimodal fusion) and creating robust data processing or simulation pipelines to support AI solutions.
Collaborate with cross-functional teams
Emphasizes close work with engineering, product management, research, and external stakeholders to integrate AI breakthroughs into real-world applications and products.
Evaluate and measure AI performance
Entails building rigorous evaluation frameworks, designing new metrics, and systematically analyzing model behavior to ensure quality and reliability.
Publish and communicate research findings
Many positions highlight writing influential papers, presenting at conferences, and sharing innovative results both internally and with the broader AI community.
Build and maintain data pipelines
Involves constructing high-quality, scalable data pipelines or tooling to support training, fine-tuning, and inference of large models.
Ensure production-grade implementation
Requires writing clean, efficient, and maintainable code as well as optimizing models and pipelines to meet performance, reliability, and quality standards.
Skills
Proficiency in Python and deep learning frameworks
Strong coding skills in Python and hands-on experience with libraries such as PyTorch, TensorFlow, or JAX appear in nearly every role.
Expertise with LLMs and Generative AI
Deep understanding of transformer architectures, diffusion models, multimodal systems, prompt engineering, and other advanced AI techniques is frequently mentioned.
Experience with large-scale/distributed training
Many roles emphasize knowledge of HPC, GPU optimization, model parallelism (e.g., FSDP, DeepSpeed, Megatron-LM), and handling massive datasets.
Strong software engineering practices
Testing, code review, debugging, version control, and producing clean, modular research or production code are consistently important.
Collaboration and communication skills
Clear written and verbal communication, along with cross-functional teamwork, is vital for integrating AI solutions into products and relaying complex ideas.
Research acumen and adaptability
Ability to read, interpret, and prototype cutting-edge AI literature, publish findings, and rapidly iterate on experiments.
Machine Learning fundamentals
Solid grounding in ML theory (e.g., optimization, statistics, data structures) and experience with model evaluation, data manipulation, and pipeline design.
Familiarity with prompt engineering and advanced NLP concepts
Many roles highlight crafting effective prompts, aligning model outputs with user needs, and leveraging text-generation or conversational AI techniques.
And here’s what o1 said about the Responsibilities and Skills observed in Senior AI/ML Engineer roles:
Responsibilities
Design, develop, and deploy AI/ML solutions
End-to-end creation of machine learning systems, from initial prototypes to production-ready deployments.
Collaborate with cross-functional teams
Work closely with product managers, data scientists, engineers, and other stakeholders to align technical solutions with business goals.
Monitor and optimize model performance
Track key metrics, fine-tune models, and iterate to ensure continuous improvement and reliability in production.
Stay current with AI research and emerging technologies
Keep up-to-date with the latest breakthroughs in areas like LLMs, generative AI, and deep learning.
Mentor and coach team members
Provide guidance on best practices, design patterns, code quality, and career development for junior or peer engineers.
Develop scalable data/ML pipelines
Build robust infrastructure for data collection, preprocessing, model training, and deployment at scale.
Implement and maintain CI/CD and coding best practices
Ensure code quality, streamline release processes, and enforce testing discipline for AI/ML components.
Integrate and leverage LLMs/generative AI
Incorporate large language models or generative methods into products and workflows.
Prototype and experiment
Conduct R&D, proof-of-concepts, and pilot programs to explore emerging AI techniques and validate new product ideas.
Document and communicate findings
Produce clear technical documentation, share results with stakeholders, and provide actionable insights for decision-making.
Skills
Proficiency in Python
Commonly required for AI/ML development, data manipulation, and scripting.
Experience with ML/DL frameworks
Hands-on expertise in tools like PyTorch, TensorFlow, or JAX for building and training models.
Familiarity with cloud platforms
Working knowledge of AWS, GCP, or Azure for deploying and scaling AI solutions.
Expertise in LLMs/generative AI
Understanding of transformer architectures, prompt engineering, retrieval-augmented generation (RAG), and related libraries.
Strong software engineering fundamentals
Solid grasp of algorithms, data structures, design patterns, and best practices for production code.
Knowledge of MLOps and CI/CD
Experience with containerization (Docker, Kubernetes), version control (Git), and automated testing/monitoring.
Data processing and SQL
Skills in handling large datasets, working with Spark or similar frameworks, and writing performant SQL queries.
Effective communication and collaboration
Ability to translate complex technical concepts for non-technical stakeholders and work well in diverse teams.
Problem-solving and debugging
Track record of diagnosing issues in production environments and implementing reliable fixes.
Continuous learning mindset
Eagerness to stay on top of new AI research, frameworks, and technologies to innovate and improve solutions.
Job titles
What are the most common job titles for these roles?
The code for aggregating titles:
import json
import os

import pandas as pd


def aggregate_titles_by_role_level(df: pd.DataFrame, output_dir: str = 'role_titles') -> None:
    """
    Aggregate job titles for each role_title and level combination and save to JSON.

    Args:
        df: DataFrame containing 'role_title', 'level', and 'title' columns
        output_dir: Directory to save the JSON output
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Group by role_title and level
    grouped = df.groupby(['role_title', 'level'])

    for (role, level), group in grouped:
        # Get unique titles for this combination
        result = {
            'titles': group['title'].unique().tolist()
        }

        # Create safe filename from role and level
        safe_role = role.replace('/', '_').replace(' ', '_')
        safe_level = level.replace('/', '_').replace(' ', '_')
        filename = f"{safe_level}_{safe_role}.json"

        # Save to JSON file
        with open(os.path.join(output_dir, filename), 'w') as f:
            json.dump(result, f, indent=2)

        print(f"Saved titles for {level} {role}")
aggregate_titles_by_role_level(df)
Saved titles for Entry-level AI Research Scientist
Saved titles for Mid-Level / Level AI Research Scientist
Saved titles for Senior-Level / Level AI Research Scientist
Saved titles for Entry-level AI Solution Architect
Saved titles for Mid-Level / Level AI Solution Architect
Saved titles for Senior-Level / Level AI Solution Architect
Saved titles for Entry-level AI/ML Engineer
Saved titles for Mid-Level / Level AI/ML Engineer
Saved titles for Senior-Level / Level AI/ML Engineer
Saved titles for Entry-level MLOps / AI Infrastructure Engineer
Saved titles for Mid-Level / Level MLOps / AI Infrastructure Engineer
Saved titles for Senior-Level / Level MLOps / AI Infrastructure Engineer
I’ll use o1 again to generate a summary of the most common titles for each role/level combination. In the prompt, I will omit my own role classification so as not to bias the results. I’ll ask it to “identify the most common titles listed below (ignoring slight variations), and then for those titles identify how often they occurred.”
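For illustration, here is roughly how one of the saved JSON files becomes a prompt for the web interface. The exact filename follows the pattern produced by the aggregation code above, and the instruction text is the one quoted in the previous paragraph:

import json

# Build the title-frequency prompt from one saved file; the filename follows
# the {safe_level}_{safe_role}.json pattern from aggregate_titles_by_role_level().
with open('role_titles/Senior-Level___Level_AI_Research_Scientist.json') as f:
    titles = json.load(f)['titles']

prompt = (
    "Identify the most common titles listed below (ignoring slight "
    "variations), and then for those titles identify how often they "
    "occurred.\n\n" + "\n".join(titles)
)
print(prompt)  # copy-paste into the o1 web interface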
Here’s a sample of what it said.
For Senior AI Research Scientist roles, o1 observed the following titles to be common to this role:
Research Scientist/Researcher (14 occurrences)
Research Engineer (7 occurrences)
Machine Learning Engineer (3 occurrences)
Generative AI Engineer (2 occurrences)
Software Engineer (Generative AI) (2 occurrences)
For Senior AI/ML Engineer roles, o1 observed the following titles to be common to this role:
Senior Software Engineer (6 occurrences)
Senior Machine Learning Engineer (5 occurrences)
Staff Software Engineer (5 occurrences)
Senior AI Engineer (4 occurrences)
Machine Learning Researcher (3 occurrences)
Staff Machine Learning Engineer (2 occurrences)
Principal Software Engineer (2 occurrences)
Source Code
---title: "Analysis of the GenAI/LLM job market in January 2025"date: 2025-01-24description: "A data-driven exploration of the GenAI/LLM job market for science and engineering roles in January 2025. I scrape ~1000 job postings from ai-jobs.net, perform data extraction and classification using LLMs, and then analyze the data to identify patterns and insights about the GenAI/LLM job market, including salary ranges, skill requirements, and role distribution."categories: - prompt-engineering - python - job-market - GenAI - LLMfreeze: true---# IntroductionAs someone currently working in a "Prompt Engineering" role, I've been thinking a lot about how this title basically doesn't exist outside of a handful of companies, how the title communicates a narrow range of skills and responsibilities,and how the work that I do day-to-day is much larger in scope than just writing prompts. I identify more as an AI Engineer or AI Research Scientist, and so I was interested to see what I could learn about other similar roles that work with GenAI and LLMs.So with that motivation, I set about to collect some data and look at the job market for these sorts of roles. What responsibilities and skills are being advertised most often, what kinds of titles are being used for these roles, and what kind of compensation is being offered at different levels of seniority?To accomplish this, I built a custom web scraper to collect ~1000 job postings from ai-jobs.net. My code gathers job details like the title, company, location, salary, and posting date -- from a range of U.S. and Canadian cities, covering entry-level, mid-level, and senior positions. I then use a Large Language Model (LLM) to extract each job's key responsibilities, required skills, and qualifications. Afterwards, I classify each position to see whether it involves working with Generative AI or large language models, and if so, categorize it further into four major AI roles: (1) AI Research Scientist, (2) AI/ML Engineer, (3) MLOps/AI Infrastructure Engineer, and (4) AI Solution Architect. I believe this set of roles is a good representation of different areas of focus.Finally, I integrate the various metadata and classifications into a comprehensive dataset. I observe that GenAI/LLM positions command consistently high salary ranges across the four different roles, particularly at more senior levels. Senior-level roles tended to offer median salaries in the $195K–$210K range, while mid-level roles generally clustered around $165K–$180K. Entry-level salaries showed greater variation (likely due to the small sample size) but still landed in competitive ranges of roughly $155K–$205K in many postings. These roles often share common technical demands—like proficiency with large-scale model training, distributed computing, and LLM-specific knowledge—though each role emphasizes distinct priorities (research vs. production, for example).Of course, this analysis is not without limitations. For example, I am relying on a single job board, and I scraped jobs during a limited window of time. I also have not rigorously validated the LLM classifications -- although I have implemented many prompt engineering best practices, and have used some of the more powerful LLMs (4o and o1). To some extent, the responsibilities and skills that were recovered from the original job postings classified into the four pre-defined roles do speak to the relative accuracy of the role classifications. 
The salary analysis also does not distinguish between base salary and total compensation, remote vs. in-person opportunities, large vs. small companies, and so on. But overall, I think this gives some sense of the job market for these roles.# Scraping job postingsI'll scrape job postings from ai-jobs.net. This is an aggregator that specializes in AI jobs of all kinds, sourcing jobs from over 60 countries, and in my experience, it does a pretty good job.The code below implements a web scraper for job postings from ai-jobs.net. It collects job postings from major cities in the US and Canada, searching across entry-level, mid-level, and senior positions. For each job, it gets the title, company, location, salary, description, and posting date. The scraper saves each job as a JSON file and keeps track of what it has already scraped to avoid duplicates. It includes error handling and logging to track any problems that occur during scraping.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the Job Posting Scraper code"from dataclasses import dataclassfrom typing import List, Dict, Optional, Setfrom datetime import datetime, timedeltaimport loggingimport jsonimport requestsfrom bs4 import BeautifulSoupfrom pathlib import Pathimport reimport time@dataclassclass JobData:"""Structured container for job posting data""" title: str company: str location: str level: Optional[str] salary: Optional[str] url: str description: str scraped_date: str posted_date: Optional[str] raw_data: strclass ScrapeConfig:"""Configuration settings for the scraper""" EXPERIENCE_LEVELS = ['EN', 'MI', 'SE'] # Entry, Mid, Senior CITIES = {'5391959': 'San Francisco','5128581': 'New York City','6167865': 'Toronto','6173331': 'Vancouver','5809844': 'Seattle','4671654': 'Austin','4930956': 'Boston','5': 'Region' } CATEGORIES = {'1': 'Research','2': 'Engineering','18': 'GenerativeAI' } BASE_URL ="https://ai-jobs.net"class SalaryExtractor:"""Handles salary extraction logic"""@staticmethoddef extract_from_schema(schema_data: Dict) -> Optional[str]:"""Extract salary from schema.org data"""try:if schema_data and'baseSalary'in schema_data: base_salary = schema_data.get('baseSalary', {}).get('value', {})if base_salary: min_value = base_salary.get('minValue') max_value = base_salary.get('maxValue') currency = base_salary.get('currency', 'USD') # Default to USD if not specifiedif min_value and max_value:returnf"{currency}{float(min_value)/1000:.0f}K - {float(max_value)/1000:.0f}K"elif min_value:returnf"{currency}{float(min_value)/1000:.0f}K+"exceptExceptionas e: logging.debug(f"Error extracting schema salary: {e}")returnNone@staticmethoddef extract_from_text(description: str) -> Optional[str]:"""Extract salary from description text""" patterns = [r'(?:salary range.*?)(?:CAD|\$)?([\d,]+)\s*-\s*(?:CAD|\$)?([\d,]+)',r'(?:salary.*?)(?:CAD|\$)?([\d,]+)(?:\s*-\s*(?:CAD|\$)?([\d,]+))?' 
]for pattern in patterns:if match := re.search(pattern, description, re.IGNORECASE):try: groups = match.groups()# Check if salary is in CAD currency ='CAD'if'CAD'in description.upper() else'USD'iflen(groups) ==2and groups[1]: # Range format min_sal =float(groups[0].replace(',', '')) max_sal =float(groups[1].replace(',', ''))returnf"{currency}{min_sal/1000:.0f}K - {max_sal/1000:.0f}K"elif groups[0]: # Single value format base_sal =float(groups[0].replace(',', ''))returnf"{currency}{base_sal/1000:.0f}K+"exceptValueError:continuereturnNone@classmethoddef extract(cls, soup: BeautifulSoup, description: str, schema_data: Dict) -> Optional[str]:"""Main salary extraction method"""# Try schema data firstif salary := cls.extract_from_schema(schema_data):return salary# Try description textif salary := cls.extract_from_text(description):return salary# Try salary badgeif salary_badge := soup.find('span', class_='badge rounded-pill text-bg-success'): salary_text = salary_badge.text.strip()if re.search(r'(USD|\$|\d)', salary_text):return salary_textreturnNoneclass JobPageParser:"""Handles parsing of individual job pages"""def__init__(self, html: str):self.soup = BeautifulSoup(html, 'html.parser')self.raw_html = htmldef parse_schema_data(self) -> Dict:"""Parse schema.org JSON-LD data"""if script :=self.soup.find('script', type='application/ld+json'):try: cleaned_script = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', script.string)return json.loads(cleaned_script)exceptExceptionas e: logging.warning(f"Could not parse JSON-LD data: {e}")return {}def parse_job_data(self, url: str) -> JobData:"""Extract all job data from the page""" schema_data =self.parse_schema_data() scraped_date = datetime.now().isoformat()# Get company name from schema or fallback to page element company = schema_data.get('hiringOrganization', {}).get('name')ifnot company:if company_elem :=self.soup.find('a', class_='company-name'): company = company_elem.text.strip()# Get job level from meta description level =Noneif meta_desc :=self.soup.find('meta', {'name': 'description'}):if level_match := re.search(r'a ([\w-]+level)', meta_desc['content']): level = level_match.group(1).replace('-', '-level / ').title()return JobData( title=self.soup.find('h1', class_='display-5').text.strip(), company=company, location=self.soup.find('h3', class_='lead').text.strip(), level=level, salary=SalaryExtractor.extract(self.soup,self.soup.find('div', class_='job-description-text').text.strip(), schema_data ), url=url, description=self.soup.find('div', class_='job-description-text').text.strip(), scraped_date=scraped_date, posted_date=self.calculate_posted_date(scraped_date), raw_data=self.raw_html )def calculate_posted_date(self, scraped_date_str: str) -> Optional[str]:"""Calculate posting date from 'posted X time ago' text"""if match := re.search(r'Posted (\d+) (hours?|days?|weeks?|months?) 
ago', self.raw_html): number =int(match.group(1)) unit = match.group(2) scraped_date = datetime.fromisoformat(scraped_date_str) delta = {'hour': timedelta(hours=number),'day': timedelta(days=number),'week': timedelta(weeks=number),'month': timedelta(days=number *30) # Approximate }.get(unit.rstrip('s'))if delta:return (scraped_date - delta).isoformat()returnNoneclass AIJobScraper:"""Main scraper orchestration class"""def__init__(self, output_dir: str='json_data'):self.output_dir = Path(output_dir)self.output_dir.mkdir(exist_ok=True)self.session = requests.Session()self.config = ScrapeConfig()self.existing_jobs =self._load_existing_jobs()def _load_existing_jobs(self) -> Set[str]:"""Load set of existing job URLs""" existing_jobs =set()for f inself.output_dir.glob('*.json'):try: data = json.loads(f.read_text()) existing_jobs.add(data['url'])exceptExceptionas e: logging.warning(f"Error reading {f}: {e}")return existing_jobsdef _generate_search_urls(self) -> List[str]:"""Generate all search URL combinations""" urls = []for exp inself.config.EXPERIENCE_LEVELS:for cat inself.config.CATEGORIES:for city inself.config.CITIES: url =f"{self.config.BASE_URL}/?" url +=f"cat={cat}" url +=f"&{'reg'if city =='5'else'cit'}={city}" url +=f"&typ=1&key=&exp={exp}&sal=" urls.append(url)return urlsdef get_job_urls(self, search_url: str) -> List[str]:"""Get all job URLs from a search page"""try: response =self.session.get(search_url) response.raise_for_status() soup = BeautifulSoup(response.text, 'html.parser')if job_list := soup.find('ul', id='job-list'):return [f"{self.config.BASE_URL}{link['href']}"for link in job_list.find_all('a', href=lambda h: h and'/job/'in h) ]exceptExceptionas e: logging.error(f"Error fetching job URLs: {e}")return []def scrape_job_details(self, url: str) -> Optional[JobData]:"""Scrape details from a job page"""try: response =self.session.get(url) response.raise_for_status() parser = JobPageParser(response.text)return parser.parse_job_data(url)exceptExceptionas e: logging.error(f"Error scraping job {url}: {e}")returnNonedef save_job(self, job: JobData) ->None:"""Save job data to JSON file"""try: safe_title = re.sub(r'[^\w\s-]', '', job.title) safe_company = re.sub(r'[^\w\s-]', '', job.company or'unknown') filename =f"{safe_company}_{safe_title}_{hash(job.url)}.json"self.output_dir.joinpath(filename).write_text( json.dumps(vars(job), indent=2, ensure_ascii=False) ) logging.info(f"Saved job: {job.title} at {job.company}")exceptExceptionas e: logging.error(f"Error saving job: {e}")raisedef scrape_jobs(self, test: bool=False) ->int:"""Main scraping method""" jobs_found = jobs_skipped =0for search_url inself._generate_search_urls(): logging.debug(f"Processing search URL: {search_url}") job_urls =self.get_job_urls(search_url)if test and job_urls: job_urls = job_urls[:1]for url in job_urls:if url inself.existing_jobs: jobs_skipped +=1continueif job_data :=self.scrape_job_details(url):self.save_job(job_data)self.existing_jobs.add(url) jobs_found +=1 time.sleep(1) # Rate limiting logging.info(f"Found {jobs_found} new jobs. 
Skipped {jobs_skipped} existing jobs.")return {'jobs_found': jobs_found, 'jobs_skipped': jobs_skipped}def scrape_jobs(test: bool=False, verbose: bool=False) ->int:"""Convenience function to run the scraper"""ifnot verbose: logging.getLogger().setLevel(logging.WARNING)try: scraper = AIJobScraper(output_dir=Path.cwd() /'json_data')return scraper.scrape_jobs(test=test)exceptExceptionas e: logging.error(f"Scraper failed: {e}")raise``````{python}#logging.basicConfig(level=logging.INFO)#scraper = AIJobScraper()#scrape_results = scraper.scrape_jobs(test=False)#print(f"Found {scrape_results['jobs_found']} new jobs, skipped {scrape_results['jobs_skipped']} existing jobs")```# General-purpose LLM completion functionFor the classification tasks, I'll write a general-purpose LLM completion function. This function takes a prompt and a model as parameters, and returns the completion from the OpenAI API.```{python}from openai import AsyncOpenAIimport osasyncdef get_completion(prompt: str, model: str="gpt-4o-2024-08-06") ->str:"""Get a completion from the OpenAI API.""" client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY')) # Initialize with API key from env vars response =await client.chat.completions.create( model=model, messages=[{"role": "system", "content": prompt}], temperature=0 )return response.choices[0].message.content```# Extract Responsibilities, Skills, and QualificationsThe code below uses an LLM to extract key information from job postings that will be used to classify them into different role categories. It uses OpenAI's GPT-4o model to analyze job descriptions and extract three key components: responsibilities, skills, and qualifications. The system processes multiple jobs concurrently with rate limiting, saves the extracted data as JSON files, and includes retry logic for handling API rate limits. The code uses `asyncio` for concurrent processing and includes error handling and logging. It also checks for previously processed jobs to avoid duplicate work.Also included is code that implements data preprocessing steps. This code has functions to load JSON files into a pandas DataFrame, uses a list of keywords to filter out certain jobs that are not relevant to the analysis, and removes duplicate or highly similar job descriptions using TF-IDF vectorization and cosine similarity. The main function `process_job_listings()` combines these steps, taking a directory path, similarity threshold, and filter keywords as parameters. It returns a dictionary containing the processed DataFrame along with counts of the original and filtered entries.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for LLM extraction"import osimport jsonimport loggingimport asyncioimport hashlibimport nest_asynciofrom typing import Dict, List, Optionalimport pandas as pdfrom bs4 import BeautifulSoupfrom tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_typefrom openai import AsyncOpenAI, RateLimitErrorfrom sklearn.feature_extraction.text import TfidfVectorizerfrom sklearn.metrics.pairwise import cosine_similarityimport re# Enable nested event loops for Jupyter notebooksnest_asyncio.apply()# Set up logging configurationlogging.basicConfig( level=logging.INFO,format='%(asctime)s - %(levelname)s - %(message)s')PROMPTS = {"prompt": """ You will be given a Job Description. Your task is to extract a list of Responsibilities, Skills, and Qualifications. - Responsibilities are the tasks and activities that the job requires the employee to perform. 
- Skills are the abilities and knowledge that the employee needs to have to perform the responsibilities. - Qualifications are the requirements that the employee needs to meet to be considered for the job. <JobDescription>{job_description} </JobDescription> Return a list of Responsibilities, Skills, and Qualifications as follows: <Responsibilities> [Bullet point list of responsibilities] </Responsibilities> <Skills> [Bullet point list of skills] </Skills> <Qualifications> [Bullet point list of qualifications] </Qualifications> """}class JobProcessor:def__init__(self, output_dir='json_extracted_data'):self.output_dir = output_dir os.makedirs(output_dir, exist_ok=True)self.semaphore = asyncio.Semaphore(3)self.processed =0self.skipped =0def _parse_llm_response(self, response: str) -> Dict[str, List[str]]:"""Parse LLM response into structured data.""" result = {"responsibilities": [],"skills": [],"qualifications": [] } sections = {"responsibilities": r'<Responsibilities>\n(.*?)\n</Responsibilities>',"skills": r'<Skills>\n(.*?)\n</Skills>',"qualifications": r'<Qualifications>\n(.*?)\n</Qualifications>' }for section, pattern in sections.items():if match := re.search(pattern, response, re.DOTALL): result[section] = [ item.strip('- ') for item in match.group(1).strip().split('\n')if item.strip('- ') # Filter out empty items ]return resultasyncdef extract_job_details(self, job_id: str, job_description: str) -> Dict[str, List[str]]:"""Extract structured information from job description using LLM."""try: prompt = PROMPTS["prompt"].format(job_description=job_description) response =await get_completion(prompt) result =self._parse_llm_response(response)# Save results json_path = os.path.join(self.output_dir, f'{job_id}.json')withopen(json_path, 'w', encoding='utf-8') as f: json.dump(result, f, indent=2, ensure_ascii=False)return resultexceptExceptionas e: logging.error(f"Error extracting job details: {str(e)}")return {"responsibilities": [], "skills": [], "qualifications": []}asyncdef process_single_job(self, job_id: str, description: str) -> Dict[str, List[str]]:"""Process a single job with caching.""" json_path = os.path.join(self.output_dir, f'{job_id}.json')if os.path.exists(json_path):self.skipped +=1 logging.debug(f"Skipping job ID {job_id} - already processed")withopen(json_path, 'r', encoding='utf-8') as f:return json.load(f)asyncwithself.semaphore:self.processed +=1 logging.info(f"Processing job ID: {job_id}")returnawaitself.extract_job_details(job_id, description)asyncdef process_jobs(self, df: pd.DataFrame) -> List[Dict[str, List[str]]]:"""Process multiple jobs concurrently.""" tasks = [self.process_single_job(str(row['id']), row['description'])for _, row in df.iterrows() ] results =await asyncio.gather(*tasks)print(f"\nProcessing complete:")print(f"- New jobs processed: {self.processed}")print(f"- Skipped jobs (already processed): {self.skipped}")return resultsclass DataPreprocessor:def__init__(self, similarity_threshold=0.8):self.similarity_threshold = similarity_thresholdself.default_filter_keywords = ['Dir ', 'Director', 'Intern', 'Data Scientist', 'Data Science','Content Writer', 'Faculty', 'Product Owner', 'Manager','Analyst', 'Postdoctoral', 'Postdoc', 'Summer' ]@staticmethoddef load_json_files(directory='json_data') -> pd.DataFrame:"""Load JSON files into DataFrame with error handling.""" df = pd.DataFrame()ifnot os.path.exists(directory): logging.error(f"Directory {directory} does not exist")return dfforfilein [f for f in os.listdir(directory) if 
f.endswith('.json')]:try:withopen(os.path.join(directory, file), 'r') as f: data = json.load(f) df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)exceptExceptionas e: logging.error(f"Error loading {file}: {e}")ifnot df.empty: df['id'] = df['url'].apply(lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}" )return dfdef filter_by_title(self, df: pd.DataFrame, filter_keywords: Optional[List[str]] =None) -> pd.DataFrame:"""Filter DataFrame by job titles."""if df.empty:return df keywords = filter_keywords orself.default_filter_keywordsreturn df[~df['title'].str.contains('|'.join(keywords), case=False, na=False )]def remove_similar_descriptions(self, df: pd.DataFrame) -> pd.DataFrame:"""Remove similar job descriptions using TF-IDF and cosine similarity."""if df.empty:return df tfidf = TfidfVectorizer(stop_words='english') tfidf_matrix = tfidf.fit_transform(df['description'].fillna('')) cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix) indices_to_remove = { j for i inrange(len(df))for j inrange(i +1, len(df))if cosine_sim[i][j] >self.similarity_threshold }return df.iloc[~df.index.isin(list(indices_to_remove))]def extract_jobs_from_dataframe(df: pd.DataFrame) -> List[Dict[str, List[str]]]:"""Wrapper function to process jobs from a DataFrame.""" processor = JobProcessor() loop = asyncio.get_event_loop()if loop.is_running():return loop.create_task(processor.process_jobs(df))else:return asyncio.run(processor.process_jobs(df))def process_job_listings(directory='json_data', similarity_threshold=0.8, filter_keywords=None) -> Dict:"""Main function to process job listings.""" preprocessor = DataPreprocessor(similarity_threshold)# Load and preprocess data df = preprocessor.load_json_files(directory) initial_count =len(df)if df.empty:return {'processed_data': df,'original_count': 0,'filtered_count': 0 } df = preprocessor.filter_by_title(df, filter_keywords) df_filtered = preprocessor.remove_similar_descriptions(df)return {'processed_data': df_filtered,'original_count': initial_count,'filtered_count': len(df_filtered) }``````{python}results = process_job_listings()df = results['processed_data']print(f'Original count of listings: {results["original_count"]}')print(f'Filtered count of listings: {results["filtered_count"]}')extract_jobs_from_dataframe(df)```# Classification## Classify jobs as relevant to GenAI/LLM work or not```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for job classification as GenAI/LLM relevant or not"import nest_asyncioimport reimport asyncioimport loggingfrom typing import List, Dict, Setfrom bs4 import BeautifulSoupimport pandas as pdfrom tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_typefrom openai import AsyncOpenAIimport osfrom openai import RateLimitErrorimport jsonclass JobClassifierGenAI:"""Handles classification of jobs for GenAI/LLM work""" PROMPT =""" You will be given a list of Responsibilities and Skills listed for a job. Your task is to determine if the job involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)). <Job> <Responsibilities>{responsibilities} </Responsibilities> <Skills>{skills} </Skills> </Job> Start by thinking step-by-step about the Job and its Responsibilities and Skills, and whether it involves working with Generative AI (GenAI) or language models (a.k.a. Large Language Models (LLMs)). 
Return your response in the following format: <Analysis> [Your analysis of the job and its Responsibilities and Skills] </Analysis> <FinalAnswer> true|false </FinalAnswer> """def__init__(self, output_dir='role_genai_classifications', batch_size=3):self.output_dir = output_dirself.batch_size = batch_sizeself.semaphore = asyncio.Semaphore(batch_size) os.makedirs(output_dir, exist_ok=True)asyncdef classify_job(self, job: Dict) -> Dict:"""Classify a single job listing""" logging.info(f"Classifying job '{job['filename']}'") prompt =self.PROMPT.format( responsibilities=job['responsibilities'], skills=job.get('skills', '') )try: response =await get_completion(prompt)await asyncio.sleep(2) # Rate limitingexceptExceptionas e: logging.error(f"Failed to classify job '{job['filename']}': {str(e)}")return {**job,'analysis': f'Failed to process: {str(e)}','is_genai_role': None }returnself._parse_response(job, response)def _parse_response(self, job: Dict, response: str) -> Dict:"""Parse LLM response into structured format""" soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml') analysis = soup.find('Analysis') final_answer = soup.find('FinalAnswer') is_genai_role =Noneif final_answer: answer_text = final_answer.text.strip().lower() is_genai_role =Trueif answer_text =='true'elseFalseif answer_text =='false'elseNonereturn {**job,'analysis': analysis.text.strip() if analysis else'','is_genai_role': is_genai_role }def save_classification(self, job_id: str, result: Dict) ->None:"""Save classification results to file""" filename = os.path.join(self.output_dir, f"{job_id}.json")withopen(filename, 'w') as f: json.dump(result, f, indent=2) logging.info(f"Saved classification for job {job_id}")def get_classified_jobs(self) -> Set[str]:"""Get set of already classified job IDs"""ifnot os.path.exists(self.output_dir):returnset()return {f[:-5] for f in os.listdir(self.output_dir) if f.endswith('.json')}asyncdef process_jobs_batch(self, jobs: List[Dict]) ->None:"""Process a batch of jobs concurrently"""asyncdef process_with_semaphore(job: Dict) ->None:asyncwithself.semaphore: result =awaitself.classify_job(job) job_id =str(result['filename'])self.save_classification(job_id, result)await asyncio.gather(*[process_with_semaphore(job) for job in jobs])asyncdef classify_jobs_async(self, df: pd.DataFrame) ->None:"""Process all unclassified jobs in the DataFrame""" total_jobs =len(df) logging.info(f"Starting classification of {total_jobs} jobs") classified_jobs =self.get_classified_jobs() jobs_to_process = [ job.to_dict() for idx, job in df.iterrows() ifstr(job['filename']) notin classified_jobs ]awaitself.process_jobs_batch(jobs_to_process) logging.info(f"Completed classification of all {total_jobs} jobs")def classify_jobs(self, df: pd.DataFrame) ->None:"""Main entry point for job classification"""if df.empty: logging.warning("Empty DataFrame provided")returnif'filename'notin df.columns: logging.error("DataFrame missing required 'filename' column")return classified_jobs =self.get_classified_jobs() logging.info(f"Found {len(classified_jobs)} previously classified jobs") new_jobs = df[~df['filename'].isin(classified_jobs)]if new_jobs.empty: logging.info("No new jobs to classify")return logging.info(f"Processing {len(new_jobs)} new jobs") logging.info(f"Skipping {len(df) -len(new_jobs)} existing jobs") loop = asyncio.get_event_loop() loop.run_until_complete(self.classify_jobs_async(new_jobs))class JobDataLoader:"""Handles loading and preprocessing of job data"""@staticmethoddef 
read_json_files(json_dir='json_extracted_data') -> List[Dict]:"""Read job data from JSON files""" result = []for filename in os.listdir(json_dir):if filename.endswith('.json'): file_path = os.path.join(json_dir, filename)try:withopen(file_path, 'r') as f: data = json.load(f) name = filename[:-5]if'responsibilities'in data and'skills'in data: result.append({'filename': name,'responsibilities': data['responsibilities'],'skills': data['skills'] })except json.JSONDecodeError: logging.error(f"Invalid JSON in {filename}")exceptExceptionas e: logging.error(f"Error processing {filename}: {str(e)}")return result@staticmethoddef load_classifications(input_dir='role_genai_classifications') -> pd.DataFrame:"""Load classification results into DataFrame"""ifnot os.path.exists(input_dir):return pd.DataFrame() all_results = []for filename in os.listdir(input_dir):if filename.endswith('.json'):withopen(os.path.join(input_dir, filename), 'r') as f: classification = json.load(f) all_results.append(classification)return pd.DataFrame(all_results)``````{python}loader = JobDataLoader()jobs = loader.read_json_files()df = pd.DataFrame(jobs)#classifier = JobClassifierGenAI()#classifier.classify_jobs(df)# Load results#results_df = loader.load_classifications()```## Classify GenAI/LLM jobs into pre-defined categoriesFor this next classification task, I'll make the assumption that there are four types of AI engineering and science roles that are relevant to work with GenAI systems. These are:1. AI Research Scientist2. AI/ML Engineer3. MLOps / AI Infrastructure Engineer4. AI Solution ArchitectI'll also include other categories that are not of interest, but may improve classification accuracy. These are:5. Data Scientist6. Data Engineer7. Product Manager8. Software EngineerThe code below implements an automated job classification system that uses an LLM to categorize job postings into the eight predefined roles listed above (four GenAI-focused and four related roles). It consists of several classes that work together: JobClassifier handles the core classification logic by comparing job descriptions against detailed role templates, JobData and ClassificationResult provide structured data containers, and JobProcessor manages the overall pipeline from loading jobs to saving results. The system processes jobs concurrently using asyncio, includes error handling and rate limiting, and outputs both an analysis explaining the classification and a final numerical category (0-8) for each job, with all results saved as JSON files for further analysis.Definitions of the roles can be found in the `JOB_DESCRIPTIONS` variable.```{python}#| code-fold: true#| code-fold-show: false#| code-summary: "Click to view the code for job classification into pre-defined roles"from dataclasses import dataclassfrom typing import List, Dict, Set, Optionalimport loggingimport jsonimport asyncioimport osfrom bs4 import BeautifulSoupimport pandas as pdimport nest_asyncioJOB_DESCRIPTIONS ="""<Option title="AI Research Scientist" number="1"> <PrimaryFocus> Investigate and adapt cutting-edge AI methodologies (e.g., generative models, advanced prompt engineering) for applications. </PrimaryFocus> <KeyResponsibilities> Conduct experiments to evaluate the performance (e.g., quality, accuracy) of new AI approaches and refine existing models. Collaborate with AI/ML Engineers to transition successful prototypes into production. Stay current with the latest AI research and emerging trends in generative AI. 
Develop human-annotated datasets for training and evaluation of AI models. </KeyResponsibilities> <SkillsAndTools> Deep understanding of LLMs and prompt engineering. Strong background in statistics, optimization, or related fields. Knowledge of experimental methods (e.g., A/B testing) and hypothesis testing. Knowledge of LLM evaluation methods, including algorithmic evals, human evals, or LLM-as-a-judge evals. </SkillsAndTools></Option><Option title="AI/ML Engineer" number="2"> <PrimaryFocus> Transform research output into robust, scalable AI solutions for the product or internal use. </PrimaryFocus> <KeyResponsibilities> Productionize AI models, ensuring they meet performance and reliability requirements. Develop and maintain data pipelines for model training, inference, and monitoring. Collaborate closely with Research Scientists to optimize and refine model implementations. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, Go, or similar languages. Experience with API development and integration (REST, GraphQL). Working knowledge of software engineering best practices (version control, testing, CI/CD). </SkillsAndTools></Option><Option title="MLOps / AI Infrastructure Engineer" number="3"> <PrimaryFocus> Ensure reliable deployment, scaling, and monitoring of AI systems in production. </PrimaryFocus> <KeyResponsibilities> Set up CI/CD pipelines tailored for AI workflows, including model versioning and data governance. Monitor production models for performance, latency, and data drift, implementing necessary updates. Manage infrastructure for scalable AI deployments (Docker, Kubernetes, cloud services). </KeyResponsibilities> <SkillsAndTools> Strong DevOps background, with tools like Docker, Kubernetes, and Terraform. Familiarity with ML orchestration/monitoring tools (MLflow, Airflow, Prometheus). Experience optimizing compute usage (GPU/TPU) for cost-effective scaling. </SkillsAndTools></Option><Option title="AI Solution Architect" number="4"> <PrimaryFocus> Design and orchestrate AI solutions leveraging generative models and LLM technologies to create impactful experiences and solutions that align with business objectives. </PrimaryFocus> <KeyResponsibilities> Collaborate with subject matter experts (SMEs) to identify and refine opportunities for generative AI/LLM-based use cases. Assess feasibility and define high-level solution architectures, ensuring they address core business and user requirements. Develop technical proposals and roadmaps, translating complex requirements into actionable plans. Provide thought leadership on conversational design, user experience flow, and model interaction strategies. Ensure solutions comply with relevant data governance, privacy, and security considerations. Facilitate cross-functional collaboration, guiding teams through solution conceptualization and implementation phases. </KeyResponsibilities> <SkillsAndTools> Strong understanding of LLM capabilities and prompt engineering principles. Experience with conversational experience design (e.g., chatbots, voice interfaces) and user journey mapping. Ability to analyze business needs and translate them into feasible AI solution proposals. Familiarity with data privacy and security best practices, especially as they pertain to AI solutions. Excellent communication and stakeholder management skills to align technical and non-technical teams. 
</SkillsAndTools></Option><Option title="Data Scientist" number="5"> <PrimaryFocus> Leverage statistical analysis, machine learning, and data visualization to derive actionable insights and guide data-informed decisions. </PrimaryFocus> <KeyResponsibilities> Perform exploratory data analysis (EDA) to identify trends and patterns in large, complex datasets. Develop and validate predictive and prescriptive models, collaborating with cross-functional teams to implement these solutions. Design and execute experiments to test hypotheses, measure impact, and inform business strategies. Present findings and recommendations to stakeholders in a clear, concise manner using visualizations and dashboards. Work with data engineers to ensure data quality, governance, and availability. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, R, or SQL for data manipulation and analysis. Experience with common ML libraries (e.g., scikit-learn, XGBoost) and deep learning frameworks (e.g., PyTorch, TensorFlow). Solid grounding in statistics, probability, and experimental design. Familiarity with data visualization tools (e.g., Tableau, Power BI) for communicating insights. Strong analytical thinking and ability to translate complex data problems into business solutions. </SkillsAndTools> </Option><Option title="Data Engineer" number="6"> <PrimaryFocus> Design, build, and maintain scalable data pipelines and architectures that enable efficient data collection, storage, and analysis. </PrimaryFocus> <KeyResponsibilities> Develop and optimize data ingestion and transformation processes (ETL/ELT), ensuring high performance and reliability. Implement and manage data workflows, integrating internal and external data sources. Collaborate with Data Scientists, AI/ML Engineers, and other stakeholders to ensure data readiness for analytics and model training. Monitor data pipelines for performance, reliability, and cost-effectiveness, taking corrective actions when needed. Maintain data quality and governance standards, including metadata management and data cataloging. </KeyResponsibilities> <SkillsAndTools> Proficiency in Python, SQL, and distributed data processing frameworks (e.g., Spark, Kafka). Experience with cloud-based data ecosystems (AWS, GCP, or Azure), and related storage/processing services (e.g., S3, BigQuery, Dataflow). Familiarity with infrastructure-as-code and DevOps tools (Terraform, Docker, Kubernetes) for automating data platform deployments. Strong understanding of database systems (relational, NoSQL) and data modeling principles. Knowledge of data orchestration and workflow management tools (Airflow, Luigi, Dagster). </SkillsAndTools></Option><Option title="Product Manager" number="7"> <PrimaryFocus> Drive the product vision and strategy, ensuring alignment with business goals and user needs while delivering impactful AI-driven solutions. </PrimaryFocus> <KeyResponsibilities> Conduct user and market research to identify opportunities, define product requirements, and set success metrics. Collaborate with cross-functional teams (Engineering, Data Science, Design) to prioritize features and plan releases. Develop and communicate product roadmaps, ensuring stakeholders are aligned on goals and timelines. Monitor product performance through data analysis and user feedback, iterating on improvements and new feature ideas. Facilitate agile development practices, writing clear user stories and acceptance criteria. 
</KeyResponsibilities> <SkillsAndTools> Strong understanding of product lifecycle management and agile methodologies (Scrum/Kanban). Excellent communication, negotiation, and stakeholder management skills. Experience with product management and collaboration tools (e.g., Jira, Confluence, Trello). Analytical mindset for leveraging metrics, A/B testing, and user feedback in decision-making. Familiarity with AI/ML concepts and the ability to translate technical possibilities into viable product features. </SkillsAndTools></Option><Option title="Software Engineer" number="8"> <PrimaryFocus> Design, develop, and maintain high-quality software applications and services that address user needs and align with overall business objectives. </PrimaryFocus> <KeyResponsibilities> Collaborate with cross-functional teams (Product, Design, QA) to interpret requirements and deliver robust solutions. Write clean, efficient, and testable code following best practices and coding standards. Participate in system architecture and design discussions, contributing to the evolution of technical roadmaps. Perform code reviews and provide constructive feedback to peers, maintaining a high bar for code quality. Implement and maintain CI/CD pipelines to streamline deployment and reduce manual interventions. Continuously improve system performance and scalability through profiling and optimization. </KeyResponsibilities> <SkillsAndTools> Proficiency in one or more programming languages (e.g., Java, Python, JavaScript, C++). Experience with modern frameworks/libraries (e.g., Spring Boot, Node.js, React, Django). Solid understanding of software design principles (e.g., SOLID, DRY) and architectural patterns (e.g., microservices). Familiarity with version control systems (Git), testing frameworks, and agile methodologies. Working knowledge of containerization (Docker), orchestration (Kubernetes), and cloud platforms (AWS, Azure, GCP). </SkillsAndTools></Option>"""PROMPTS = {"prompt": """You will be given a list of Responsibilities andSkills listed for a job. 
Your task is to determineif the job is a good fit with any of the Options,and if so, which one.<Job><Responsibilities>{responsibilities}</Responsibilities><Skills>{skills}</Skills></Job><Options>{Options}</Options>Start by thinking step-by-step about the Job and its Responsibilities and Skills, in relation to each of the Options.Decide if the Job is a good fit with ANY of the Options.If NONE of the Options are relevant to the Job, say so and return a 0 as your FinalAnswer.Otherwise, decide which of the Options is the most similarto the Job and return its number as your FinalAnswer.Return your response in the following format:<Analysis>[Your analysis of the job and its Responsibilities and Skills, in relation each of the Options]</Analysis><FinalAnswer>0|1|2|3|4|5|6|7|8</FinalAnswer>"""}# Enable nested event loopsnest_asyncio.apply()@dataclassclass JobData:"""Represents a job posting with extracted information.""" filename: str responsibilities: List[str] skills: List[str]@dataclassclass ClassificationResult:"""Represents the result of a job classification.""" filename: str responsibilities: List[str] skills: List[str] analysis: str role_classification: Optional[int] role_title: Optional[str]class JobClassifier:"""Handles classification of jobs into predefined roles."""def__init__(self, output_dir: str='role_classifications', batch_size: int=3):self.output_dir = output_dirself.batch_size = batch_sizeself.semaphore = asyncio.Semaphore(batch_size) os.makedirs(output_dir, exist_ok=True)asyncdef classify_job(self, job: JobData) -> ClassificationResult:"""Classify a single job listing.""" logging.info(f"Classifying job '{job.filename}'") prompt = PROMPTS["prompt"].format( responsibilities=job.responsibilities, # Access as attribute instead of dict skills=job.skills, # Access as attribute instead of dict Options=JOB_DESCRIPTIONS )try: response =await get_completion(prompt)await asyncio.sleep(5) # Rate limitingreturnself._parse_response(job, response)exceptExceptionas e: logging.error(f"Failed to classify job '{job.filename}': {str(e)}")return ClassificationResult( filename=job.filename, responsibilities=job.responsibilities, skills=job.skills, analysis=f'Failed to process: {str(e)}', role_classification=None, role_title=None )def _parse_response(self, job: JobData, response: str) -> ClassificationResult:"""Parse LLM response into structured format.""" soup = BeautifulSoup(f"<root>{response}</root>", 'lxml-xml') analysis = soup.find('Analysis') role_choice = soup.find('FinalAnswer') role_number =int(role_choice.text.strip()) if role_choice elseNone role_title =self._get_role_title(role_number)return ClassificationResult( filename=job.filename, responsibilities=job.responsibilities, skills=job.skills, analysis=analysis.text.strip() if analysis else'', role_classification=role_number, role_title=role_title )def _get_role_title(self, role_number: Optional[int]) -> Optional[str]:"""Get the title for a role number."""if role_number isNone:returnNoneif role_number ==0:return"Other" wrapped_xml =f"<root>{JOB_DESCRIPTIONS}</root>" job_descriptions_soup = BeautifulSoup(wrapped_xml, 'lxml-xml') matching_job = job_descriptions_soup.find('Option', {'number': str(role_number)})if matching_job:return matching_job['title'] logging.error(f"No matching job found for role number {role_number}")returnNoneclass JobProcessor:"""Handles the processing of job data files."""def__init__(self, input_dir: str='json_extracted_data'):self.input_dir = input_dirself.classifier = JobClassifier()def load_jobs(self) -> 
List[JobData]:"""Load jobs from JSON files.""" jobs = []# Get list of all JSON files total_files =len([f for f in os.listdir(self.input_dir) if f.endswith('.json')])# Get files that were classified as GenAI-relevant classified_files =self._get_classified_files()# Get files that have already been processed processed_files = { f[:-5] for f in os.listdir(self.classifier.output_dir) if f.endswith('.json') }# Get relevant files that haven't been processed yet files_to_process = classified_files - processed_files logging.info(f"Found {total_files} total files") logging.info(f"Found {len(classified_files)} relevant GenAI files") logging.info(f"Already processed: {len(processed_files)} files") logging.info(f"Remaining to process: {len(files_to_process)} files")# Only process files that are both relevant and unprocessedfor filename in os.listdir(self.input_dir):ifnot filename.endswith('.json'):continue name = filename[:-5]if name notin files_to_process:continuetry: job =self._load_job_file(filename)if job: jobs.append(job)exceptExceptionas e: logging.error(f"Error processing {filename}: {str(e)}") logging.info(f"Processing {len(jobs)} remaining jobs")return jobsdef _load_job_file(self, filename: str) -> Optional[JobData]:"""Load and parse a single job file.""" file_path = os.path.join(self.input_dir, filename)try:withopen(file_path, 'r') as f: data = json.load(f)if'responsibilities'in data and'skills'in data:return JobData( filename=filename[:-5], responsibilities=data['responsibilities'], skills=data['skills'] )except json.JSONDecodeError: logging.error(f"Invalid JSON in {filename}")returnNonedef _get_classified_files(self) -> Set[str]:"""Get set of files that have been previously classified as GenAI-related.""" genai_dir ='role_genai_classifications'ifnot os.path.exists(genai_dir):returnset() genai_files =set()for f in os.listdir(genai_dir):if f.endswith('.json'):try:withopen(os.path.join(genai_dir, f), 'r') asfile: data = json.load(file)if data.get('is_genai_role') isTrue: # Explicitly check for True genai_files.add(f[:-5])exceptExceptionas e: logging.error(f"Error reading {f}: {e}")return genai_filesasyncdef process_jobs(self) ->None:"""Process all jobs.""" jobs =self.load_jobs()ifnot jobs:return logging.info(f"Processing {len(jobs)} relevant jobs")awaitself._process_jobs_batch(jobs) logging.info("Job classification complete")asyncdef _process_jobs_batch(self, jobs: List[JobData]) ->None:"""Process a batch of jobs concurrently."""asyncdef process_with_semaphore(job: JobData) ->None:asyncwithself.classifier.semaphore: result =awaitself.classifier.classify_job(job)self._save_result(result)await asyncio.gather(*[process_with_semaphore(job) for job in jobs])def _save_result(self, result: ClassificationResult) ->None:"""Save classification result to file.""" filename = os.path.join(self.classifier.output_dir, f"{result.filename}.json")withopen(filename, 'w') as f: json.dump(vars(result), f, indent=2) logging.info(f"Saved classification for job {result.filename}")def classify_job_roles(df: pd.DataFrame) ->None:"""Main entry point for job classification."""if df.empty: logging.warning("Empty DataFrame provided")return processor = JobProcessor() loop = asyncio.get_event_loop() loop.run_until_complete(processor.process_jobs())``````{python}import logginglogging.basicConfig(level=logging.INFO)loader = JobDataLoader()jobs = loader.read_json_files()df = pd.DataFrame(jobs)classify_job_roles(df)```# Load final dataNow I'll load and combine all of the different data outputs into a single dataframe.The code 
# Load final data

Now I'll load and combine all of the different data outputs into a single DataFrame. The code block below uses a `DataLoader` class, with helper classes `DataPaths` and `SalaryProcessor`, to handle this step. It merges the job posting data with the role classifications, extracts salary ranges from the raw salary strings (converting CAD to USD), and filters the results down to the four AI-related roles of interest: AI Research Scientist, AI Solution Architect, AI/ML Engineer, and MLOps/AI Infrastructure Engineer. It includes error handling for JSON file loading and salary parsing, and the final output is a pandas DataFrame containing job details, role classifications, and standardized salary information.

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for loading and merging all data sources"
from typing import List, Tuple, Optional, Dict
import pandas as pd
import json
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataPaths:
    """Configuration for data file paths"""
    ROLE_CLASSIFICATIONS_DIR: Path = Path('role_classifications')
    JSON_DATA_DIR: Path = Path('json_data')


class DataLoader:
    def __init__(self, paths: DataPaths = DataPaths()):
        self.paths = paths
        self.salary_processor = SalaryProcessor()

    def load_and_merge_data(self) -> pd.DataFrame:
        """Load and merge all data sources"""
        # Load and merge base data
        merged_df = self._load_base_data()
        # Process salaries
        salary_data = self.salary_processor.extract_salary_ranges(merged_df['salary'])
        merged_df = pd.concat([merged_df, salary_data], axis=1)
        # Filter out any jobs not in this list
        merged_df = merged_df[merged_df['role_title'].isin(
            ['AI Research Scientist',
             'AI Solution Architect',
             'AI/ML Engineer',
             'MLOps / AI Infrastructure Engineer']
        )]
        print(f"Shape of merged DataFrame: {merged_df.shape}")
        return merged_df

    def _load_base_data(self) -> pd.DataFrame:
        """Load and merge base data sources"""
        # Load role classifications
        role_data = pd.DataFrame(self._load_json_files(self.paths.ROLE_CLASSIFICATIONS_DIR))
        # Load and process jobs data
        jobs_data = pd.DataFrame(self._load_json_files(self.paths.JSON_DATA_DIR))
        jobs_data['filename'] = jobs_data['url'].apply(
            lambda x: f"j{hashlib.md5(x.encode()).hexdigest()[:5]}"
        )
        # Merge dataframes
        merged_df = pd.merge(jobs_data, role_data, on='filename', how='left')
        return merged_df

    def _load_json_files(self, directory: Path) -> List[Dict]:
        """Load JSON files from directory"""
        json_files = []
        for file_path in directory.glob('*.json'):
            try:
                with open(file_path, 'r') as f:
                    json_files.append(json.load(f))
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
        return json_files


class SalaryProcessor:
    """Handles salary-related processing"""
    CAD_PER_USD = 1.44  # 1 USD = 1.44 CAD

    def extract_salary_ranges(self, salary_series: pd.Series) -> pd.DataFrame:
        """Extract salary ranges from a series of salary strings"""
        salary_data = salary_series.apply(self._extract_salary_range)
        return pd.DataFrame({
            'min_salary': salary_data.apply(lambda x: x[0]),
            'max_salary': salary_data.apply(lambda x: x[1]),
            'mid_salary': salary_data.apply(
                lambda x: (x[0] + x[1]) / 2 if x[0] and x[1] else None
            )
        })

    def _extract_salary_range(self, salary_str: str) -> Tuple[Optional[float], Optional[float]]:
        """Extract minimum and maximum salary from salary string"""
        try:
            if not isinstance(salary_str, str) or '0+' in salary_str:
                return None, None
            # Determine currency and strip currency/unit markers
            is_cad = 'CAD' in salary_str
            nums = (salary_str.replace('CAD ', '')
                              .replace('USD ', '')
                              .replace('K', '')
                              .split(' - '))
            if len(nums) != 2:
                return None, None
            # Convert to float; CAD figures are divided by the
            # CAD-per-USD rate to express them in USD
            min_salary = float(nums[0])
            max_salary = float(nums[1])
            if is_cad:
                min_salary /= self.CAD_PER_USD
                max_salary /= self.CAD_PER_USD
            return min_salary, max_salary
        except Exception:
            return None, None
```
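Before loading everything, it's worth spot-checking the salary parser. The strings below are illustrative, assuming the `'USD 150K - 200K'` / `'CAD 144K - 288K'` patterns the parser expects; anything it can't split into two numbers should come back as `(None, None)`:

```{python}
# Spot-check the salary parser on illustrative strings.
# (Formats are assumed to match the scraped data; examples are made up.)
sp = SalaryProcessor()
print(sp._extract_salary_range('USD 150K - 200K'))  # (150.0, 200.0)
print(sp._extract_salary_range('CAD 144K - 288K'))  # (100.0, 200.0) after CAD -> USD
print(sp._extract_salary_range('Competitive'))      # (None, None)
```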
```{python}
# Initialize the loader and load the data, caching the result to CSV
if not os.path.exists('genai_job_data.csv'):
    loader = DataLoader()
    df = loader.load_and_merge_data()
    df.to_csv('genai_job_data.csv', index=False)
else:
    df = pd.read_csv('genai_job_data.csv')

# Now df contains all the merged data with processed salaries
print(f"Total jobs loaded: {len(df)}")

# View available columns
print("\nAvailable columns:")
print(df.columns.tolist())

# View distribution of roles
print("\nRole distribution:")
print(df['role_title'].value_counts())
```

# Salary analysis

First I'll analyze the salary data for each role and experience level, using a RandomForest model to adjust for location differences (CAD-denominated salaries were already converted to USD during loading). I'm interested to see what compensation is like for each role, and whether there are any significant differences in compensation between the roles.

The code below implements the salary analysis. For each role and level combination, it removes statistical outliers with an IQR filter, computes location-adjusted z-scores by standardizing the residuals of a RandomForest model fit on one-hot location indicators, and reports median salary ranges and sample sizes. Analysis parameters live in a `SalaryConfig` dataclass, and results are formatted as salary ranges in thousands of USD.
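The location adjustment is worth unpacking: the RandomForest regresses salary on one-hot location dummies, so its prediction is the location-expected salary, and standardizing the residuals tells us whether a posting pays above or below the norm for its market. Here are the mechanics in miniature, with made-up numbers:

```{python}
# The core of the location adjustment, in miniature with toy data:
# regress salary on one-hot location dummies, then standardize the
# residuals so salaries are comparable across markets.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

toy = pd.DataFrame({
    'location': ['SF', 'SF', 'Austin', 'Austin', 'NYC', 'NYC'],
    'salary':   [220, 240, 160, 170, 200, 210],
})
X = pd.get_dummies(toy['location'], prefix='loc')
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, toy['salary'])
residuals = toy['salary'] - model.predict(X)
zscores = (residuals - residuals.mean()) / residuals.std()
print(zscores.round(2))  # which postings pay above/below their local norm
```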
```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for salary analysis"
from dataclasses import dataclass
from typing import Dict, List, Optional, TypedDict, Set
import logging

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


@dataclass
class SalaryConfig:
    """Configuration settings for salary analysis"""
    outlier_threshold: float = 1.5
    n_estimators: int = 100
    min_group_size: int = 3
    currency_symbol: str = '$'
    cv_folds: int = 5


class SalaryColumns(TypedDict):
    """Type definitions for salary DataFrame columns"""
    min_salary_usd_k: float
    max_salary_usd_k: float
    median_salary_usd_k: float
    sample_size: int
    location_adjusted_zscore: float


class SalaryAnalyzer:
    """Analyzes salary distributions across role titles.

    Column Naming Conventions:
    - *_usd_k: Values in thousands of USD
    - *_zscore: Standardized scores (mean=0, std=1)
    - sample_size: Number of data points in group
    - *_median: Median values
    - location_adjusted_*: Values adjusted for location differences

    Attributes:
        config: Configuration settings for analysis
        logger: Logger instance for the class
    """
    # Input column names
    COL_MIN_SALARY = 'min_salary'
    COL_MAX_SALARY = 'max_salary'
    COL_ROLE_TITLE = 'role_title'
    COL_LEVEL = 'level'
    COL_LOCATION = 'location'

    # Processed column names
    COL_MEDIAN_SALARY = 'median_salary_usd_k'
    COL_MIN_SALARY_MEDIAN = 'min_salary_median_usd_k'
    COL_MAX_SALARY_MEDIAN = 'max_salary_median_usd_k'
    COL_SAMPLE_SIZE = 'sample_size'

    # Z-score columns
    COL_MIN_ZSCORE = 'min_salary_location_adjusted_zscore'
    COL_MID_ZSCORE = 'median_salary_location_adjusted_zscore'
    COL_MAX_ZSCORE = 'max_salary_location_adjusted_zscore'

    def __init__(self, config: Optional[SalaryConfig] = None):
        """Initialize SalaryAnalyzer with optional configuration."""
        self.config = config or SalaryConfig()
        self.logger = logging.getLogger(__name__)
        self._model = None  # Cache for trained model

    def analyze_role_salaries(self, df: pd.DataFrame) -> pd.DataFrame:
        """Analyze salary distributions across role titles with location adjustment.

        Args:
            df: DataFrame containing columns:
                - min_salary: Minimum salary (float)
                - max_salary: Maximum salary (float)
                - role_title: Job role (str)
                - level: Experience level (str)
                - location: Job location (str)

        Returns:
            DataFrame with aggregated salary statistics by role and level

        Raises:
            ValueError: If required columns are missing or DataFrame is empty
        """
        self._validate_input(df)
        self.logger.info(f"Starting salary analysis with {len(df)} records")
        # Create working copy of DataFrame
        analysis_df = self._prepare_data(df)
        # Process and clean data
        analysis_df = self._process_data(analysis_df)
        # Generate final results
        results = self._aggregate_results(analysis_df)
        self.logger.info("Salary analysis completed successfully")
        return results

    def _validate_input(self, df: pd.DataFrame) -> None:
        """Validate input DataFrame."""
        if df is None or df.empty:
            raise ValueError("Input DataFrame cannot be None or empty")
        required_cols = {
            self.COL_MIN_SALARY,
            self.COL_MAX_SALARY,
            self.COL_ROLE_TITLE,
            self.COL_LEVEL,
            self.COL_LOCATION
        }
        missing_cols = required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"DataFrame missing required columns: {missing_cols}")

    def _prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Prepare data for analysis by filtering valid salaries."""
        valid_salary_mask = (
            df[self.COL_MIN_SALARY].notna() &
            df[self.COL_MAX_SALARY].notna()
        )
        if not valid_salary_mask.any():
            self.logger.warning("No valid salary data found")
            return pd.DataFrame()
        # Create working copy and calculate median salary
        analysis_df = df[valid_salary_mask].copy()
        analysis_df[self.COL_MEDIAN_SALARY] = (
            analysis_df[[self.COL_MIN_SALARY, self.COL_MAX_SALARY]].mean(axis=1)
        )
        return analysis_df

    def _remove_outliers(self, group: pd.DataFrame) -> pd.DataFrame:
        """Remove statistical outliers from salary data."""
        if len(group) <= self.config.min_group_size:
            return group
        # Only check min and max salary for outliers
        salary_cols = [self.COL_MIN_SALARY, self.COL_MAX_SALARY]
        for col in salary_cols:
            Q1 = group[col].quantile(0.25)
            Q3 = group[col].quantile(0.75)
            IQR = Q3 - Q1
            outlier_mask = ~(
                (group[col] < (Q1 - self.config.outlier_threshold * IQR)) |
                (group[col] > (Q3 + self.config.outlier_threshold * IQR))
            )
            group = group[outlier_mask]
        # Recalculate median after removing outliers
        group[self.COL_MEDIAN_SALARY] = (
            group[self.COL_MIN_SALARY] + group[self.COL_MAX_SALARY]
        ) / 2
        return group
    def _aggregate_results(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate and format the final results."""
        # First calculate group statistics
        grouped = df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL]).agg({
            self.COL_MIN_SALARY: ['count', 'median'],
            self.COL_MAX_SALARY: 'median',
        }).round(0)
        # Calculate median salary from min and max medians
        grouped['median_salary'] = (
            grouped[(self.COL_MIN_SALARY, 'median')] +
            grouped[(self.COL_MAX_SALARY, 'median')]
        ) / 2
        # Flatten and rename columns
        grouped.columns = [
            'sample_size' if col == (self.COL_MIN_SALARY, 'count')
            else 'min_salary_usd_k' if col == (self.COL_MIN_SALARY, 'median')
            else 'max_salary_usd_k' if col == (self.COL_MAX_SALARY, 'median')
            else 'median_salary_usd_k' if col[0] == 'median_salary'
            else col
            for col in grouped.columns
        ]
        # Order columns
        ordered_cols = [
            'sample_size',
            'min_salary_usd_k',
            'max_salary_usd_k',
            'median_salary_usd_k'
        ]
        result = grouped.reindex(columns=ordered_cols)
        # Format salary values
        salary_cols = [col for col in ordered_cols if col.endswith('_usd_k')]
        for col in salary_cols:
            result[col] = result[col].apply(
                lambda x: f"{self.config.currency_symbol}{x:,.0f}K"
                if pd.notna(x) else "N/A"
            )
        return result

    def _process_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process and clean the salary data."""
        # Remove outliers by group
        df = (df.groupby([self.COL_ROLE_TITLE, self.COL_LEVEL])
                .apply(self._remove_outliers)
                .reset_index(drop=True))
        # Add location-adjusted scores
        df = self._add_location_adjusted_scores(df)
        # Clean level names
        df[self.COL_LEVEL] = (df[self.COL_LEVEL]
                              .str.replace('-Level / Level', '', regex=False)
                              .str.replace('-level', '', regex=False))
        return df

    def _train_model(self) -> RandomForestRegressor:
        """Create and cache the RandomForest model."""
        if self._model is None:
            self._model = RandomForestRegressor(
                n_estimators=self.config.n_estimators,
                random_state=42
            )
        return self._model

    def _add_location_adjusted_scores(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add location-adjusted z-scores to the dataframe."""
        location_dummies = pd.get_dummies(df[self.COL_LOCATION], prefix='loc')
        model = self._train_model()
        score_columns = {
            self.COL_MIN_SALARY: self.COL_MIN_ZSCORE,
            self.COL_MEDIAN_SALARY: self.COL_MID_ZSCORE,
            self.COL_MAX_SALARY: self.COL_MAX_ZSCORE
        }
        for source_col, target_col in score_columns.items():
            df[target_col] = self._adjust_salaries(
                df[source_col], location_dummies, model
            )
        return df

    def _adjust_salaries(self, salary_series: pd.Series,
                         X: pd.DataFrame,
                         model: RandomForestRegressor) -> pd.Series:
        """Adjust salaries using the RandomForest model."""
        # Evaluate model performance
        scores = cross_val_score(
            model, X, salary_series, cv=self.config.cv_folds
        )
        self.logger.debug(
            f"Cross-validation scores: {scores.mean():.3f} ± {scores.std():.3f}"
        )
        # Fit model and calculate residuals
        model.fit(X, salary_series)
        expected = model.predict(X)
        residuals = salary_series - expected
        # Return standardized residuals
        return (residuals - residuals.mean()) / residuals.std()
```

```{python}
salary_analyzer = SalaryAnalyzer()
salary_df = salary_analyzer.analyze_role_salaries(df)
salary_df
```

Looking across all four GenAI roles (AI Research, Solutions Architecture, ML Engineering, and MLOps), compensation bands are similar. At senior levels (where I have the largest samples), median salaries cluster tightly between ~$195K and ~$210K USD.
Mid-level positions show medians between ~$165K and ~$180K USD, and entry-level positions generally start between ~$155K and ~$205K USD, although the entry-level data is limited by small samples. This consistency suggests these roles are valued similarly in the market, despite their different focuses.

# Job responsibilities

Next, I'm interested to see what the job responsibilities are for some of these roles, and how they differ. I'll use an LLM to identify the most common responsibilities and skills for a given role. This task is quite complex and involves a lot of input data, so a more capable model like `o1` is appropriate. I don't have API access to `o1`, so I'll use the web interface to generate completions, preparing XML files of the inputs so that I can copy-paste them into the chat.

As we'll see, Research Scientists lean heavily into discovery, theoretical advances, publication, and cutting-edge experimentation, whereas ML Engineers center on production-grade systems, robust architecture, MLOps, and alignment with business needs. In many cases, though, the lines between these roles are blurred.

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for XML generation"
import logging
import os
from typing import Dict

import pandas as pd

INSTRUCTION_PREFIX = """Identify the most common Responsibilities and Skills listed in the jobs below, returning a bulleted list in the format

# Responsibilities
- [responsibility]: [responsibility description]
- [responsibility]: [responsibility description]
...

# Skills
- [skill]: [skill description]
- [skill]: [skill description]
...

"""


class XMLGenerator:
    """Handles generation of XML files from job data"""

    def __init__(self, output_dir: str = 'xml_roles'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )

    def compile_xml_for_role(self, level: str, role_title: str, df: pd.DataFrame) -> Dict:
        """Generate XML-formatted string of job details for the given level and role title."""
        # Filter DataFrame for matching level and role
        filtered_df = df[
            (df['level'] == level) &
            (df['role_title'] == role_title)
        ]
        if len(filtered_df) == 0:
            return {
                'level': level,
                'role_title': role_title,
                'xml_output': '',
                'error': f"No jobs found for {level} {role_title}"
            }
        # Build XML string
        xml_output = ""
        for _, job in filtered_df.iterrows():
            xml_output += "<Job>\n"
            xml_output += f"<Responsibilities>{job['responsibilities']}</Responsibilities>\n"
            xml_output += f"<Skills>{job['skills']}</Skills>\n"
            xml_output += "</Job>\n"
        return {
            'level': level,
            'role_title': role_title,
            'xml_output': INSTRUCTION_PREFIX + xml_output
        }

    def save_xml(self, level: str, role_title: str, xml_data: Dict) -> None:
        """Save XML data to a file"""
        if not xml_data['xml_output']:
            # logging.warning(xml_data.get('error', 'Empty XML output'))
            return
        # Clean and sanitize the filename components
        cleaned_level = level.replace('-Level / Level', '').strip()
        safe_level = cleaned_level.replace('/', '_').replace(' ', '_')
        safe_role = role_title.replace('/', '_').replace(' ', '_')
        filename = os.path.join(self.output_dir, f"{safe_level}_{safe_role}.xml")
        with open(filename, 'w') as f:
            f.write(xml_data['xml_output'])
        logging.debug(f"Saved XML for {cleaned_level} {role_title} to {filename}")

    def generate_all_xml(self, df: pd.DataFrame) -> None:
        """Generate XML files for all role/level combinations"""
        if df.empty:
            logging.warning("No jobs found in DataFrame - check if data was loaded correctly")
            return
        required_cols = ['level', 'role_title', 'responsibilities', 'skills']
        if not all(col in df.columns for col in required_cols):
            logging.error(f"DataFrame missing required columns: {required_cols}")
            return
        # Process each unique role/level combination
        role_combinations = df[['level', 'role_title']].drop_duplicates()
        for _, combo in role_combinations.iterrows():
            xml_data = self.compile_xml_for_role(
                combo['level'], combo['role_title'], df
            )
            self.save_xml(combo['level'], combo['role_title'], xml_data)
```

```{python}
xml_generator = XMLGenerator()
xml_generator.generate_all_xml(df)
```
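Each generated file is just the instruction header followed by the filtered postings serialized as `<Job>` elements. As a quick check of that structure, the sketch below previews the start of whichever file the generator wrote first:

```{python}
# Peek at the start of one generated XML file to confirm its structure:
# the instruction header, then one <Job> element per posting.
from glob import glob

xml_files = sorted(glob('xml_roles/*.xml'))
if xml_files:
    with open(xml_files[0], 'r') as f:
        print(f.read()[:400])
```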
## o1-generated Senior AI Research Scientist

Here's what `o1` said about the Responsibilities and Skills observed in Senior AI Research Scientist roles:

### Responsibilities

- **Conduct advanced AI research**: Many roles require pushing the state of the art in Generative AI, LLMs, and related areas (e.g., video/multimodal models, diffusion models) through novel algorithms, architectures, and experimental studies.
- **Train and fine-tune large-scale models**: Commonly involves working with massive datasets and distributed training setups (thousands of GPUs, HPC environments) to develop foundation models and advanced AI systems.
- **Develop and implement new algorithms or architectures**: Spans designing novel model architectures (e.g., diffusion, transformer-based, multimodal fusion) and creating robust data processing or simulation pipelines to support AI solutions.
- **Collaborate with cross-functional teams**: Emphasizes close work with engineering, product management, research, and external stakeholders to integrate AI breakthroughs into real-world applications and products.
- **Evaluate and measure AI performance**: Entails building rigorous evaluation frameworks, designing new metrics, and systematically analyzing model behavior to ensure quality and reliability.
- **Publish and communicate research findings**: Many positions highlight writing influential papers, presenting at conferences, and sharing innovative results both internally and with the broader AI community.
- **Build and maintain data pipelines**: Involves constructing high-quality, scalable data pipelines or tooling to support training, fine-tuning, and inference of large models.
- **Ensure production-grade implementation**: Requires writing clean, efficient, and maintainable code as well as optimizing models and pipelines to meet performance, reliability, and quality standards.

### Skills

- **Proficiency in Python and deep learning frameworks**: Strong coding skills in Python and hands-on experience with libraries such as PyTorch, TensorFlow, or JAX appear in nearly every role.
- **Expertise with LLMs and Generative AI**: Deep understanding of transformer architectures, diffusion models, multimodal systems, prompt engineering, and other advanced AI techniques is frequently mentioned.
- **Experience with large-scale/distributed training**: Many roles emphasize knowledge of HPC, GPU optimization, model parallelism (e.g., FSDP, DeepSpeed, Megatron-LM), and handling massive datasets.
- **Strong software engineering practices**: Testing, code review, debugging, version control, and producing clean, modular research or production code are consistently important.
- **Collaboration and communication skills**: Clear written and verbal communication, along with cross-functional teamwork, is vital for integrating AI solutions into products and relaying complex ideas.
- **Research acumen and adaptability**: Ability to read, interpret, and prototype cutting-edge AI literature,
publish findings, and rapidly iterate on experiments.
- **Machine Learning fundamentals**: Solid grounding in ML theory (e.g., optimization, statistics, data structures) and experience with model evaluation, data manipulation, and pipeline design.
- **Familiarity with prompt engineering and advanced NLP concepts**: Many roles highlight crafting effective prompts, aligning model outputs with user needs, and leveraging text-generation or conversational AI techniques.

## o1-generated Senior AI ML Engineer

And here's what `o1` said about the Responsibilities and Skills observed in Senior AI ML Engineer roles:

### Responsibilities

- **Design, develop, and deploy AI/ML solutions**: End-to-end creation of machine learning systems, from initial prototypes to production-ready deployments.
- **Collaborate with cross-functional teams**: Work closely with product managers, data scientists, engineers, and other stakeholders to align technical solutions with business goals.
- **Monitor and optimize model performance**: Track key metrics, fine-tune models, and iterate to ensure continuous improvement and reliability in production.
- **Stay current with AI research and emerging technologies**: Keep up to date with the latest breakthroughs in areas like LLMs, generative AI, and deep learning.
- **Mentor and coach team members**: Provide guidance on best practices, design patterns, code quality, and career development for junior or peer engineers.
- **Develop scalable data/ML pipelines**: Build robust infrastructure for data collection, preprocessing, model training, and deployment at scale.
- **Implement and maintain CI/CD and coding best practices**: Ensure code quality, streamline release processes, and enforce testing discipline for AI/ML components.
- **Integrate and leverage LLMs/generative AI**: Incorporate large language models or generative methods into products and workflows.
- **Prototype and experiment**: Conduct R&D, proof-of-concepts, and pilot programs to explore emerging AI techniques and validate new product ideas.
- **Document and communicate findings**: Produce clear technical documentation, share results with stakeholders, and provide actionable insights for decision-making.

### Skills

- **Proficiency in Python**: Commonly required for AI/ML development, data manipulation, and scripting.
- **Experience with ML/DL frameworks**: Hands-on expertise in tools like PyTorch, TensorFlow, or JAX for building and training models.
- **Familiarity with cloud platforms**: Working knowledge of AWS, GCP, or Azure for deploying and scaling AI solutions.
- **Expertise in LLMs/generative AI**: Understanding of transformer architectures, prompt engineering, retrieval-augmented generation (RAG), and related libraries.
- **Strong software engineering fundamentals**: Solid grasp of algorithms, data structures, design patterns, and best practices for production code.
- **Knowledge of MLOps and CI/CD**: Experience with containerization (Docker, Kubernetes), version control (Git), and automated testing/monitoring.
- **Data processing and SQL**: Skills in handling large datasets, working with Spark or similar frameworks, and writing performant SQL queries.
- **Effective communication and collaboration**: Ability to translate complex technical concepts for non-technical stakeholders and work well in diverse teams.
- **Problem-solving and debugging**: Track record of diagnosing issues in production environments and implementing reliable fixes.
- **Continuous learning mindset**: Eagerness to stay on top of new AI research, frameworks, and technologies to innovate and improve solutions.

# Job titles

What are the most common job titles for these roles?

```{python}
#| code-fold: true
#| code-fold-show: false
#| code-summary: "Click to view the code for aggregating titles"
def aggregate_titles_by_role_level(df: pd.DataFrame, output_dir: str = 'role_titles') -> None:
    """
    Aggregate job titles for each role_title and level combination and save to JSON.

    Args:
        df: DataFrame containing 'role_title', 'level', and 'title' columns
        output_dir: Directory to save the JSON output
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    # Group by role_title and level
    grouped = df.groupby(['role_title', 'level'])
    for (role, level), group in grouped:
        # Get unique titles for this combination
        result = {
            'titles': group['title'].unique().tolist()
        }
        # Create safe filename from role and level
        safe_role = role.replace('/', '_').replace(' ', '_')
        safe_level = level.replace('/', '_').replace(' ', '_')
        filename = f"{safe_level}_{safe_role}.json"
        # Save to JSON file
        with open(os.path.join(output_dir, filename), 'w') as f:
            json.dump(result, f, indent=2)
        print(f"Saved titles for {level} {role}")
```

```{python}
aggregate_titles_by_role_level(df)
```
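Each output file holds the unique raw titles for one role/level group, as a JSON object of the form `{"titles": [...]}`. A quick peek at one of them (whichever exists first; a sketch, not part of the pipeline):

```{python}
# Preview the first few raw titles from one aggregated JSON file.
import json
from glob import glob

title_files = sorted(glob('role_titles/*.json'))
if title_files:
    with open(title_files[0], 'r') as f:
        print(json.load(f)['titles'][:5])
```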
I'll use `o1` again to generate a summary of the most common titles for each role/level combination. In the prompt, I omit my own role classification so as not to bias the results, and I ask it to "identify the most common titles listed below (ignoring slight variations), and then for those titles identify how often they occurred."

Here's a sample of what it said.

For Senior AI Research Scientist roles, `o1` observed the following titles to be common to this role:

- Research Scientist/Researcher (14 occurrences)
- Research Engineer (7 occurrences)
- Machine Learning Engineer (3 occurrences)
- Generative AI Engineer (2 occurrences)
- Software Engineer (Generative AI) (2 occurrences)

For Senior AI ML Engineer roles, `o1` observed the following titles to be common to this role:

- Senior Software Engineer (6 occurrences)
- Senior Machine Learning Engineer (5 occurrences)
- Staff Software Engineer (5 occurrences)
- Senior AI Engineer (4 occurrences)
- Machine Learning Researcher (3 occurrences)
- Staff Machine Learning Engineer (2 occurrences)
- Principal Software Engineer (2 occurrences)