LLMpy: AI Utilities for working data scientists
LLMpy #
Beyond all the hype, large language models are a powerful tool in the modern data science toolkit.
llmpy (lumpy? yeah, let’s go with that) is a small utility package that makes using LLMs in data science work a little bit easier.
The workhorse of LLM-powered data science is the ability to take unstructured text data
and turn it into usable data, using either simple labels or fancy structured output schemas.
llmpy makes doing this in parallel, over a whole column of values, easy.
Get it from github.com/eointravers/llmpy.
Using it #
# Setup
import os
from dotenv import load_dotenv
from llmpy import OpenAIClient
from pydantic import BaseModel
load_dotenv()
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
llm_client = OpenAIClient(api_key=OPENAI_API_KEY, model="gpt-5.4-mini")
# Get some data
from datasets import load_dataset # huggingface
import itertools
import pandas as pd
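def get_n_rows_from_hf(dataset: str, n_rows: int) -> pd.DataFrame:
    # Stream the dataset rather than downloading it in full, then take the first n_rows
    ds = load_dataset(dataset, streaming=True, split="train")
    return pd.DataFrame(itertools.islice(ds, n_rows))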
review_df = get_n_rows_from_hf("Yelp/yelp_review_full", 20)
for txt in review_df['text'].iloc[:2]:
    print(txt)
dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.
Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.
Example 1: Simple text output #
system_prompt = """
Classify the following yelp review as either 'Positive', 'Negative', or 'Neutral'.
Output only the label.
""".strip()
# Classify a single value
one_result = llm_client.call(system_prompt=system_prompt, user_prompt=review_df["text"].iloc[0])
print(one_result)
Positive
# Classify everything in parallel
results = await llm_client.call_many(
    system_prompt=system_prompt,
    user_prompt=review_df["text"],
    max_requests_per_minute=100,
)
print(results)
['Positive', 'Negative', 'Positive', 'Negative', 'Negative', 'Positive', 'Positive', 'Negative', 'Negative', 'Neutral', 'Negative', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive']
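For intuition about what "in parallel" means here, this is a minimal sketch of the general pattern: fire off one async request per row, but cap how many are in flight at once. It is not llmpy's actual implementation, and `fake_llm_call` and `call_many_sketch` are hypothetical names made up for illustration. It also caps concurrency rather than requests per minute, which is a simpler stand-in for real rate limiting.

```python
import asyncio


async def fake_llm_call(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for a real API request."""
    await asyncio.sleep(0.01)  # pretend network latency
    return "Positive" if "good" in user_prompt.lower() else "Negative"


async def call_many_sketch(system_prompt, user_prompts, max_concurrent=5):
    """Run one call per prompt, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited_call(prompt):
        async with semaphore:
            return await fake_llm_call(system_prompt, prompt)

    # gather preserves input order, so results line up with the input rows
    return await asyncio.gather(*(limited_call(p) for p in user_prompts))


reviews = ["Good food", "Terrible service", "Really good value"]
labels = asyncio.run(call_many_sketch("Classify the review.", reviews))
print(labels)  # ['Positive', 'Negative', 'Positive']
```

In a notebook, where an event loop is already running, `await call_many_sketch(...)` would replace the `asyncio.run(...)` call.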
Example 2: Structured outputs #
# Classification restricting output to valid options
from typing import Literal
class Sentiment(BaseModel):
    value: Literal['Positive', 'Negative', 'Neutral']
structured_results = await llm_client.call_many(
    system_prompt=system_prompt,
    user_prompt=review_df["text"],
    max_requests_per_minute=100,
    response_format=Sentiment,
)
print(structured_results)
[Sentiment(value='Positive'), Sentiment(value='Negative'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Negative'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Negative'), Sentiment(value='Neutral'), Sentiment(value='Neutral'), Sentiment(value='Negative'), Sentiment(value='Negative'), Sentiment(value='Positive'), Sentiment(value='Negative'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Positive'), Sentiment(value='Positive')]
results = [r.value for r in structured_results]
print(results)
['Positive', 'Negative', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Negative', 'Neutral', 'Neutral', 'Negative', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive']
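The free-text run and the structured run don't give identical labels, which is worth checking before trusting either. A quick sketch that compares the two lists printed above (copied verbatim from the outputs; the variable names are just for illustration):

```python
free_text = ['Positive', 'Negative', 'Positive', 'Negative', 'Negative',
             'Positive', 'Positive', 'Negative', 'Negative', 'Neutral',
             'Negative', 'Negative', 'Positive', 'Negative', 'Positive',
             'Positive', 'Positive', 'Positive', 'Positive', 'Positive']
structured = ['Positive', 'Negative', 'Positive', 'Positive', 'Negative',
              'Positive', 'Positive', 'Negative', 'Neutral', 'Neutral',
              'Negative', 'Negative', 'Positive', 'Negative', 'Positive',
              'Positive', 'Positive', 'Positive', 'Positive', 'Positive']

# Collect (row index, free-text label, structured label) wherever the runs differ
disagreements = [
    (i, a, b) for i, (a, b) in enumerate(zip(free_text, structured)) if a != b
]
agreement = 1 - len(disagreements) / len(free_text)

print(disagreements)        # [(3, 'Negative', 'Positive'), (8, 'Negative', 'Neutral')]
print(f"{agreement:.0%}")   # 90%
```

Rows 3 and 8 are the ones to read by hand.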
# Richer structure
from pydantic import Field
class ReviewAnnotations(BaseModel):
    subject_name: str = Field(description="Name of the person or place being reviewed")
    subject_type: str | None = Field(description="What kind of person/place is being reviewed? Leave blank if unknown")
    sentiment: Literal['Positive', 'Negative', 'Neutral']
structured_results = await llm_client.call_many(
    system_prompt="Extract the information required",
    user_prompt=review_df["text"],
    max_requests_per_minute=100,
    response_format=ReviewAnnotations,
)
result_df = pd.DataFrame([r.model_dump() for r in structured_results])
result_df['text'] = review_df['text']
result_df
| | subject_name | subject_type | sentiment | text |
|---|---|---|---|---|
| 0 | dr. goldberg | general practitioner | Positive | dr. goldberg offers everything i look for in a... |
| 1 | Dr. Goldberg | doctor | Negative | Unfortunately, the frustration of being Dr. Go... |
| 2 | Dr. Goldberg | doctor | Positive | Been going to Dr. Goldberg for over 10 years. ... |
| 3 | Dr. Goldberg | doctor | Positive | Got a letter in the mail last week that said D... |
| 4 | Dr. Goldberg | doctor/office | Negative | I don't know what Dr. Goldberg was like before... |
| 5 | doctor | doctor | Positive | Top notch doctor in a top notch practice. Can'... |
| 6 | Dr. Eric Goldberg | doctor | Positive | Dr. Eric Goldberg is a fantastic doctor who ha... |
| 7 | Dr. Goldberg | Doctor | Negative | I'm writing this review to give you a heads up... |
| 8 | Wing sauce | food | Negative | Wing sauce is like water. Pretty much a lot of... |
| 9 | golf range | place | Neutral | Decent range somewhat close to the city. The ... |
| 10 | this place | driving range | Negative | Owning a driving range inside the city limits ... |
| 11 | This place | NaN | Negative | This place is absolute garbage... Half of the... |
| 12 | the range | place | Positive | I drove by yesterday to get a sneak peak. It ... |
| 13 | this store | store | Negative | After waiting for almost 30 minutes to trade i... |
| 14 | This place | place | Positive | This place was DELICIOUS!! My parents saw a r... |
| 15 | Fish Sandwich | food item | Positive | Can't miss stop for the best Fish Sandwich in ... |
| 16 | This place | restaurant | Positive | This place should have a lot more reviews - bu... |
| 17 | Old school | restaurant | Positive | Old school.....traditional \"mom 'n pop\" qual... |
| 18 | fish sandwich | food item | Positive | Good fish sandwich. |
| 19 | Emil's | restaurant | Positive | After a morning of Thrift Store hunting, a fri... |
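Free-text fields like `subject_type` come back with inconsistent casing ("doctor" vs "Doctor"), so some light normalisation is probably needed before grouping or counting. A minimal sketch, using a handful of values copied from the table above rather than the full results; `subject_type_clean` is a column name made up for this example:

```python
import pandas as pd

# A few subject_type values copied from the table above
result_df = pd.DataFrame(
    {
        "subject_type": [
            "general practitioner",
            "doctor",
            "Doctor",
            "doctor/office",
            "food item",
            None,
        ]
    }
)

# Strip whitespace and lowercase; missing values stay missing
result_df["subject_type_clean"] = result_df["subject_type"].str.strip().str.lower()

print(result_df["subject_type_clean"].value_counts())
```

This only fixes casing and whitespace. Merging near-synonyms like "doctor" and "general practitioner" would need a fixed set of options in the schema, as in the `Literal` example earlier.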
Example 3: Tagging #
soflow_df = get_n_rows_from_hf("mirzaei2114/stackoverflowVQA", 20)
soflow_df['Text'] = soflow_df['Title'] + '\n\n' + soflow_df['Body']
soflow_df = soflow_df[['Text', 'Tags']]
soflow_df.head()
| | Text | Tags |
|---|---|---|
| 0 | How do I calculate these statistics?\n\nI'm wr... | [statistics, spss] |
| 1 | Auto Generate Database Diagram MySQL\n\nI'm ti... | [mysql, database, database-design, diagram] |
| 2 | Plugin for Visual Studio to Mimic Eclipse's "O... | [visual-studio, plugins] |
| 3 | How to create a tree-view preferences dialog t... | [c#, user-interface] |
| 4 | Territory Map Generation\n\nIs there a trivial... | [language-agnostic, maps, voronoi] |
possible_tags = soflow_df['Tags'].explode().unique().tolist()
print(possible_tags)
['statistics', 'spss', 'mysql', 'database', 'database-design', 'diagram', 'visual-studio', 'plugins', 'c#', 'user-interface', 'language-agnostic', 'maps', 'voronoi', 'javascript', 'html', 'css', 'textarea', 'prototypejs', 'security', 'captcha', 'asp.net', 'apache-flex', 'actionscript-3', '.net', 'reflector', 'unit-testing', 'configuration', 'vsx', 'extensibility', 'asp.net-mvc', 'forms', 'exception', 'mvp', 'n-tier-architecture', 'eclipse', 'pdf', 'coldfusion', 'file', 'ftp', 'symlink', 'php', 'performance', 'caching', 'firefox', 'drop-down-menu', 'html-select', 'java', 'editor', 'vb.net', 'prettify']
class QuestionTags(BaseModel):
    tags: list[Literal[*possible_tags]] | None # type: ignore
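A note on `Literal[*possible_tags]`: star-unpacking inside square brackets needs Python 3.11 or later. On older versions, passing a tuple should give the same type, since subscripting with a tuple is how Python passes multiple arguments anyway. A small sketch with a shortened tag list (`TagLiteral` is just an illustrative name):

```python
from typing import Literal, get_args

possible_tags = ["statistics", "spss", "mysql"]

# Equivalent to Literal["statistics", "spss", "mysql"]
TagLiteral = Literal[tuple(possible_tags)]

print(get_args(TagLiteral))  # ('statistics', 'spss', 'mysql')
```

Static type checkers can't see inside a `Literal` built at runtime, which is why the original line carries a `# type: ignore`.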
tagging_results = await llm_client.call_many(
    "Select all the tags that apply to this question. Leave blank if none apply",
    soflow_df['Text'],
    response_format=QuestionTags,
)
soflow_df['inferred_tags'] = [r.tags for r in tagging_results]
soflow_df
| | Text | Tags | inferred_tags |
|---|---|---|---|
| 0 | How do I calculate these statistics?\n\nI'm wr... | [statistics, spss] | [statistics, spss] |
| 1 | Auto Generate Database Diagram MySQL\n\nI'm ti... | [mysql, database, database-design, diagram] | [mysql, database, diagram] |
| 2 | Plugin for Visual Studio to Mimic Eclipse's "O... | [visual-studio, plugins] | [visual-studio, plugins, c#] |
| 3 | How to create a tree-view preferences dialog t... | [c#, user-interface] | [c#, user-interface, .net, forms, mvp] |
| 4 | Territory Map Generation\n\nIs there a trivial... | [language-agnostic, maps, voronoi] | [javascript, html, maps, voronoi, language-agn... |
| 5 | How to autosize a textarea using Prototype?\n\... | [javascript, html, css, textarea, prototypejs] | [javascript, html, textarea, prototypejs, user... |
| 6 | Practical non-image based CAPTCHA approaches?\... | [security, language-agnostic, captcha] | [javascript, html, css, captcha, security, asp... |
| 7 | Calculate DateTime Weeks into Rows\n\nI am cur... | [c#, asp.net] | [c#, asp.net, forms, language-agnostic] |
| 8 | How do I restyle an Adobe Flex Accordion to in... | [apache-flex, actionscript-3] | [html, css, javascript, user-interface] |
| 9 | Have you ever reflected Reflector?\n\nLutz Roe... | [.net, reflector] | [reflector, .net] |
| 10 | How do I run (unit) tests in different folders... | [visual-studio, unit-testing, configuration, v... | [visual-studio, unit-testing, n-tier-architect... |
| 11 | What is the best way to write a form in ASP.NE... | [asp.net-mvc, forms] | [asp.net-mvc, forms] |
| 12 | How Do You Communicate Service Layer Messages/... | [c#, asp.net, exception, mvp, n-tier-architect... | [asp.net, exception, mvp, language-agnostic] |
| 13 | Cannot add a launch shortcut (Eclipse Plug-in)... | [eclipse, plugins] | [eclipse, plugins, java] |
| 14 | Why is my PDF footer text invisible?\n\nI'm cr... | [pdf, coldfusion] | [pdf, coldfusion, html, css] |
| 15 | Cannot delete, a file with that name may alrea... | [file, ftp, symlink] | [php, file, ftp, symlink] |
| 16 | Which PHP opcode cacher should I use to improv... | [php, performance, caching] | [php, performance, caching] |
| 17 | HTML Select Tag with black background - dropdo... | [html, css, firefox, drop-down-menu, html-select] | [html, css, firefox] |
| 18 | Is there a Java Console/Editor similar to the ... | [java, editor] | [java, editor] |
| 19 | Is there a lang-vb or lang-basic option for pr... | [javascript, vb.net, prettify] | [language-agnostic, prettify] |
Easy!
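To put a number on how well the inferred tags match the originals, one option is per-question precision and recall. A minimal sketch: `precision_recall` is a hypothetical helper, not part of llmpy, and the example rows are copied from the first three rows of the table above.

```python
def precision_recall(true_tags, inferred_tags):
    """Precision and recall of the inferred tags against the original tags."""
    true_set = set(true_tags)
    inferred_set = set(inferred_tags or [])  # the schema allows tags to be None
    overlap = len(true_set & inferred_set)
    precision = overlap / len(inferred_set) if inferred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


# (original tags, inferred tags) for rows 0-2 of the table above
rows = [
    (["statistics", "spss"], ["statistics", "spss"]),
    (["mysql", "database", "database-design", "diagram"], ["mysql", "database", "diagram"]),
    (["visual-studio", "plugins"], ["visual-studio", "plugins", "c#"]),
]

for true_tags, inferred in rows:
    p, r = precision_recall(true_tags, inferred)
    print(f"precision={p:.2f} recall={r:.2f}")
# precision=1.00 recall=1.00
# precision=1.00 recall=0.75
# precision=0.67 recall=1.00
```

Low precision means the model is adding tags the original author didn't use (often reasonable ones), and low recall means it is missing tags that were there.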
Next Steps #
What’s next for llmpy? I have a few things lined up, but feel free to get in touch with feedback or requests.
- Finish and document embedding utilities.
- Topic modelling example.
- Using the Batch API.
- Analytics engineering use-cases with dbt/dagster.