Boost your ML & AI accuracy with thousands of ready-to-use ML features from external data sources

Discover and integrate relevant new features, auto-generated
by LLMs and greedy feature engineering algorithms from
200+ public, community, and premium data sources,
including open & commercial LLMs




Trusted by data scientists and data engineers

❓Why use Upgini

Automated data source optimizations for ML models:

Automated feature generation for text fields with Large Language Model (LLM) data augmentation

Instructed embeddings generation using LLMs (such as GPT) with data augmentation from connected external sources.

If properly prompted with context from all relevant external data, an LLM produces significantly better embeddings for the text fields in a source.

Automated feature generation with specialized Graph NN and RNN


Automated feature generation for transactional and graph data sources through specialized RNN and Graph NN for accurate information extraction on sequences and object relationships in the data source.

OpenStreetMap is an example of a graph data source.

Ensembling multiple data sources to minimize data errors


Data is not perfect, and different sources, even with the same type of information, have their own errors.

Thus, if multiple sources with different error distributions are used, their ensemble will have better accuracy. This is similar to a consensus forecast.
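
The consensus effect can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not Upgini's actual ensembling method: three hypothetical sources report the same true value with independent, zero-mean Gaussian errors, and a plain average stands in for the ensemble.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

TRUE_VALUE = 100.0           # hypothetical ground truth, e.g. a regional statistic
NOISE_STD = [5.0, 5.0, 5.0]  # assumed error level of each source
N_RECORDS = 10_000


def rmse(errors):
    """Root mean squared error of a list of errors."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


source_errors = [[] for _ in NOISE_STD]
ensemble_errors = []

for _ in range(N_RECORDS):
    # Each source observes the true value plus its own independent noise.
    readings = [random.gauss(TRUE_VALUE, std) for std in NOISE_STD]
    for errs, reading in zip(source_errors, readings):
        errs.append(reading - TRUE_VALUE)
    # The ensemble here is a plain average of the sources.
    ensemble_errors.append(statistics.mean(readings) - TRUE_VALUE)

source_rmse = [rmse(errs) for errs in source_errors]
ensemble_rmse = rmse(ensemble_errors)

print("per-source RMSE:", [round(r, 2) for r in source_rmse])
print("ensemble RMSE:  ", round(ensemble_rmse, 2))
```

With independent errors of equal size, averaging n sources shrinks the error by roughly a factor of sqrt(n). Correlated errors reduce the benefit, which is why sources with different error distributions matter.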

Iterative search with automatic search key augmentation from all connected sources

For example, if you lack geographic location information for an IP address, Upgini will search for cross-mapping of IDs in the sources.

If it finds the relevant information, it will automatically add a new search key - in this case, the postal code for each IP. This enables searching through all geo data sources in addition to IP sources.
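
The IP-to-postal-code example above can be sketched as a two-pass lookup. Everything here is a toy assumption for illustration: the lookup tables, the feature name `poi_count_1km`, and the function names are hypothetical and do not describe Upgini's internals.

```python
# Hypothetical lookup tables standing in for connected data sources.
IP_TO_POSTAL = {"203.0.113.7": "10001", "198.51.100.23": "94105"}
POSTAL_TO_GEO_FEATURES = {
    "10001": {"poi_count_1km": 412},
    "94105": {"poi_count_1km": 388},
}


def augment_search_keys(record):
    """Return a copy of the record with a postal code added if the IP maps to one."""
    augmented = dict(record)
    if "postal_code" not in augmented:
        postal = IP_TO_POSTAL.get(augmented.get("ip"))
        if postal is not None:
            augmented["postal_code"] = postal
    return augmented


def enrich(record):
    """Augment the search keys first, then look up geo features by postal code."""
    augmented = augment_search_keys(record)
    features = POSTAL_TO_GEO_FEATURES.get(augmented.get("postal_code"), {})
    return {**augmented, **features}


rows = [{"ip": "203.0.113.7"}, {"ip": "192.0.2.1"}]
enriched_rows = [enrich(row) for row in rows]
for row in enriched_rows:
    print(row)
```

The first row gains a postal code and, through it, a geo feature. The second row has no cross-mapping, so it passes through unchanged.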

Generative AI for Enrichment & Automated Feature Engineering

How it works

Upgini Generative AI can automatically enrich any text fields with relevant facts from external data sources and generate ready-to-use numeric features from enriched representations of text fields.

Upgini Gen AI does this in the following steps:

(1) Finds entities in the text to match facts from external data sources. Simple examples are the company name, car model, product title, and place of interest (POI) name.

(2) Detects contextual information for extracted entities from external sources. A simple example is a geographical location for a company.

(3) Generates enriched embeddings for text fields using a process similar to Retrieval-augmented generation (RAG) with facts fetched from external sources and specific contextual information.

These enriched embeddings for text fields can be used as numeric features to enhance the accuracy of downstream ML models.
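
The three steps above can be sketched with a toy pipeline. This is an illustration under stated assumptions, not Upgini's implementation: the fact base, the context table, and all function names are hypothetical, entity matching is a naive substring search, and the final embedding call is left out.

```python
# Hypothetical fact base standing in for external data sources.
EXTERNAL_FACTS = {
    "acme corp": {"industry": "industrial equipment", "country": "US"},
}
# Hypothetical contextual data keyed by a fact value (here, the country).
CONTEXT = {"US": {"region": "North America"}}


def find_entities(text):
    """Step 1: match known entity names in the text (naive substring match)."""
    lowered = text.lower()
    return [name for name in EXTERNAL_FACTS if name in lowered]


def detect_context(entity):
    """Step 2: fetch contextual information linked to the entity's facts."""
    country = EXTERNAL_FACTS[entity].get("country")
    return CONTEXT.get(country, {})


def build_enriched_text(text):
    """Step 3 (input side): append retrieved facts and context to the text."""
    parts = [text]
    for entity in find_entities(text):
        facts = {**EXTERNAL_FACTS[entity], **detect_context(entity)}
        described = ", ".join(f"{k}={v}" for k, v in sorted(facts.items()))
        parts.append(f"[{entity}: {described}]")
    return " ".join(parts)


enriched = build_enriched_text("Support call about an order from Acme Corp")
print(enriched)
```

An embedding model would then encode the enriched string instead of the raw text, which is the part that resembles retrieval-augmented generation.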

Example

We want to improve the accuracy of an ML model that predicts the probability that a specific client's product usage will decline (known as an attrition or churn model).

For every client in a labeled training dataset, in addition to numeric ML features, we have transcriptions of all calls to support, the history of the client's chats with support, and purchased product reviews from the website.

Simply pass this text as columns in a labeled dataset during the search process, and Upgini will automatically enrich these columns with relevant external sources and generate enriched embeddings.

Accuracy

Enriched embeddings from Upgini Generative AI are more accurate than embeddings from top commercial embedders, such as OpenAI’s ada-002.

More details on the comparison can be found in the Medium publication.

Connected data sources

200+ Public, Community and Premium sources
239 countries
40 years of data history

🌐 Public data

Historical weather & Climate normals for postal/ZIP code

68 countries
22 years history
Monthly update

Air temperature
Precipitation
Wind
Air pressure
Normals
Sun hours
Moon phase

Location/Places/POI/Area/Proximity
from OpenStreetMap
for postal/ZIP code
221 countries
2 years history
Monthly update

POI categories:
Schools, restaurants, hotels, supermarkets, etc.
Houses:
Residential buildings, business centers, etc.
Transport infrastructure:
Roads, public transport stops, etc.
Public facilities:
Gov. offices, post offices, police, etc.
Natural features:
Public parks, green areas, etc.
Stats for different distances (1 km / 3 km / 5 km)
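
Distance-based statistics of this kind can be sketched as POI counts within each radius of a postal code centroid. This is a minimal illustration, assuming great-circle (haversine) distance and made-up coordinates; the feature naming scheme and the `poi_counts` helper are hypothetical, not the actual feature definitions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_counts(center, pois, radii_km=(1, 3, 5)):
    """Count POIs per category within each radius of the center point."""
    counts = {}
    for category, lat, lon in pois:
        distance = haversine_km(center[0], center[1], lat, lon)
        for radius in radii_km:
            if distance <= radius:
                key = f"{category}_within_{radius}km"
                counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical postal code centroid and nearby POIs (category, lat, lon).
center = (52.5200, 13.4050)
pois = [
    ("school", 52.5230, 13.4050),      # roughly 0.33 km north
    ("restaurant", 52.5400, 13.4050),  # roughly 2.2 km north
    ("school", 52.5600, 13.4050),      # roughly 4.4 km north
    ("hotel", 52.6200, 13.4050),       # roughly 11 km north, outside all radii
]
features = poi_counts(center, pois)
print(features)
```

Each count becomes one numeric feature per category and radius, so a single postal code yields a small feature vector describing its surroundings.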

International holidays & events, Workweek calendar

232 countries
22 years history
Monthly update

Workweek calendars by country
Public holidays / Observed holidays
Religious holidays
Sporting events
Political events

Consumer Confidence index


44 countries
22 years history
Monthly update

World economic indicators

191 countries
41 years history
Monthly update

Consumer Price index
GDP
Central Bank Rates
Commodity prices

Markets data

17 years history
Monthly update

Stock prices
Stock volumes
Currencies and exchange rates
Market indexes

👩🏻‍💻 Community shared data

World demographic data
for postal/ZIP code


2 sources for ensemble
90 countries
Annual update

Residential population
Income
Home value
Home ownership
Employment
Industries
Occupations
Population mobility

Public social media profile data
for email & phone


600+ mln phones
350+ mln emails
104 countries
Monthly update

Estimated age
Gender, nationality
Residence & zip/postal code
Marital status
Employer, job title
Duration of employment
Interests

World mobile & fixed broadband network coverage and performance
for postal/ZIP code

4 sources for ensemble
167 countries
Monthly update

Mobile network coverage statistics
Fixed broadband and mobile network performance metrics - download/upload speed, latency
Estimated number of mobile phones & PCs
Statistics for different distances (1 km / 3 km / 5 km)

Car ownership data and parking statistics
for postal/ZIP code, email & phone

3 countries
Annual update

Car Brand
Car Model
Year statistics
Parking statistics by brand and model

Geolocation profile
for IPv4 & phone


6 sources for ensemble
2^32 IP addresses
600+ mln phones
239 countries
Monthly update

Country
Region
City
Postal/ZIP code
ISP / ASN
Proxy/VPN/Datacenter flag for IP

World house prices
for postal/ZIP code


3 sources for ensemble
2 countries
Annual update

House price index for countries
House price index for zip/postal code

🛒 Premium data providers

Don’t see the data source you need?
Let us know and we’ll add it!

🔎 Search and enrichment for 6 entity types

Date or DateTime
Country: ISO 3166 codes
Postal/ZIP code: 900 000+ unique codes
Phone number: 600 mln+ phone numbers
Hashed email (HEM): 350 mln+ emails
IP address: 2^32 IP addresses
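
A hashed email key lets you search without sharing raw addresses. The sketch below assumes a common convention, SHA-256 over the trimmed, lowercased address; the exact normalization and hash that Upgini expects is an assumption here and should be checked against its documentation. The `hashed_email` helper is hypothetical.

```python
import hashlib


def hashed_email(email):
    """Return the SHA-256 hex digest of a trimmed, lowercased email address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Differently formatted copies of the same address produce the same key.
hem_a = hashed_email("Jane.Doe@Example.com ")
hem_b = hashed_email("jane.doe@example.com")
print(hem_a == hem_b, len(hem_a))
```

Normalizing before hashing matters: without it, the same address written with different casing or stray whitespace would hash to unrelated keys and fail to match.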

🏁 Get started with Python

Step by step guide

#1

Install Upgini library

... from PyPI and check out our documentation on GitHub (it's open-source)

#2

Select data enrichment keys
and initiate feature search

You can reuse your existing labeled training dataset
Only relevant features that improve the model metric (ROC AUC, RMSE, etc.) are returned, not features that merely correlate with the target variable.
Available without an API key or with a free API key.
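
Selecting keys and starting a search might look like the sketch below. The `FeaturesEnricher` class, the `SearchKey` enum, and the `fit`/`transform` calls reflect the open-source library's README as best understood here; treat them as assumptions and verify against the current documentation. The `choose_search_keys` helper and its column-name rules are hypothetical and only illustrate the key selection step.

```python
def choose_search_keys(columns):
    """Map dataset column names to search key type names by simple naming rules.

    The column-name hints are illustrative assumptions, not an Upgini feature.
    """
    hints = {
        "date": "DATE",
        "country": "COUNTRY",
        "postal_code": "POSTAL_CODE",
        "phone": "PHONE",
        "hem": "HEM",
        "ip": "IP",
    }
    return {col: hints[col] for col in columns if col in hints}


def run_feature_search(X, y, key_names):
    """Start a feature search; needs the upgini package and network access."""
    # Imported lazily so the helper above works without the library installed.
    from upgini import FeaturesEnricher, SearchKey

    search_keys = {col: getattr(SearchKey, name) for col, name in key_names.items()}
    enricher = FeaturesEnricher(search_keys=search_keys)
    enricher.fit(X, y)            # search for features relevant to the target
    return enricher.transform(X)  # X with the found features added


key_names = choose_search_keys(["date", "country", "clicks", "postal_code"])
print(key_names)
```

Only `choose_search_keys` runs here. `run_feature_search` is defined but not called, since it needs a labeled dataset and a connection to the search service.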

#3

Enrich ML model with new features and retrain

10-25% accuracy improvement over baseline results from mainstream AutoML frameworks

#4

Add external features into production ML pipeline

Enrich production datasets with up-to-date features and data

Contact us

Our team of ML and AI experts will be happy to answer your questions