NuGet.Insights

Analyze NuGet.org packages 📦 using Azure Functions.

This project enables you to write a bit of code that will be executed for each package on NuGet.org in parallel. The results of the code are collected into CSV files stored in Azure Blob Storage. These CSV files can be imported into any query system you want for easy analysis. This project is about building those CSV blobs in a fast, scalable, and reproducible way, as well as keeping those files up to date.
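
As a rough illustration, the "bit of code" you write usually boils down to mapping one package to one or more flat records that serialize naturally as CSV rows. The type and method names below are hypothetical, not the project's actual API:

    using System.Collections.Generic;

    // Hypothetical per-package analysis result: one CSV row per
    // (package, target framework) pair found in the .nupkg.
    public record PackageTargetFrameworkRecord(
        string Id,               // package ID, e.g. "Newtonsoft.Json"
        string Version,          // normalized version, e.g. "13.0.3"
        string TargetFramework); // one row per TFM

    public static class PackageAnalyzer
    {
        public static IEnumerable<PackageTargetFrameworkRecord> Analyze(
            string id, string version, IReadOnlyList<string> targetFrameworks)
        {
            foreach (var tfm in targetFrameworks)
            {
                yield return new PackageTargetFrameworkRecord(id, version, tfm);
            }
        }
    }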

The data sets are great for:

  • 🔎 Ad-hoc investigations of the .NET ecosystem
  • 🐞 Estimating the blast radius of a bug affecting NuGet packages
  • 📈 Checking trends over time on NuGet.org
  • 📊 Looking at the adoption of various NuGet or .NET features

The data sets currently produced by NuGet.Insights are listed in docs/tables/README.md.

Quickstart

We follow a three-step process to go from nothing to a fully deployed Azure solution.

  1. Build the code
  2. Deploy to Azure
  3. Start analysis from the admin panel

Build the code

  1. Ensure you have the .NET 6 SDK installed. Install it if needed.
    dotnet --info
    
  2. Clone the repository.
    git clone https://github.com/NuGet/Insights.git
    
  3. Run dotnet publish on the website and worker projects. This produces compiled directories that can be deployed to Azure later.
    cd Insights
    dotnet publish src/Worker -c Release
    dotnet publish src/Website -c Release
    

Deploy to Azure

PowerShell is used for the following steps. I have tested Windows PowerShell 5.1 as well as PowerShell 7.1.3 on both Windows and Linux.

  1. Ensure you have the Az PowerShell modules installed (install them if needed) and sign in to Azure.
    Connect-AzAccount
  2. Ensure you have Bicep installed. Install it if needed.
    bicep --version
    
  3. Ensure you have the desired Azure subscription selected.
    Set-AzContext -Subscription $mySubscriptionId
  4. From the root of the repo, deploy with the desired config and stamp name.
    ./deploy/deploy.ps1 -ConfigName dev -StampName Joel -AllowDeployUser
    If you run into trouble, try adding the -Debug option to get more diagnostic information.

This will create a new resource group named NuGet.Insights-{StampName} and deploy several resources into it, including:

  • an App Service, containing a website for starting scans
  • a Function App with Consumption plan, for running the scans
  • a Storage account, for maintaining intermediate state and results (CSV files)
  • an Application Insights instance, for investigating metrics and error logs
  • a Key Vault for auto-rotating the storage access key

Start analysis from the admin panel

When the deployment completes successfully, a website URL will be reported in the console as part of a warm-up step. You can use this URL to access the admin panel. The end of the output looks like this:

...
Warming up the website and workers...
https://nugetinsights-joel.azurewebsites.net/ - 200 OK
https://nugetinsights-joel-worker-0.azurewebsites.net/ - 200 OK

Open the first URL (the website URL) in your web browser and click on the Admin link in the nav bar. Then, you can start a short run using the "All catalog scans" section, the "Use custom max" checkbox, and the "Start all" button.

For more information about running catalog scans, see Starting a catalog scan.

Running locally

Use one of the following approaches to run Insights locally. Using Project Tye is the easiest if you have Docker installed; otherwise, use a standalone Azure Storage emulator.

Using Project Tye

From Project Tye's GitHub page:

Tye is a developer tool that makes developing, testing, and deploying microservices and distributed applications easier. Project Tye includes a local orchestrator to make developing microservices easier and the ability to deploy microservices to Kubernetes with minimal configuration.

It's a great way to run the Insights website, worker, and the Azurite storage emulator all at once with a single command.

  1. Clone the Insights repository.
  2. Install Project Tye if you haven't already.
  3. Make sure you have Docker installed since it is used for running Azurite.
  4. Execute tye run in the root of the repository.
  5. Open the Tye dashboard using the URL printed to stdout, e.g.
    Dashboard running on http://127.0.0.1:8000
    
  6. From the Tye dashboard, you can navigate to the website URL (shown in the Bindings).

Proceed to the Starting a catalog scan section.

Using a standalone Azure Storage emulator

  1. Clone the repository.
  2. Install and start an Azure Storage emulator for Blob, Queue, and Table storage.
    • Azurite: can run from VS Code, npm, and more; make sure to use version 3.19.0 or newer.
    • Azure Storage Emulator: this emulator only works on Windows and is deprecated.
  3. Execute dotnet run --project src/Worker from the root of the repository.
  4. From another terminal window, run dotnet run --project src/Website from the root of the repository.
    • The website and the worker don't necessarily need to run in parallel, but it's easier to watch the progress if you leave both running.
  5. Open the website URL printed to stdout, e.g.
    Now listening on: http://localhost:60491
    

Proceed to the Starting a catalog scan section.

Starting a catalog scan

A catalog scan is a unit of work for Insights which runs analysis against all of the packages published during some time range. The time range for a catalog scan is bounded by the previous NuGet V3 catalog commit timestamp used (as an exclusive minimum) and an arbitrary timestamp to process up to (as an inclusive maximum). For more information, see the architecture section.
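
In other words, a catalog leaf is included in a scan when its commit timestamp passes an exclusive lower bound and an inclusive upper bound. A minimal sketch of that check (not code from the project):

    using System;

    static bool IsInScanRange(
        DateTimeOffset leafCommitTimestamp,
        DateTimeOffset min,  // last commit timestamp already processed (exclusive)
        DateTimeOffset max)  // timestamp to process up to, e.g. the "custom max" (inclusive)
    {
        return min < leafCommitTimestamp && leafCommitTimestamp <= max;
    }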

Once you have opened the website URL in your web browser of choice, follow these steps to start your first catalog scan from the Insights admin panel.

  1. In your web browser, viewing the website URL, click on the "Admin" link in the navigation bar.
  2. Start some catalog scans.
    • For your first try, run a single driver against a single NuGet V3 catalog commit.
      • Expand the Load package archive section.
      • Check Use custom max.
      • Use the default value of 2015-02-01T06:22:45.8488496Z, which is the very first commit timestamp in the NuGet V3 catalog.
      • Click Start.
    • You can start all of the catalog scans with the same timestamp using the "All catalog scans" section but this will take many hours while running on your local machine. There are a lot of drivers and a lot of packages on NuGet.org 😉.
  3. Make sure the background worker is running (either via Tye or by starting the Worker project from the terminal).
  4. Wait until the catalog scan is done. You can check the current progress by refreshing the admin panel and looking at the number of messages in the queues (first section in the admin panel) or by looking at the catalog scan record created in the previous step.
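
If you prefer to check queue progress from code instead of the admin panel, the Azure Storage SDK exposes an approximate message count. This is a sketch against the local emulator; the queue name below is a placeholder, not necessarily the name Insights uses:

    using System;
    using Azure.Storage.Queues; // Azure.Storage.Queues NuGet package

    // "UseDevelopmentStorage=true" targets the local Azurite/storage emulator.
    var queue = new QueueClient("UseDevelopmentStorage=true", "insights-work"); // placeholder name
    var properties = queue.GetProperties().Value;
    Console.WriteLine($"~{properties.ApproximateMessagesCount} messages remaining");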

If you ran a driver like Load package archive, data will be populated into your Azure Table Storage emulator in the packagearchives table. If you ran a driver like Package asset to CSV, CSV files will be populated into your Azure Blob Storage emulator in the packageassets container.
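
If you would rather inspect these results from code than from a storage browser, a short sketch like this works against the local emulator (the table and container names are the ones described above; the record schemas themselves are the project's own and not shown here):

    using System;
    using System.Linq;
    using Azure.Data.Tables;   // Azure.Data.Tables NuGet package
    using Azure.Storage.Blobs; // Azure.Storage.Blobs NuGet package

    // "UseDevelopmentStorage=true" targets the local Azurite/storage emulator.
    var table = new TableClient("UseDevelopmentStorage=true", "packagearchives");
    foreach (var row in table.Query<TableEntity>().Take(10))
    {
        Console.WriteLine($"{row.PartitionKey} / {row.RowKey}");
    }

    var container = new BlobContainerClient("UseDevelopmentStorage=true", "packageassets");
    foreach (var blob in container.GetBlobs())
    {
        Console.WriteLine($"{blob.Name} ({blob.Properties.ContentLength} bytes)");
    }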

You can use the Azure Storage Explorer to interact with your Azure Storage endpoints (either the storage emulator running locally or in Azure).

When running locally, you can check the application logs shown in the Tye dashboard or in the terminal stdout. When running in Azure, you can use Application Insights (note that the default logging level is Warning or higher to reduce cost). You can also look at the Azure Queue Storage queues to understand what sort of work the Worker has left.

Documentation

  • Tables - documentation for all of the data tables produced by this project
  • Adding a new driver - a guide to help you enhance Insights to suit your needs
  • Reusable classes - interesting or useful classes or concepts supporting this project
  • Blog posts - blog posts about lessons learned from this project
  • Cost - approximately how much it costs to run several of the implemented catalog scans

Projects

Here's a high-level description of the main projects in this repository:

  • Worker - the Azure Function itself, a thin adapter between core logic and Azure Functions
  • Website - a website providing an admin panel to manage scans
  • Worker.Logic - all of the catalog scan and driver logic; this is the most interesting project
  • Logic - more generic logic related to the NuGet.org protocol, not directly tied to distributed processing

Other projects are:

  • Forks - download, patch, and list code from other open source projects
  • SourceGenerator - AOT source generation logic for reading and writing CSVs
  • Tool - a command-line app used for pretty much just prototyping code

Architecture

The purpose of this repository is to explore the characteristics, oddities, and inconsistencies of NuGet.org's available packages.

Fundamentally, the project uses the NuGet.org catalog to enumerate all package IDs and versions. For each ID and version, some unit of work is performed. This unit of work can be some custom analysis that you want to do on a package. There are some helper classes to write the results out to big CSVs for importing into Kusto or the like, but in general you can do whatever you want per package.

The custom logic to run on a per-package (or per catalog leaf/page) basis is referred to as a "driver".

The enumeration of the catalog is called a "catalog scan". A catalog scan covers a specified time range in the catalog, with respect to the catalog commit timestamp. It finds all catalog leaves between the provided min and max commit timestamps and then executes a "driver" for each package ID and version found.
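
Conceptually, a driver's contract looks something like the sketch below. This is a simplified, hypothetical shape for illustration only; the real interfaces in Worker.Logic have different names and more members (see Adding a new driver):

    using System;
    using System.Threading.Tasks;

    // Hypothetical, simplified driver shape. A leaf item identifies one
    // package version at one catalog commit.
    public record CatalogLeafItem(
        string PackageId,
        string PackageVersion,
        DateTimeOffset CommitTimestamp);

    public interface IPackageDriver
    {
        // Called once per package ID and version found by the catalog scan.
        // Implementations download or inspect the package and persist their
        // results, e.g. as CSV records or table rows.
        Task ProcessLeafAsync(CatalogLeafItem leaf);
    }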

All work is executed in the context of an Azure Function that reads a single worker queue (Azure Storage Queue).

The general flow of a catalog scan is:

  1. Download the catalog index.
  2. Find all catalog pages in the time range.
  3. For each page, enumerate all leaf items in the time range.
  4. For each leaf item, write the ID and version to Azure Table Storage to find the latest leaf.
  5. After all leaf items have been written to Table Storage, enqueue one message per row.
  6. For each queue message, execute the driver.

Note there is an option to disable step 4 and run the driver for every single catalog leaf item. Depending on the logic of the driver, this may yield duplicated effort and is often not desired.
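
Steps 1 and 2 use the public NuGet V3 catalog protocol. A minimal sketch of reading the catalog index and selecting the pages that may contain leaves in a commit timestamp range, using HttpClient and System.Text.Json directly rather than the project's own client code:

    using System;
    using System.Linq;
    using System.Net.Http;
    using System.Text.Json;

    var min = DateTimeOffset.Parse("2015-02-01T06:22:45.8488496Z"); // exclusive
    var max = DateTimeOffset.Parse("2015-02-02T00:00:00.0000000Z"); // inclusive

    using var http = new HttpClient();
    using var index = JsonDocument.Parse(
        await http.GetStringAsync("https://api.nuget.org/v3/catalog0/index.json"));

    // Each page entry records the newest commit timestamp it contains, so a page
    // can only hold in-range leaves if that timestamp is newer than min. The
    // exclusive-min / inclusive-max check on individual leaf items happens after
    // each page is downloaded.
    var pageUrls = index.RootElement.GetProperty("items")
        .EnumerateArray()
        .Where(page => page.GetProperty("commitTimeStamp").GetDateTimeOffset() > min)
        .Select(page => page.GetProperty("@id").GetString())
        .ToList();

    Console.WriteLine($"{pageUrls.Count} catalog pages may contain leaves in ({min:O}, {max:O}].");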

The implementation is geared towards Azure Functions Consumption Plan for compute (cheap) and Azure Storage for persistence (cheap).

Workflow

The driver code is chained together with other operational tasks in a sequence of steps called a workflow. The workflow is run on a regular cadence (e.g. daily). The workflow performs these steps for each iteration:

  1. Run all catalog scans to read the latest information from the NuGet.org catalog.
    • Some catalog scans can run in parallel, others depend on each other.
  2. Clean up orphan records.
    • Example of orphan record: a certificate that was only referenced by a package that was deleted.
  3. Update auxiliary files.
    • These data sets contain some info about all packages in a single file.
  4. Import the updated CSVs into Kusto (Azure Data Explorer).
    • This performs an import of all CSV blobs to new tables and then does an atomic table swap.

If any of these steps does not complete, the workflow hangs and no further workflows can start.

Drivers

The current drivers for analyzing NuGet.org packages are:

Several other supporting drivers exist to populate storage with intermediate results:

Several message processors exist to emit other useful data:

Several message processors are used for aggregating or automating other processes:

  • CsvCompact: aggregates CSV records saved to Table Storage into partitioned CSV blobs
  • KustoIngestion: orchestrates ingestion and validation of CSV blobs into Kusto (Azure Data Explorer) tables
  • CleanupOrphanRecords: removes records that are marked as orphans from the ReferenceTracking tables
  • Workflow: orchestrates the entire workflow (as mentioned in the Architecture section above)

Screenshots

Resources in Azure

These are what the resources look like in Azure after deployment.

Azure resources

Azure Function running locally

This is what the Azure Function looks like running locally, for the Package Manifest to CSV driver.

Azure Function running locally

Results running locally

This is what the results look like in Azure Table Storage. Each row is a package .nuspec stored as compressed MessagePack bytes.

Results running locally

Admin panel

This is what the admin panel looks like to start catalog scans.

Admin panel

Load Package Archive

This is the driver that reads the file list and package signature from all NuGet packages on NuGet.org and loads them into Azure Table Storage. It took about 35 minutes to do this and cost about $3.37.

Azure Functions Execution Count

Azure Functions Execution Units

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
