For a while now, we’ve been working on some major improvements to search. Last week we deployed these improvements to production on http://www.nuget.org. In this post, I’ll describe how the new service works. However, before we discuss the new service, let’s step back a bit and discuss the history of Search on NuGet.org.
Search via SQL
Our first search implementation was built on our existing OData endpoints. We simply used the OData query operators to filter data in our database by the user’s query. This had two major problems. First, it was fairly inefficient. SQL is a good engine for data lookup, but by default it is not as efficient at the full-text search operations we wanted on NuGet.org. Second, it was inaccurate. Again, relational databases are great at lookup and querying, but only when the query is similarly structured. Our queries to SQL ended up as just a series of LIKE comparisons for each field in our database.
So, we opted to use a tool that was designed explicitly for full-text search to augment our SQL database.
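To make the limitation concrete, here is a minimal sketch of what a search built from LIKE comparisons looks like. It uses SQLite as a stand-in database, and the table and column names (packages, id, title, description, tags) are hypothetical, not the gallery’s actual schema.

```python
import sqlite3


def like_search(conn, term):
    """Match the term against every text column with LIKE (no relevance ranking)."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM packages "
        "WHERE id LIKE ? OR title LIKE ? OR description LIKE ? OR tags LIKE ? "
        "ORDER BY id",
        (pattern, pattern, pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (id TEXT, title TEXT, description TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?, ?, ?)",
    [
        ("Newtonsoft.Json", "Json.NET", "High-performance JSON framework", "json"),
        ("JsonLogViewer", "Log viewer", "Reads logs stored as json lines", "logging"),
        ("jQuery", "jQuery", "JavaScript library", "javascript"),
    ],
)

results = like_search(conn, "json")
print(results)
```

Every row that contains the term anywhere counts as an equal match, so the database has no basis for putting the most relevant package first. The leading wildcard in each pattern also tends to prevent ordinary indexes from helping, which is one reason this style of query scales poorly.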
Lucene on the Web Server
Our second implementation was very much seen as an optimization for a very specific pair of cases: The search box in the NuGet Visual Studio Dialog, and the search box on the NuGet.org Website.
Both of those views focus solely on the latest version of a package. So, our initial optimization was to take the latest version of each package and add it to a Lucene.NET powered index running off the local file system of each web server. Each machine had a background task running that synced its local Lucene index with data from the database. Search queries were then executed against that index. The result was a dramatic speed and relevance improvement. Lucene provides very powerful boosting and scoring extension points and we integrated some of our statistics into that process.
However, there were still a few problems:
- Each machine had its own copy of the index, meaning a machine could get out of sync with the others
- Our scoring algorithms were based on total download counts, meaning new packages had a tough time getting noticed next to download-count behemoths that have been in the gallery for years.
- We wanted to expand the kinds of queries we could do with the Lucene index, but because the index only contained the latest version of each package, we were restricted to working with that set.
- Finally, having the index on the web server was useful, but constrained us to maintaining a smaller index in order to avoid stressing the web server too much.
To solve those problems, we started work from the ground up on a new search infrastructure.
NuGet Search Service
To combat the issues we had with Lucene integrated into the web server, we developed an entirely new Search Service, written from the ground up. The service, like all NuGet.org code, is completely open-source on GitHub. The new Search Service is a separate HTTP service that is responsible for answering search queries. By moving search to its own set of machines, we hoped to reduce and even remove most of the issues we encountered with the previous search models.
First, the search service stores the master copy of the Lucene index in Azure Blob Storage using the Azure Directory for Lucene.NET library. Of course, accessing the index from Blob Storage would be very inefficient, so the entire index is kept in memory on the Search Service machines at all times. It is frequently synced with the Blob Storage copy, but in general, most queries should be served directly from memory. Storing the index in a central location allows it to be updated in a single location, while the memory cache allows queries to be served very quickly. We track round-trip times between the Gallery and this new Search Service and the average seems to be hovering around 80ms, which is as fast as most of our database queries (and even faster than some of the heavier ones!). Storing the authoritative copy of the index in blob storage also gives us a few major benefits. For example, we can easily spin up new Search Service nodes and they will just grab the latest index from blob storage (loading the whole index into memory takes about 2-5 minutes). Also, our write operations into the index (Adding new packages, deleting packages, updating existing packages, etc.) can be centralized and need only update the blobs.
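The pattern described above (one authoritative copy in central storage, a full copy in memory on each node, and a periodic sync) can be sketched roughly as follows. This is only an illustration: the two callables stand in for blob storage access and are hypothetical, and the real service uses Lucene.NET with the Azure Directory library rather than anything shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InMemoryIndex:
    """Serves queries from memory and reloads only when the stored copy changes."""

    fetch_version: Callable[[], int]          # cheap check against central storage
    fetch_documents: Callable[[], List[str]]  # full download of the index contents
    _version: int = -1
    _documents: List[str] = field(default_factory=list)

    def sync(self) -> bool:
        """Reload the in-memory copy if the central copy has a new version."""
        latest = self.fetch_version()
        if latest == self._version:
            return False
        self._documents = self.fetch_documents()
        self._version = latest
        return True

    def search(self, term: str) -> List[str]:
        """Answer a query from memory only; central storage is never touched here."""
        needle = term.lower()
        return [doc for doc in self._documents if needle in doc.lower()]


# A dict plays the role of blob storage in this sketch.
store = {"version": 1, "documents": ["jQuery", "EntityFramework"]}
index = InMemoryIndex(lambda: store["version"], lambda: list(store["documents"]))

index.sync()
first = index.search("jquery")

# A back-end writer updates the central copy; the node sees it on its next sync.
store["documents"].append("jQuery.UI")
store["version"] = 2
stale = index.search("jquery")
reloaded = index.sync()
fresh = index.search("jquery")
```

The trade-off is that a node can briefly serve slightly stale results between syncs, in exchange for never paying the storage round trip on the query path.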
We also dramatically increased the scope of the Index. It now contains every single version of every package ever uploaded (see note below). At the time of this post, that comes to 254,886 documents (the number only differs from the total package count on the Gallery due to packages being unlisted by their authors, something which we frequently do with test packages as well ;)). Despite the number of documents, the total index size comes out to around 600MB. Of course, when the index was on the web server, this would be an unacceptable amount of memory pressure to add, but on the Search Service, we have free rein over the entire machine! As the index grows, we can safely continue to scale up the available memory by moving to more memory-intensive Azure VM profiles. Of course, scaling up isn’t a perfect solution, so we will continue to monitor memory growth, but at our growth rates, it’s going to be the ideal solution for a long time.
Note: We do still hide unlisted packages from search queries. However, this is a good time to remind everyone that unlisting is not a secure way to remove data from NuGet. It is a mechanism to reduce a package’s visibility, not a way to prevent download of your package. If you need your package to be completely removed, use the Contact Support link on your package page to request that we delete the package.
Lastly, we overhauled our scoring and analysis algorithms. This work started back in June of 2013, with the release of NuGet 2.6. In that release, the client began supplying an additional HTTP header when it requested a package for download: NuGet-Operation. This header contained a value indicating what the user was doing in order to cause the package to be downloaded. Once NuGet 2.6 started sending this data, we began collecting download data into a data warehouse and categorizing it based on many different aspects, including this Operation value. This began to manifest in the Package Statistics pages you can view from each Package detail page. The next step was to take this data and use it in scoring.
Whereas our previous algorithm scored results based on total download count, our new algorithm scores them based on “Recent Installs.” Specifically, it uses the number of downloads requested with a NuGet-Operation value of Install within the past 6 weeks. This allows really popular packages like jQuery, EntityFramework and Newtonsoft.Json to remain fairly high up, because they are being installed frequently, but also allows new packages to climb the ranks a little faster by shortening the time window and giving them a chance to catch up. Filtering by the Install operation also allows us to filter out the noise caused by build servers using Package Restore to download packages on every build.
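As a rough illustration of the counting step, the sketch below tallies downloads per package for one operation value inside a 6-week window. The record layout, the sample data, and the function name are assumptions made for the example; the real pipeline runs against the data warehouse and feeds the result into Lucene scoring, which is not shown.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(weeks=6)


def recent_counts(downloads, operation, now):
    """Count downloads per package for one operation value within the window."""
    cutoff = now - WINDOW
    counts = Counter()
    for package_id, op, timestamp in downloads:
        if op == operation and timestamp >= cutoff:
            counts[package_id] += 1
    return counts


# Hypothetical download records: (package id, NuGet-Operation value, time).
now = datetime(2014, 4, 10)
downloads = [
    ("jQuery", "Install", datetime(2014, 4, 1)),
    ("jQuery", "Restore", datetime(2014, 4, 2)),   # package restore noise, ignored
    ("jQuery", "Install", datetime(2013, 12, 1)),  # older than 6 weeks, ignored
    ("NewPackage", "Install", datetime(2014, 4, 5)),
    ("NewPackage", "Install", datetime(2014, 4, 8)),
]

# "Install" is the operation value assumed for this example.
counts = recent_counts(downloads, "Install", now)
ranking = [package_id for package_id, _ in counts.most_common()]
print(ranking)
```

In this toy data the newer package outranks the older one because only its recent, user-initiated downloads are counted; a total download count would have ranked them the other way around.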
John Taylor, an engineer on our team, spent the last few months of 2013 diving into Lucene scoring and fiddling with parameters trying to nail down some of the best ways to score results. We had a few smaller-scale tests where we released some sample algorithms to progressively larger groups of people to get feedback. In the end, we managed to nail down an algorithm which gave us confidence that we could handle most of our requests efficiently and accurately.
Still, testing with a broad audience doesn’t cover everything, so we know there are going to be gaps. Please do not hesitate to give us feedback on our results by filing bugs or pinging us on Twitter. Tuning our search algorithm will be a never-ending process so keep telling us what you think!
This blog post was planned to be published on April 2nd as the NuGet 2.8.1 release announcement. However, on that same day (also the first day of Build 2014), NuGet.org suffered a severe service interruption. It didn’t seem right to blog about the NuGet 2.8.1 release without also covering the interruption, so we waited a day and combined the posts.
NuGet 2.8.1 Released with Windows Phone 8.1 Support
Let’s cover the fun stuff first! On April 2nd, we released NuGet 2.8.1 to the Visual Studio Extension Gallery. You can get the updates from within Visual Studio’s Extensions and Updates dialog, or directly from the extension gallery. We also published NuGet.exe 2.8.1.
Here are the downloads:
- Visual Studio 2013: Visual Studio Extension Gallery
- Visual Studio 2010 and 2012: Visual Studio Extension Gallery
- Command-Line Utility: Direct Download
NuGet 2.8.1 includes support for Windows Phone 8.1, including both Silverlight-based libraries and WinRT-based libraries for Universal Apps. For Silverlight-based Windows Phone 8.1 libraries, packages use the “wp81” framework name. For WinRT-based Windows Phone App 8.1 libraries, packages use the “wpa81” framework name.
In addition to the Windows Phone 8.1 support, we also fixed over a dozen bugs–mostly in nuget.exe. See the release notes for other details about the release.
April 2nd-3rd Downtime
As we tweeted about, we learned we had a few more vulnerabilities to heavy load than we had previously understood. Perhaps due to the Build conference, or perhaps just a coincidence, www.nuget.org was experiencing unusually high browser traffic early in the morning on both April 2nd and April 3rd. The extra load ultimately led to interruptions in most of our services.
During the times of interruption, the following services were impacted:
- The website’s Packages page showed 0 packages
- Search on the website and in Visual Studio reported no packages
- The feed for Visual Studio reported no packages
- Users were unable to sign into the website or upload packages
- Some users’ package restore operations failed
The www.nuget.org home page makes HTTP requests to a /stats/totals endpoint that performs a query to get the home page statistics to show the number of packages and downloads. The request was configured to be cached, but the cache wasn’t behaving as we expected. This resulted in a SQL query for each request. Under heavy load, these requests backed up and the queries became locked on each other.
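To show why a working cache matters so much for an endpoint like this, here is a minimal time-based cache sketch. It is not the gallery’s ASP.NET output cache; the class name, the 60-second lifetime, and the totals returned by the fake query are all made up for illustration.

```python
import time


class TtlCache:
    """Caches one computed value and recomputes it only after it expires."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Only this branch reaches the expensive query.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value


# Stand-in for the expensive SQL query; each call is recorded.
query_calls = []


def query_totals():
    query_calls.append(1)
    return {"packages": 100, "downloads": 5000}  # made-up numbers


# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TtlCache(query_totals, ttl_seconds=60, clock=lambda: fake_now[0])

for _ in range(1000):
    cache.get()
calls_within_ttl = len(query_calls)

fake_now[0] = 61.0
cache.get()
calls_after_expiry = len(query_calls)
```

With the cache working, a thousand page loads cost one query. With the cache silently bypassed, the same traffic costs a thousand queries, which is the failure mode described above.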
Additionally, all of our web server instances were maintaining their own copies of our Lucene search index. On a schedule, the servers would all query the database to update the index as needed. These queries are expensive and as we scaled out to more instances under load, the queries were running frequently. With so many expensive queries running, these queries started timing out and causing the search indexes to become corrupt.
In order to reduce the load on our SQL database, we have implemented two changes that are now deployed to www.nuget.org.
- We have completely disabled the home page query (instead returning static numbers for the time-being).
- We have deployed our new Search Service which was planned to be released next week.
New Search Service
At 8:00pm PDT on April 3rd, we deployed a significant update to www.nuget.org that changes the search implementation to use a dedicated Search Service.
We had planned to deploy this new search service next week, but the downtime we encountered April 2nd-3rd changed our plans. The goals of the new Search Service include reducing load on our SQL Azure database and moving our Lucene search index out of the web servers and into Azure Blob Storage. The Search Service runs independently, reading the index from blob storage, and the index is now maintained by back-end processes rather than on the web servers themselves. We’re also able to direct more queries to the search service than we could handle with our previous index, relying less on SQL.
The root cause analysis of our downtime uncovered that the SQL Azure load was the primary culprit, with the Lucene index updates being a significant contributor. This new Search Service allows us to control the SQL load from our backend processes rather than being tied to web traffic.
We will publish a detailed blog post next week, but here’s what to expect from the new Search Service:
- Search relevance has been completely overhauled. We now boost search results by “recent installs” (the last 6 weeks of direct installs/updates) and we have drastically improved text analysis of package metadata.
- Sort Options have been removed from the website. Sorting by recent installs now produces the best results. The sort options remain in Visual Studio and are still respected by our API.
- The dropdown for “Include Prerelease” and “Stable Only” has been removed from the website. As Prerelease packages have become more popular, this feature tended to cause confusion rather than provide benefit. The dropdown is still in use in Visual Studio.
- The search box on the website is now much bigger, promoted into the header, and more user-friendly. Given search is the primary use of the site, this change was long overdue!
Please let us know what feedback you have on the new Search Service.
Downtime Timeline and Details
Here is our timeline of the downtime. All times are Pacific Daylight Time.
- 11:30pm - Some users started reporting incomplete search results late in the night on April 1st. This appeared to just be our search index’s eventual consistency (although we now know that wasn’t the case).
- 12:00am to 4:00am (April 2nd) - The index shrank all the way down to 0 packages during this time window.
- 4:00am - The index’s state of 0 packages was affecting the website’s Packages page, search on the website and in Visual Studio, and the Visual Studio feeds: everything reported 0 packages. This triggered our automated alerts, which began paging us.
- 4:45am - The interruption was confirmed, investigation had begun, and the initial notification was posted to Twitter.
- 5:30am - Some background services had been shut down to reduce load on the system. We also found that package download statistics had not been getting automatically purged after migration into the warehouse, so a manual purge had begun.
- 7:00am - Azure Support was engaged and they began helping us identify blocked queries and execute some SQL maintenance scripts to recalculate DB statistics and rebuild indexes.
- 8:40am - We deployed an update to www.nuget.org that removed an unnecessary HTTP request to /stats/totals from every page. This request had been identified as triggering a SQL query that was causing locks.
- 9:00am - All SQL indexes finished rebuilding and all DB statistics were recalculated. The package download statistics purge was over 33% complete.
- 10:30am - We found that our Output Cache was not working for the /stats/totals request, and every single user hitting the www.nuget.org home page was resulting in SQL queries to calculate the totals.
- 10:45am - We deployed another update to www.nuget.org that hard-coded the values for the /stats/totals request (that serves the home page numbers) instead of querying SQL.
- 11:00am - The load on our SQL database was drastically reduced and connectivity issues went away. We rebuilt our search indexes successfully and the site was fully functional again.
- 12:20pm - The package download statistics purge completed and all background services were resumed.
- 11:55pm - There were reports that our indexes were shrinking again. We successfully rebuilt them manually.
- 4:00am (April 3rd) - Our monitoring alerted us that the search index was corrupt again.
- 4:30am - Our indexes were manually rebuilt successfully.
- 7:30am - We were under heavy load again and our web servers were scaling out.
- 7:45am - Some of our new web server instances failed to build their search indexes, manual rebuilds failed, and other web server instances’ indexes became corrupt.
- 8:00am - Many DB queries were failing; we engaged Azure support.
- 9:00am - Diagnosis determined that the queries running to rebuild and update the Lucene indexes were significantly contributing to the SQL load. We began the work to finalize the Search Service deployment that was planned for next week.
- 10:45am - Traffic dropped off, the web servers scaled back down, search indexes were successfully rebuilt.
- 11:00am - We began preparing a backup strategy deployment that would still use our web server based search indexes, but would disable all automatic updates to the index (limiting the index updates to manual rebuilds only).
- 11:00am-8:00pm - The website ran steadily all day under normal load. We continued our work and testing on the new Search Service while also validating our backup strategy deployment in case the Search Service deployment didn’t go smoothly (we had been planning for another week of testing).
- 8:00pm - We deployed the new Search Service and the updated www.nuget.org front-end that uses the Search Service. We also deployed our backup strategy deployment into the site’s staging slot in case we needed to switch to it.
The deployment we completed tonight should very significantly reduce the load on SQL by:
- Not performing queries to get the home page statistics for every page load.
- Running a single search index update process in the back-end instead of one on each web server.
- Sending more requests into our search service instead of directing them to SQL.
Work Items Discovered
While working through this incident, we identified a handful of work items that we will tackle immediately.
- Determine why the ASP.NET caching isn’t working for the /stats/totals request
- Replace the /stats/totals query with a background service that calculates the totals on a schedule and stores the results in blob storage to be served from the gallery rather than ever calculating the numbers directly
- Determine why Package Restore was sporadically failing under the database load – as this was thought to be immune from database exceptions
- Schedule more DB maintenance tasks through our background services to rebuild indexes and recalculate DB statistics
- Fix the bug leading to our package statistics not being purged after being replicated into the warehouse database
- Complete the work for having a read-only mirror up and running for when our primary site is down. This work was already ongoing, but we are increasing the priority of it to have it finished very soon.
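The second work item above, moving the totals calculation out of the request path, might look something like the following sketch. A plain dict stands in for blob storage, and the function names, the blob name, and the numbers are hypothetical.

```python
import json

BLOB_NAME = "stats-totals.json"  # hypothetical blob name


def publish_totals(run_query, blob_store):
    """Background job: run the expensive query once and store the result."""
    totals = run_query()
    blob_store[BLOB_NAME] = json.dumps(totals, sort_keys=True)


def read_totals(blob_store):
    """Web request path: read the stored snapshot, never the database."""
    return json.loads(blob_store[BLOB_NAME])


# Stand-in for the SQL query; each call is recorded.
query_calls = []


def run_query():
    query_calls.append(1)
    return {"packages": 100, "downloads": 5000}  # made-up numbers


blob_store = {}
publish_totals(run_query, blob_store)  # would run on a schedule

# Any number of page loads are served from the stored snapshot.
responses = [read_totals(blob_store) for _ in range(500)]
```

The design choice here is that database load from this feature depends only on the schedule of the background job, not on how much traffic the home page receives.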
We appreciate that while many of you reported the service interruption on Twitter using either @nuget or #nuget, you were kind and polite about it. We know that there are countless developers around the world who rely on nuget.org being up and running all day every day, and we’re sorry that we let you down. We are working to improve our reliability.
Thank you for your support,
Jeff Handley and the entire NuGet team
The NuGet team released an updated NuGet Package Manager extension for WebMatrix on March 26, 2014. This update can be installed from the WebMatrix Extension Gallery using the following steps:
- Open WebMatrix 3
- Click the Extensions icon in the Home ribbon
- Select the Updates tab
- Click to update NuGet Package Manager to 2.6.1
- Close and restart WebMatrix 3
Here are the salient points from the release notes.
This extension update addresses two of the biggest issues users have faced consuming NuGet packages within WebMatrix. The first was a NuGet schema version error and the second was a bug leading to zero-byte DLLs in the bin folder.
This latest release provides compatibility with the newest NuGet packages, preventing the schema error from occurring. New versions of packages including Microsoft.AspNet.WebPages can now be installed in WebMatrix. Some of these packages were using NuGet features such as XDT config transforms, which weren’t supported in WebMatrix until now.
Other Recent Improvements
When NuGet Package Manager 2.8 was released for Visual Studio, we also released NuGet Package Manager 2.5.0 for WebMatrix. While this was mentioned in the NuGet 2.8 Release Notes, we didn’t mention the specific new features that update introduced.
Those improvements include the ability to update all NuGet packages in your web site together, as well as getting a prompt to overwrite existing files when installing NuGet packages.
Because we extracted the NuGet functionality out of WebMatrix 3 into an extension contributed to the NuGet open-source project, we’ve been able to make these important updates without requiring an update to WebMatrix itself. If you have other issues using NuGet within WebMatrix, let us know by filing issues on CodePlex so that we can address them.