Working with Azure Search in C#.NET


What is Azure Search

Azure Search is a service in the Microsoft Azure cloud platform that provides indexing and querying capabilities. It is intended to give developers sophisticated search capabilities for mobile and web development while hiding infrastructure requirements and search algorithm complexities.

I like the idea of having an index service provider as part of my solution, as it allows developers to perform searches quickly and effortlessly without going through the pain of writing essay-length SQL queries.

The key capabilities of Azure Search include scalable full-text search over multiple languages, geo-spatial search, filtering and faceted navigation, type-ahead queries, hit highlighting, and custom analyzers.


Why Azure Search (or an index provider in general)

  1. It’s Brilliant – Having an index provider sitting between the application and the database is a great way to shield the database from unnecessary or inefficient Read requests. This allows database I/O and computing power to be reserved for operations that truly matter.
  2. It’s Modern – The traditional search approach requires developers to write essay-length SQL queries to retrieve simple aggregated data. An index provider such as Azure Search allows developers to issue a search request without writing a complicated search algorithm (for example, geo-spatial search) and without complex SQL queries (for example, faceted navigation).
  3. It Scales – The solutions that benefit the most from an index provider are relatively large enterprise systems, and such systems often require scaling to a certain extent. Scaling Azure Search in the Microsoft Azure cloud platform is several clicks away, compared to running an on-premises service provider such as Solr.
  4. It’s FREE – Microsoft Azure provides a free tier for Azure Search. Developers can create and delete Azure Search services for development and self-learning purposes.

Use Case

A car classified site is a good example of a system that can make use of an index service provider such as Azure Search. I will use a local car classified site, Carlist.my, to illustrate several features such a system can potentially tap into.

Type-ahead Queries allow developers to implement auto-suggestions as the user types. In the following example, as I was typing “merce”, the site returned a list of potential Mercedes-Benz car models that I might be interested in.

[Screenshot: type-ahead suggestions on Carlist.my]

Facets allow developers to retrieve aggregated data (such as car make, model, type, location, price range, etc.) without writing complex SQL queries, which also saves Read load on the database. In the following example, the site returns a list of Mercedes-Benz models with a count in brackets indicating how many classified ads are available for each model.

[Screenshot: facet counts for Mercedes-Benz models]

Filters allow developers to retrieve documents that match the search criteria without writing complex SQL queries full of endless INNER JOIN and WHERE clauses. In the following example, I specified that I want all used Mercedes-Benz cars, model E-Class E200, variant Avantgarde, from Kuala Lumpur, within the price range of RM100,000 to RM250,000. You can imagine the kind of INNER JOIN and WHERE clauses the poor developer would have to construct dynamically if this were retrieved from a database directly.

[Screenshot: filtered search results]

Another feature the car classified site could potentially tap into is Geo-spatial Search, although it does not appear to be implemented. For example, if I were searching for a specific car at a dealership, the portal could suggest similar cars from other dealerships near the one I’m looking at. That way, when I make a trip to visit a dealership, I can also visit nearby dealerships that have similar cars.

Using C#.NET to work with Azure Search

Let’s roll up our sleeves and write some code. I will be using a C#.NET console application to illustrate how we can design and create an index, upload documents into the index, and perform several types of searches on it. This solution simulates some of the code a car classified portal would potentially need.

First, we create a solution named AzureSearchSandbox.

We will need the “Microsoft.Azure.Search” NuGet package from NuGet.org. Open your Package Manager Console and run the following command:
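```powershell
# Package name as given above; no version specified, so NuGet installs the latest available.
Install-Package Microsoft.Azure.Search
```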

Upon successful installation, you will see several NuGet packages added to the packages.config file in your solution.

Note that you only need to install “Microsoft.Azure.Search”; the other packages are dependencies and are resolved automatically.

In your solution, add a new class Car.cs

This Car object will be used to represent your Azure Search document.
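The original Car.cs is not reproduced in this post, so here is a minimal sketch of what it could look like, assuming only the fields implied by the searches later on (make and model for text search, Category for filtering and faceting, Price for filtering). Your actual class may carry more fields such as location, variant, or mileage.

```csharp
// A sketch of Car.cs. The field names here are assumptions that the later
// Helper/Searcher sketches rely on; adjust them to your own schema.
public class Car
{
    public string ID { get; set; }       // document key; Azure Search keys are strings
    public string Make { get; set; }     // e.g. "Perodua", "Mercedes-Benz"
    public string Model { get; set; }
    public string Category { get; set; } // e.g. "Hatchback", "Sedan"
    public int Price { get; set; }
}
```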

Next, we create a static Helper class in Helper.cs to take care of initializing the index in Azure Search.

Initialize() is the main method that kick-starts our Azure Search index. Unlike on-premises index services such as Solr that require a certain amount of setup, it doesn’t take long to get our index up and running. With Solr, I had to install Solr using NuGet, install the right Java version, set the environment variable, and finally create a core in the Solr admin portal. With Azure Search, no upfront setup is required.

The index is created in the CreateIndex() method, where we tell the Azure Search client SDK that we want an index with the fields we define in our Index object.

To ensure this code runs against a fresh index, the DeleteIfIndexExist() method removes any previous index. We call it right before CreateIndex().
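Since the original Helper.cs is not shown here, the sketch below illustrates one way Initialize(), DeleteIfIndexExist() and CreateIndex() could fit together using the Microsoft.Azure.Search SDK. The index name "car" and the field definitions are assumptions that mirror the Car sketch above.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public static class Helper
{
    private const string IndexName = "car"; // assumed index name

    public static ISearchIndexClient Initialize(string serviceName, string apiKey)
    {
        var serviceClient = new SearchServiceClient(serviceName, new SearchCredentials(apiKey));

        DeleteIfIndexExist(serviceClient); // start from a fresh index every run
        CreateIndex(serviceClient);

        // Return a client scoped to the index, used later for upload and search.
        return serviceClient.Indexes.GetClient(IndexName);
    }

    private static void DeleteIfIndexExist(ISearchServiceClient serviceClient)
    {
        if (serviceClient.Indexes.Exists(IndexName))
        {
            serviceClient.Indexes.Delete(IndexName);
        }
    }

    private static void CreateIndex(ISearchServiceClient serviceClient)
    {
        var definition = new Index
        {
            Name = IndexName,
            Fields = new List<Field>
            {
                new Field("ID", DataType.String)       { IsKey = true },
                new Field("Make", DataType.String)     { IsSearchable = true },
                new Field("Model", DataType.String)    { IsSearchable = true },
                new Field("Category", DataType.String) { IsFilterable = true, IsFacetable = true },
                new Field("Price", DataType.Int32)     { IsFilterable = true }
            }
        };

        serviceClient.Indexes.Create(definition);
    }
}
```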

Next, we add a new class Uploader.cs to deal with the documents we are about to upload into our Azure Search index.

PrepareDocuments() is a simple method that constructs a list of dummy Car objects for our searches later on.

The Upload() method gets the dummy Car objects from PrepareDocuments() and passes them to the Azure Search client SDK to upload the documents into the index as a batch. Note that we added a 2000-millisecond sleep to allow the service to upload and process the car documents properly before moving on to the next part of the code, which is search. In practice, however, we would not want a sleep in our upload code; instead, the component that handles searching should expect that the index may not be available immediately. We also catch IndexBatchException to handle documents whose batch upload failed. In this example we merely output the failed keys; in practice, we should retry or at least log the failed documents.
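Here is a sketch of what Uploader.cs could look like. The dummy data in PrepareDocuments() is made up for illustration; the real solution has its own list of cars.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public static class Uploader
{
    private static List<Car> PrepareDocuments()
    {
        // Dummy cars; invented for this sketch.
        return new List<Car>
        {
            new Car { ID = "1", Make = "Perodua",       Model = "Myvi",  Category = "Hatchback", Price = 45000 },
            new Car { ID = "2", Make = "Perodua",       Model = "Axia",  Category = "Hatchback", Price = 35000 },
            new Car { ID = "3", Make = "Mercedes-Benz", Model = "E200",  Category = "Sedan",     Price = 250000 },
            new Car { ID = "4", Make = "Toyota",        Model = "Camry", Category = "Sedan",     Price = 150000 }
        };
    }

    public static void Upload(ISearchIndexClient indexClient)
    {
        var batch = IndexBatch.Upload(PrepareDocuments());

        try
        {
            indexClient.Documents.Index(batch);
        }
        catch (IndexBatchException e)
        {
            // A batch can partially fail; here we merely print the keys of the
            // failed documents. In practice, retry or at least log them.
            Console.WriteLine("Failed to index documents: {0}",
                string.Join(", ", e.IndexingResults.Where(r => !r.Succeeded).Select(r => r.Key)));
        }

        // Give the service a moment to process the batch before we search.
        // Do not do this in production; let the search side tolerate the lag instead.
        Thread.Sleep(2000);
    }
}
```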

Once the index upload operation is completed, we will add another class Searcher.cs to take care of the searching capability.

The SearchDocuments() method handles searching against the index we created earlier. There is no fancy algorithm here; we simply pass specific instructions to the Azure Search client SDK about what we are looking for and display the results. In this method we cover simple text search, filters, and facets. The Azure Search client SDK offers many more capabilities; feel free to explore SearchParameters and the response objects on your own.
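Below is a sketch of how SearchDocuments() could look. It simply forwards the search text and SearchParameters to the SDK and prints whatever documents and facets come back; the original implementation may format the output differently.

```csharp
using System;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public static class Searcher
{
    public static void SearchDocuments(ISearchIndexClient indexClient, string searchText, SearchParameters parameters = null)
    {
        var results = indexClient.Documents.Search<Car>(searchText, parameters ?? new SearchParameters());

        // Print the matching documents.
        foreach (var result in results.Results)
        {
            var car = result.Document;
            Console.WriteLine("{0} {1} ({2}) - {3}", car.Make, car.Model, car.Category, car.Price);
        }

        // Print facets, if any were requested via SearchParameters.Facets.
        if (results.Facets != null)
        {
            foreach (var facet in results.Facets)
            {
                Console.WriteLine(facet.Key);
                foreach (var value in facet.Value)
                {
                    Console.WriteLine("  {0} ({1})", value.Value, value.Count);
                }
            }
        }
    }
}
```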

Putting them all together in Program.cs

First, we define the index service name and API key to create a search index client instance. The instance is returned by Helper.Initialize(), and we will use it for both upload and search later.

After initializing the index, we call the Upload() method to upload some dummy car documents to the index.
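A Program.cs along these lines is sketched below. The service name and API key are placeholders; replace them with your own Azure Search service details.

```csharp
using Microsoft.Azure.Search;

public class Program
{
    public static void Main(string[] args)
    {
        const string serviceName = "your-search-service-name"; // placeholder
        const string apiKey = "your-admin-api-key";             // placeholder

        // Create (or recreate) the index and get a client scoped to it.
        ISearchIndexClient indexClient = Helper.Initialize(serviceName, apiKey);

        // Upload the dummy car documents.
        Uploader.Upload(indexClient);

        // The searches described below go here.
    }
}
```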

Next, we perform the following searches:

  1. Simple text search. We will search for the text “Perodua” in the documents.
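Using the Searcher sketch above, the call inside Main could look like this:

```csharp
// 1. Simple text search: look for "Perodua" across the searchable fields.
Searcher.SearchDocuments(indexClient, "Perodua");
```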

The result is as follows. The Azure Search index returns 2 documents containing the keyword “Perodua”.

[Screenshot: search results for “Perodua”]

2. Using a filter. A more targeted and efficient approach to looking up documents. In the following example, we first look for documents whose Category field equals ‘Hatchback’; we then look for documents whose Price field is greater than 100,000 and whose category is ‘Sedan’. See the Azure Search documentation for more details on the filter expression syntax.
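With the same Searcher sketch, the two filters could be expressed in Azure Search’s OData filter syntax like this (field names follow the assumed Car schema, and Program.cs needs a using for Microsoft.Azure.Search.Models):

```csharp
// 2a. Cars in the Hatchback category.
Searcher.SearchDocuments(indexClient, "*", new SearchParameters
{
    Filter = "Category eq 'Hatchback'"
});

// 2b. Sedans priced above 100,000.
Searcher.SearchDocuments(indexClient, "*", new SearchParameters
{
    Filter = "Price gt 100000 and Category eq 'Sedan'"
});
```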

The result is as follows: cars in the Hatchback category, and cars that cost more than 100,000 and are in the Sedan category.

[Screenshot: filter query results]

3. Searching with facets. With facets, developers no longer need to write long queries that combine Count() and endless GROUP BY clauses.

If we had a traditional database table representing the dummy car data, this would be equivalent to “SELECT Category, Count(*) FROM Car GROUP BY Category”.
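Requesting the facet through the same sketch could look like this (it also assumes the usual System.Collections.Generic using):

```csharp
// 3. Facet on Category; each value comes back with a document count.
Searcher.SearchDocuments(indexClient, "*", new SearchParameters
{
    Facets = new List<string> { "Category" }
});
```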

The result is as follows:

[Screenshot: facet query results]

This example might not look like a big deal if you have a small data set or a simple data structure. Facets become super handy and fast when you have a large amount of data and your queries get more complex. The ability to define which facets are required in C# code makes the “query” much cleaner and easier to maintain.

One last thought…

You can clone the above source code from my GitHub repo. I will be happy to accept pull requests if you want to demonstrate how to make use of other capabilities of the Azure Search client SDK. Remember to change the API key and service name before compiling. Have fun hacking Azure Search!

Scaling SQL Server


Scaling a database refers to its ability to serve significantly more requests, both reads and writes, without compromising performance. In many enterprise applications the performance bottleneck often sits at the database, hence scaling the database is a critical part of improving system performance. In the last article on Microservices, we discussed scaling a database horizontally and vertically at a high level. In this article we will go more in-depth on:

  1. The three types of replication in SQL Server
  2. Distributing database load using log shipping
  3. Tools to shield your database from being hit

3 Types of Replication

In typical enterprise applications, Read requests significantly outnumber Write requests. By implementing replication, you effectively offload the bulk of the Read requests to the Subscribers while reserving the Publisher for Writes.

Transactional Replication

Transactional Replication is the simplest form of replication to understand and to implement. It is implemented by having a Publisher publish the changes and one or more Subscribers replay the transaction log. Data changes and schema modifications made at the Publisher are delivered to the Subscriber(s) as they occur (almost in real time). In this way, transactional consistency is guaranteed.

The incremental changes at the Publisher are propagated to Subscribers as they occur. If a row changes 3 times at the Publisher, it also changes 3 times at the Subscriber; it is not just the net data change that gets propagated over.

For example, if a row in the Product table changes price three times, from $1.00 to $1.10, to $1.20, and finally to $1.30, transactional replication allows an application to respond to each change, perhaps sending a notification to the user when the price hits $1.20. It is not simply the net change of the row from $1.00 to $1.30. This is ideal for applications that require access to intermediate data states, for example a stock market price alert application that tracks near real-time stock price changes to send price alerts to users.

Is it possible to scale your Publisher horizontally? Yes, Bidirectional Transactional Replication and Peer-to-Peer Transactional Replication will help you achieve it. However, Microsoft strongly recommends that write operations for each row be performed at only one node, for two reasons. First, if a row is modified at more than one node, it can cause a conflict or even a lost update when the row is propagated to other nodes. Second, there is always some latency involved when changes are replicated. For applications that require the latest change to be seen immediately, dynamically load balancing the application across multiple nodes can be problematic.

From experience, the optimal solution is to scale your Subscribers horizontally by adding nodes. Keep your Publisher on one node and scale the Publisher vertically, when you really have to.

[Diagram: Transactional Replication]

 

Merge Replication

In Merge Replication, the Subscriber synchronizes with the Publisher when connected to the network and exchanges all rows that have changed between the Publisher and Subscriber since the last synchronization occurred. You can think of Merge Replication as a batch update from Subscriber to Publisher that propagates only the net data changes. For example, if a row changes five times at a Subscriber before it synchronizes with the Publisher, the row changes only once at the Publisher to reflect the net data change (the 5th value). The unified changes at the Publisher are then propagated back to the other Subscribers.

Merge Replication is suitable for situations where Subscribers need to receive data, make changes offline, and later synchronize the changes with the Publisher and other Subscribers. An example is a nationwide POS (point of sale) system where retail branches are spread across multiple physical locations. The retail branches first initialize a snapshot from the Publisher database, then make local offline changes (for example, through sales) to the Subscriber database. The sales do not need to be propagated back to the Publisher immediately. The other retail branches also do not need an immediate update on the changes that happened at another branch, although a more recent update would be beneficial, for example knowing whether a nearby branch has stock that the local branch has run out of, so staff can point customers in the right direction. Once a day or multiple times a day, depending on business needs, the sales numbers at the retail branches (Subscribers) are propagated back to HQ (Publisher), and the executives in the HQ office can view the daily sales report.

[Diagram: Merge Replication]

Conflicts can happen in Merge Replication, and they will. The good news is that conflicts are resolved without user intervention because SQL Server has a built-in mechanism for resolving conflicts on data changes. However, if you have unique use cases where you want to verify that SQL Server is doing exactly what you intended when resolving a conflict, you can view the conflicts in the Microsoft Replication Conflict Viewer, and the outcome of the resolution can be modified.

Snapshot Replication

Snapshot Replication propagates data exactly as it appears at a specific moment in time and does not monitor for updates to the data. When synchronization occurs, the entire snapshot is generated and sent to the Subscribers. In simpler terms, Snapshot Replication takes a snapshot of the Publisher’s data state and overwrites it at the Subscribers. No conflicts can happen because this replication simply overwrites the whole data set. Snapshot Replication is also used in both Transactional Replication and Merge Replication to initialize the database at the Subscribers.

[Diagram: Snapshot Replication]

Snapshot Replication is suitable for systems where

  • Subscribers deal with data that does not change frequently.
  • Subscribers do not require the most recent set of data for long periods.
  • The data set is small.

I was working on a small project with an advertising agency where the requirement was to analyze the trend of discussion and sentiment related to telcos in a community forum. I quickly hacked together some web-scraping code to scrape what I needed and populated it into my database. As a result, I got 730 discussion topics and a total database size of about 5MB. From there I needed to write more algorithms to work out the trend and sentiment of the discussions. During the development of my algorithm, I did not need the most up-to-date dataset reflecting what was happening in the live forum, so I happily worked (read) on a Subscriber node to develop and test my algorithm. A few weeks later, when the forum had a lot more discussion topics, I simply replicated the changes from the Publisher to my Subscriber in a fairly short amount of time.

In production, knowing that the analytical code reads the database extensively, I pointed it at the Subscriber node. My Publisher node does not get any hits apart from the necessary insert operations from my web-scraping service. Since my users do not need to know what is being discussed on a daily or real-time basis (because accurate trend and sentiment analysis requires months’ worth of data), I configured my Snapshot Replication to happen on a monthly basis, and the replication completes in a fairly short amount of time. Every month the users get a fresh trend and sentiment report based on at least a year’s worth of backdated forum discussion data without any performance degradation. By implementing Snapshot Replication, I let my web-scraping service write to the Publisher without worrying whether anyone needs the database for report generation at any particular time. Through the Subscriber, I have also laid the foundation for generating much more sophisticated reports without degrading performance, simply by spinning up more Subscriber nodes when and if I need to.

Log Shipping

Log Shipping is used to automatically send transaction log backups from a primary database to one or more secondary databases. The primary and secondary databases should sit on different nodes, and the transaction log backups are applied to each of the secondary databases. Log Shipping is often used as part of a disaster recovery strategy, but creative database administrators often use it for various other purposes *wink*.

In one of the projects I worked on, the SQL Server database stored 130k registered users and their related activities such as payment history, credit spending history, Account & Contact relationships, products, login audits, etc. The company was at a rapid expansion stage, and the CFO decided it was time to bring in a Business Intelligence guy to churn out some reports that would give a sense of how the business was doing on a daily basis. The obvious thing to do here was to replicate a database for the BI colleague to run his heavy queries on, because running the reporting queries on the production database was going to kill the poor database. The most suitable type of replication would have been Transactional Replication. However, the challenge was that Replication requires SQL Server Enterprise Edition, and we were running SQL Server Standard Edition. The bad news was that the company did not have a lot of cash lying around at our disposal, so we had to find an alternative. After getting the green light from the CTO, I implemented Log Shipping for BI reporting. In essence, I was “scaling” SQL Server using a disaster recovery technique: offloading the reporting query load to a secondary database by replaying the transaction log on an interval to simulate Transactional Replication. It was the most practical option we had to satisfy the various stakeholders while keeping the cost low.

[Diagram: Log Shipping]

You can use this trick as long as your client does not require real-time data. Take note of the following practical issues while “scaling” your SQL Server using Log Shipping.

  1. Understand that Log Shipping is a 3-step process. First, back up the transaction log at the primary server instance. Second, copy the transaction log file to the secondary server instance. Third, restore the log backup on the secondary server instance. From experience, the third step, Restore, is the most fragile and breaks often. To find out why a Restore failed, go to SQL Server Agent and view the job history for the detailed error message.
  2. Understand your transaction log backup cycle. If another agent or service is also automating Transaction Log backups, your Log Shipping will stop working fairly quickly (note: not immediately). Log Shipping works by taking all the transaction log activity since the last backup and then clearing the log, and SQL Server must keep the log sequence matched. If another Transaction Log backup happens somewhere else, the tail of your newest Transaction Log will not match the head of the last Transaction Log, and the restore will fail. If you need to use Log Shipping, disable or stop all other Transaction Log backups.
  3. Move your agent job interval up gradually. Note that the Backup, Copy, and Restore jobs are run by SQL Server Agent on an interval. When you set up Log Shipping, once the initial full database backup has been restored at the secondary server, your Transaction Log is ready for action. When the Backup, Copy, and Restore jobs kick in depends on the interval you set while configuring Log Shipping. I recommend setting a very short interval in the beginning so that you can spot failures in your setup quickly. You do not want to wait 6 hours for the Backup, Copy, and Restore jobs to kick in only to find out they failed, make some changes, and then wait another 6 hours. I always start with 1 minute, then 5 minutes, and then the actual time frame the business requires. In my case, it was 2 hours.
  4. Enable Standby mode so that your clients can read the data. I highly recommend checking “Disconnect users in the database when restoring backups.” The SQL Server Agent Restore job handles the Transaction Log intelligently by replaying any log that was missed previously. However, as mentioned earlier, the Restore job is the most fragile step, and sometimes when it breaks you have to set up the whole log shipping mechanism again. Imagine a database of 300GB (the actual size I was working with); it is pretty painful to wait for the whole process to complete. Hence, to ensure the integrity of the Transaction Log sequence, I would rather terminate all open connections so that my Restore step can execute successfully.
  5. Monitor your Log Shipping continuity. Again, Log Shipping is pretty fragile, especially if this is your first time doing it. You need some mechanism to monitor your Log Shipping to ensure it is still running as expected 6 months down the road. You can open SQL Server Agent to view the job history and make sure the jobs are all green, or configure an Alert Job to raise an alert when a job does not complete successfully. Personally, nothing is more assuring than knowing the data actually changed in the secondary database during a specific time frame. What I do is monitor a table column that is supposed to change after the Transaction Log is replayed. In my case, I monitor the latest user login time because I know this table is updated fairly frequently; the probability that no one logged in during the last 2 hours is close to zero, so I make sure the value in that column changes every 2 hours. If you do not have a user login audit, you can use any table whose data is very likely to change, for example the CreatedOn column of your busiest transaction table. A minimal version of such a check is sketched after this list.
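As a rough illustration of point 5, here is a minimal, hypothetical check written in C# with plain ADO.NET. The connection string, table, and column names are made up; point it at the secondary (standby) database, schedule it to run slightly after each expected restore (every 2 hours in my case), and assume the audit column stores UTC timestamps.

```csharp
using System;
using System.Data.SqlClient;

public static class LogShippingMonitor
{
    public static void CheckLastLogin(string secondaryConnectionString)
    {
        using (var connection = new SqlConnection(secondaryConnectionString))
        {
            connection.Open();

            // Latest login time that has been replayed into the secondary database.
            // Table and column names are hypothetical.
            var command = new SqlCommand("SELECT MAX(LastLoginOn) FROM dbo.UserLoginAudit", connection);
            var result = command.ExecuteScalar();

            if (result == null || result == DBNull.Value)
            {
                Console.WriteLine("ALERT: could not read the last login time from the secondary database.");
                return;
            }

            var lastLogin = (DateTime)result;

            // If nothing has changed for longer than the restore interval,
            // the log shipping chain has probably broken; raise an alert.
            if (DateTime.UtcNow - lastLogin > TimeSpan.FromHours(2))
            {
                Console.WriteLine("ALERT: secondary looks stale; last login replayed at {0:u}", lastLogin);
            }
        }
    }
}
```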

Shielding your database

Shielding your database is a great way to keep your database load low so that you can serve more requests. But if you shield the database, where does the data come from? The data has to come from somewhere, and that somewhere is known as an index provider. I’m not referring to building more indexes within SQL Server, because indexes served from SQL Server still add load to SQL Server, not to mention additional storage on disk. The index providers I’m referring to are fast search services such as Solr, Elasticsearch, Azure Search, or even Redis.

This approach drastically offloads reads from the database to another service that is built for super fast searching. Not only is the response time much faster, you also save the poor programmers from writing essay-length queries to retrieve data.

Example 1: You have a clean database in third normal form. With SQL Profiler you observe that clients have been issuing long queries, such as a query with 15 JOINs and 5 GROUP BYs on very large tables. Not only does the query result come back slowly, it also drags down the server’s performance, and other queries are badly affected. You have asked your best SQL query guru to review the queries, studied the execution plans, and revisited the data structure, and the conclusion is that this is really what it takes to get the desired result. So, what do you do now?

An index provider comes to the rescue when you design your schema around your query. The joins are no longer required because the index flattens the defined fields during indexing. The grouping is also no longer required because that information comes from facets. Now, instead of the JOINs and GROUP BYs happening at the SQL Server level, you get the data from the index lightning fast, as a simple read, provided you have designed your index schema properly.

[Diagram: index provider serving flattened, faceted data in front of the database]

Example 2: You are working with 2 microservices, each of which has its own database keeping different domain data. Your application requires data from both services, and both of them respond to you slowly. Again, after reviewing the requests with the respective microservice owners, you all come to the conclusion that both the code and the SQL statements are already in their most optimal form. What do you do now?

An index provider comes to the rescue when you index the data from both microservices into one index provider and query the index provider instead. Not only do you bypass the two slow microservices, you are also querying one index instead of multiple databases. Of course, this is easier said than done, because it now requires the two microservices to index the required data, plus additional effort to maintain the index provider.

[Diagram: index provider aggregating data from two microservices]

Here are some practical tips you can consider while implementing an index provider to shield your database from getting hit:

  1. Initial indexing will take a lot of time, especially when the data comes from various sources. This is one of the reasons why some deployments take a long time. A trick to overcome this is to not rely on the on-the-fly indexing API calls provided by the index provider: export your data as .csv and import it for indexing. Do it manually if your data is large enough; automate the process if full reindexing happens on a recurring basis.
  2. Minimize your index schema changes. Every schema change requires you to rebuild your indexes, which usually takes time and leads to longer downtime. The key when designing your schema is to think holistically about the current and potential use cases instead of designing a schema for just one client and one use case. At the same time, you do not want to design an overly generic schema, or it will be inefficient. The art is in finding this balance, and the way to do it is to understand your domain and context well before designing your schema.
  3. Ensure your stakeholders are aware that the index is a reflection of your database. Your database remains the single source of truth. Expect delay in your index; eventual consistency is what you aim to achieve. Often it is acceptable to have a delay of a few seconds or even minutes, depending on how critical the data is. Ensure you get acknowledgement from your stakeholders.
  4. Queue your indexing. A simple phone number update in the database could result in hundreds of reindexing requests. Have a queueing mechanism to protect your index provider from sudden surges of reindexing requests (a minimal sketch follows this list).
  5. Take note of strongly-typed fields. An index provider such as Solr takes everything as a string, while an index provider such as Azure Search has strongly-typed fields (Edm.Boolean, Edm.String, Edm.Int32, etc.). If you plan to switch index providers in the future, take care of the data types from day one, or you will end up with an additional mapping layer to deal with them later on.
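As a rough illustration of tip 4, here is a minimal, hypothetical sketch of a reindex queue in C#. The IIndexProvider interface, batch size, and flush timeout are assumptions; in a real system you might push to a message broker instead of an in-process queue.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Abstraction over whichever index provider you use; hypothetical for this sketch.
public interface IIndexProvider
{
    void UploadBatch(IReadOnlyCollection<string> documentKeys);
}

public class ReindexQueue
{
    private readonly BlockingCollection<string> _pendingKeys = new BlockingCollection<string>();
    private readonly IIndexProvider _indexProvider;
    private const int BatchSize = 100;

    public ReindexQueue(IIndexProvider indexProvider)
    {
        _indexProvider = indexProvider;
        Task.Run(() => ProcessQueue()); // background consumer
    }

    // Called whenever a row changes; cheap and non-blocking for the caller.
    public void Enqueue(string documentKey) => _pendingKeys.Add(documentKey);

    private void ProcessQueue()
    {
        var batch = new List<string>();

        // Flush either when the batch is full or when no new key arrives for a second,
        // so a surge of changes turns into a handful of batched reindex calls.
        while (true)
        {
            if (_pendingKeys.TryTake(out var key, 1000))
            {
                batch.Add(key);
            }

            if (batch.Count >= BatchSize || (batch.Count > 0 && _pendingKeys.Count == 0))
            {
                _indexProvider.UploadBatch(batch.ToArray());
                batch.Clear();
            }
        }
    }
}
```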

Hope these help you in your journey to scale your SQL Server. Have fun!