Search engine with DataOps
I wrote up an example from my own experience applying DataOps principles, and I thought sharing it might help others understand them.
DataOps: agility, automation, monitoring, data checks/validation, and infrastructure as code, applied to improve data quality and shorten time to value.
Web crawler to search UI flow: a robust example
I have a web crawler that fetches unstructured but relevant data from web pages. The crawler is developed using agile practices (iteratively improved) and kept under version control, which lets me track changes and roll back when needed. It's deployed to a cloud platform with cloud-init, so the infrastructure is reproducible and scalable. The crawler also has its own test suite to ensure it stays reliable and efficient. It is deployed on demand, but even that is automated, based on a trigger event.
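To make this concrete, here's a rough sketch of what the fetch step of a crawler like this could look like in Python. The URL, helper name, and user agent are illustrative assumptions, not my actual code:

```python
# Minimal crawler fetch sketch (names and settings are illustrative assumptions)
from typing import Optional

import requests

def fetch_page(url: str, timeout: int = 10) -> Optional[str]:
    """Fetch a single page and return its raw HTML, or None on failure."""
    try:
        resp = requests.get(
            url,
            timeout=timeout,
            headers={"User-Agent": "example-crawler/0.1"},
        )
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        # Failures are logged so monitoring can surface problem URLs.
        print(f"fetch failed for {url}: {exc}")
        return None

if __name__ == "__main__":
    html = fetch_page("https://example.com/article")
    if html:
        print(f"fetched {len(html)} bytes")
```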
The web crawler sends the data to a content processor (CP) component, which parses, validates, and transforms the page content into a usable format, in this case JSON. The CP is also iteratively developed with unit tests to keep it accurate and improve its robustness over time.
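A simplified picture of that CP step, assuming BeautifulSoup for parsing (the field names and the validation rule here are placeholders for the real logic):

```python
# Content processor sketch: parse raw HTML, validate, and emit JSON.
# Fields and validation rules are assumptions, not the actual CP.
import json

from bs4 import BeautifulSoup

def process(html: str, url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    record = {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "text": soup.get_text(separator=" ", strip=True),
    }
    # Basic validation: reject pages with no usable text content.
    if not record["text"]:
        raise ValueError(f"no text content extracted from {url}")
    return json.dumps(record)
```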
The CP uses a data extractor component that identifies relevant page data with a special algorithm, pulling the date, author, title, and other information from the page. Based on my monitoring of the extracted data, I continuously improve the extraction algorithm to resolve any issues I find.
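The real extraction algorithm is more involved, but a stripped-down stand-in might pull the title, author, and date from common meta tags like this (the tag names are assumptions):

```python
# Data extractor sketch: read date, author, and title from common meta tags.
# A simplified stand-in for the actual extraction algorithm.
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    def meta(*names):
        # Check both property= and name= attributes for each candidate tag.
        for name in names:
            tag = (soup.find("meta", attrs={"property": name})
                   or soup.find("meta", attrs={"name": name}))
            if tag and tag.get("content"):
                return tag["content"]
        return None

    return {
        "title": meta("og:title")
                 or (soup.title.get_text(strip=True) if soup.title else None),
        "author": meta("author", "article:author"),
        "date": meta("article:published_time", "date"),
    }
```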
Then the CP forwards the extracted info to the indexer component. The indexer is also deployed on a cloud platform, with its infrastructure defined in code using Fabric (an automated server provisioning tool). The Fabric script is version controlled and updated as needed. I monitor the indexer's performance and adjust the server's vertical scaling when necessary. I also monitor whether the data going into the indexer is of adequate quality, and adjust the upstream pipeline components as needed to achieve the results I want.
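For reference, a Fabric fabfile for this kind of setup might look something like the sketch below. The package, paths, and service name are made up for illustration:

```python
# fabfile.py sketch for provisioning and updating the indexer host.
# Paths, package names, and the service name are illustrative assumptions.
from fabric import task

@task
def deploy(c):
    """Install dependencies, upload the indexer, and (re)start its service."""
    c.run("sudo apt-get update && sudo apt-get install -y python3-pip")
    c.run("sudo mkdir -p /opt/indexer")
    c.put("indexer.tar.gz", "/tmp/indexer.tar.gz")
    c.run("sudo tar -xzf /tmp/indexer.tar.gz -C /opt/indexer")
    c.run("sudo systemctl restart indexer")
```

It would be invoked with something like `fab -H indexer.example.com deploy`.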
After the indexing operation is performed, the data used by the indexer is sent to a data warehouse (DW) stored in the cloud (e.g., S3, B2, or R2). The DW is scalable and reproducible because it is backed by a robust third-party vendor service.
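Shipping the indexer's data to object storage can be as simple as a single put call with boto3. The bucket name and key layout below are assumptions, and for B2 or R2 you'd point boto3 at a custom endpoint:

```python
# Sketch: archive the indexer's source data to S3-compatible object storage.
# Bucket name and key layout are assumptions; B2/R2 need endpoint_url set.
import boto3

def archive_to_warehouse(json_payload: str, doc_id: str) -> None:
    s3 = boto3.client("s3")  # add endpoint_url=... for B2 or R2
    s3.put_object(
        Bucket="search-data-warehouse",
        Key=f"indexed/{doc_id}.json",
        Body=json_payload.encode("utf-8"),
        ContentType="application/json",
    )
```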
And finally, the search interface queries and reads the data from the index to display it for the user. This search interface is under version control, is iteratively developed, and uses continuous deployment after each commit. I also monitor its performance and search results to gain insight into the earlier stages of the pipeline, adjusting them to improve result quality.
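A minimal version of that query path could be a small Flask endpoint like this, where `query_index` is a hypothetical stand-in for the actual index client:

```python
# Search interface sketch: a minimal Flask endpoint that queries the index.
# query_index is a hypothetical placeholder, not the real index client.
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_index(terms: str) -> list:
    """Hypothetical placeholder for the actual index lookup."""
    return [{"title": "example result", "url": "https://example.com", "query": terms}]

@app.route("/search")
def search():
    q = request.args.get("q", "")
    return jsonify(results=query_index(q))

if __name__ == "__main__":
    app.run(debug=True)
```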
For me, DataOps has made the pipeline and the consuming application more reliable and scalable, and it has improved the quality of the results for the user.