News Analyzer App Project
This independent project applied big data software architecture principles and machine learning techniques to analyze recent tech industry news articles, automatically extract common themes, and group the articles by topic. The primary aim was to help readers quickly identify current trends in media reporting on the tech industry and focus on the topics of most interest to them.
Architecture
The software architecture consisted of three microservices that interact via a message queue broker, as illustrated in the diagram below. First (starting on the left side of the diagram), a data collector microservice collects news article data daily from an external internet source (newsapi.org), stores the data in a database, and publishes it to a message queue. Next, upon receiving the data from the message queue, a data analyzer microservice stores it in a database, applies Latent Dirichlet Allocation (LDA) to discover common topics, and publishes the results to another message queue. Finally, a web microservice receives the data, stores it in a database, and presents the articles sorted by topic to the end user via web pages and a REST API service.
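For illustration, the topic-discovery step in the data analyzer might look like the following sketch, which fits Spark MLlib's DataFrame-based LDA from Kotlin via Spark's Java API; the column names, topic count, and iteration count are illustrative assumptions rather than the project's actual settings.

```kotlin
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

// Sketch only: discover topics in a list of article texts with Spark MLlib's LDA.
fun discoverTopics(articleTexts: List<String>) {
    val spark = SparkSession.builder().master("local[*]").appName("news-analyzer-lda").getOrCreate()
    val articles = spark.createDataset(articleTexts, Encoders.STRING()).toDF("text")

    // Tokenize the raw text and build term-count vectors as LDA input features.
    val tokenized = RegexTokenizer().setInputCol("text").setOutputCol("tokens").transform(articles)
    val vectorized = CountVectorizer().setInputCol("tokens").setOutputCol("features")
        .fit(tokenized).transform(tokenized)

    // Fit LDA with an assumed number of topics; each topic is a distribution over terms.
    val model = LDA().setK(5).setMaxIter(20).fit(vectorized)
    model.describeTopics(10).show(false)

    spark.stop()
}
```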

Design
The project design was guided by current best practices in software development and big data architecture. Through the use of containerized pods with delivery-confirmed message queues and data persistence, service interruptions to the end user are minimized and the system remains robust to temporary partitions between the services. Because the collected data is well-structured, relational databases are used to efficiently store, process, and retrieve the data. In addition, test doubles and mock external services were used to implement efficient unit and integration tests in an automated continuous integration and deployment workflow. Furthermore, online metrics and visualizations permit real-time monitoring of system performance.
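As a sketch of the delivery-confirmation approach, publishing to the queue with the RabbitMQ Java client might look like the following; the queue name, hostname, and timeout are assumptions for illustration only.

```kotlin
import com.rabbitmq.client.ConnectionFactory
import com.rabbitmq.client.MessageProperties

// Sketch only: publish a JSON payload to a durable queue with publisher confirms,
// so a message counts as sent only after the broker acknowledges it.
fun publishWithConfirmation(payloadJson: String) {
    val factory = ConnectionFactory().apply { host = "rabbitmq" }  // assumed hostname
    factory.newConnection().use { connection ->
        connection.createChannel().use { channel ->
            // Durable queue and persistent messages so data survives broker restarts.
            channel.queueDeclare("articles", true, false, false, null)
            channel.confirmSelect()
            channel.basicPublish(
                "", "articles",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                payloadJson.toByteArray()
            )
            channel.waitForConfirmsOrDie(5_000)  // throws if the broker does not confirm in time
        }
    }
}
```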
Tech Stack
The following technology tools were used to implement the project (a sketch of the corresponding Gradle dependency declarations follows the list):
- Ubuntu (22.04.4): Operating system
- Kotlin (1.9.22): Programming language
- Java Virtual Machine (17.0.10): Compilation and libraries
- Gradle (8.7): Build tool
- Ktor (2.3.8): Kotlin application framework
- Netty (4.1.106): Web server
- Apache Freemarker (2.3.32): Dynamic webpage templating
- PostgreSQL (16.2): Relational database
- Exposed (0.48.0): Object relational mapping framework
- HikariCP (5.1.0): Database connection pooling
- Apache Spark (3.3.2): Data analytics
- Kotlin for Apache Spark (1.2.4): Kotlin-Spark compatibility API
- RabbitMQ (Java client 5.21.0): Messaging broker
- JUnit (4.13.2): Testing
- Kover (0.7.5): Test code coverage measurement
- Micrometer (1.6.8): Application metrics interface
- Prometheus (2.51.2): Performance metrics and monitoring storage
- Grafana (10.4.2): Performance metrics visualization
- Docker Engine (25.0.3): Containerization
- Kubernetes (1.30.0): Deployment container orchestrator
- Kompose (1.33.0): Docker Compose to Kubernetes conversion tool
- Helm (3.14.4): Kubernetes package manager
- GitHub: Version control, continuous integration and deployment
- Google Kubernetes Engine (1.28.8): Cloud computing service
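For illustration, a partial sketch of how some of these libraries might be declared in the Gradle build script is shown below; versions are taken from the list above, and the exact artifact coordinates are assumptions rather than the project's actual build file.

```kotlin
// build.gradle.kts (partial sketch; coordinates assumed, versions from the list above)
dependencies {
    implementation("io.ktor:ktor-server-core:2.3.8")
    implementation("io.ktor:ktor-server-netty:2.3.8")
    implementation("org.jetbrains.exposed:exposed-core:0.48.0")
    implementation("org.jetbrains.exposed:exposed-jdbc:0.48.0")
    implementation("com.zaxxer:HikariCP:5.1.0")
    implementation("com.rabbitmq:amqp-client:5.21.0")
    implementation("io.micrometer:micrometer-registry-prometheus:1.6.8")
    testImplementation("junit:junit:4.13.2")
}
```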
Testing
Gradle was used to implement unit and integration tests, and these tests were incorporated into the continuous integration/continuous deployment workflow. Using test doubles and mock external services, the unit tests check each element of the system (e.g., database operations, message queue, data processing), and the integration tests check that these elements function together at the app level as expected: that the data can be reliably (1) collected, (2) stored in the collector database, (3) transferred to the data analyzer, (4) processed with unsupervised machine learning (LDA), (5) stored in the analyzer database, (6) passed to the web server, (7) stored in the web-server database, and (8) displayed to the end user in reverse chronological order grouped by topic.
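As a rough sketch of the test-double approach, a unit test for the collector's queue-forwarding step might look like the following; the interface and class names are illustrative and not the project's actual code.

```kotlin
import org.junit.Assert.assertEquals
import org.junit.Test

// Hypothetical publisher interface and in-memory test double standing in for the real queue.
interface ArticlePublisher {
    fun publish(articleJson: String)
}

class InMemoryPublisher : ArticlePublisher {
    val published = mutableListOf<String>()
    override fun publish(articleJson: String) {
        published += articleJson
    }
}

class CollectorTest {
    @Test
    fun `collector forwards each fetched article to the queue`() {
        val publisher = InMemoryPublisher()
        val articles = listOf("""{"title":"a"}""", """{"title":"b"}""")

        // Stand-in for the collector's forwarding step, exercised without a real broker.
        articles.forEach(publisher::publish)

        assertEquals(articles, publisher.published)
    }
}
```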
Production
The project was deployed on Google Kubernetes Engine using Helm.

REST API
HATEOAS principles were applied to facilitate hypermedia-driven discovery of the API's endpoints, allowing clients to navigate the API by following links in responses rather than relying on hard-coded URLs.
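A minimal sketch of what a hypermedia-aware response might look like in Ktor is shown below, assuming JSON content negotiation is installed; the resource shape, route path, and link names are illustrative only.

```kotlin
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.serialization.Serializable

// Sketch only: each topic resource carries links to itself and to related endpoints,
// so clients can navigate the API by following hypermedia rather than hard-coding URLs.
@Serializable
data class TopicResource(val id: Int, val name: String, val links: Map<String, String>)

fun Route.topicRoutes() {
    get("/api/topics/{id}") {
        val id = call.parameters["id"]?.toIntOrNull()
            ?: return@get call.respond(HttpStatusCode.BadRequest)
        call.respond(
            TopicResource(
                id = id,
                name = "example-topic",  // placeholder value
                links = mapOf(
                    "self" to "/api/topics/$id",
                    "articles" to "/api/topics/$id/articles",
                    "topics" to "/api/topics"
                )
            )
        )
    }
}
```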

Monitoring
Production monitoring was accomplished by scraping application metrics with Prometheus and visualizing them with Google Cloud Monitoring.
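A minimal sketch of exposing metrics to Prometheus from a Ktor application via Micrometer might look like the following; the endpoint path and module wiring are assumptions rather than the project's actual configuration.

```kotlin
import io.ktor.server.application.*
import io.ktor.server.metrics.micrometer.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

// Sketch only: register a Prometheus meter registry and expose a scrape endpoint.
fun Application.configureMonitoring() {
    val prometheusRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)
    install(MicrometerMetrics) {
        registry = prometheusRegistry
    }
    routing {
        // Prometheus scrapes this endpoint for JVM, HTTP, and application metrics.
        get("/metrics") {
            call.respondText(prometheusRegistry.scrape())
        }
    }
}
```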
