What is a News Aggregator?
It is a web application that collects data (news articles) from multiple websites and presents it all in one location.
As we all know, there are tons of news sites online, and they publish their content across multiple platforms. Imagine opening 20–30 websites daily just to read the news, and how much time you would waste gathering that information.
A news aggregator makes this task easier: you select the websites you want to follow, and the app collects the desired articles for you.
Requirements/Prerequisites
You should have a basic understanding of the frameworks/libraries given below:
- Django Framework
- BeautifulSoup
- requests module
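If you don't have them installed yet, all three are available from PyPI:
#shell
pip install django beautifulsoup4 requests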

Setup
Set up the basic Django project with the following command:
#shell
django-admin startproject NewsAggregator
Then navigate to the project folder and create the app:
#shell
python manage.py startapp news
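Also register the new app in the project settings so Django picks up its models and templates:
#NewsAggregator/settings.py
INSTALLED_APPS = [
    # ... Django's default apps ...
    'news',
]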
We also want to store the articles in the database, so create the model inside the models.py file:
#news/models.py
from django.db import models

class Headline(models.Model):
    title = models.CharField(max_length=200)
    image = models.URLField(null=True, blank=True)
    url = models.TextField()

    def __str__(self):
        return self.title
We store three things: the title, image, and URL of each article. Also, make sure the image field has blank and null set to True, because some articles come without images.
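Since we just added a model, create and apply the migrations before moving on:
#shell
python manage.py makemigrations news
python manage.py migrate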
Now, let's start with the steps for the web crawler.
Step 1: Scraping
To scrape the website, we will use the BeautifulSoup library and the requests module. Open your views.py and start writing the code as follows:
#news/views.py
# basic imports
import requests
from django.shortcuts import render, redirect
from bs4 import BeautifulSoup as BSoup
from news.models import Headline
Now create a function news_scrape() for scraping the articles:
def news_scrape(request):
    session = requests.Session()
    # Identify ourselves as Googlebot; see the note on User-Agent below
    session.headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
    url = "https://www.theonion.com/"
    # verify=False skips SSL certificate verification
    content = session.get(url, verify=False).content
    soup = BSoup(content, "html.parser")
    # Each article teaser on the homepage lives in a <div> of this class
    News = soup.find_all('div', {"class": "curation-module__item"})
    for article in News:
        main = article.find_all('a')[0]
        link = main['href']
        # Pick one image URL out of the responsive srcset attribute
        image_src = str(main.find('img')['srcset']).split(" ")[-4]
        title = main['title']
        news_headline = Headline()
        news_headline.title = title
        news_headline.url = link
        news_headline.image = image_src
        news_headline.save()
    return redirect("./")
The news_scrape() view requests the page, parses it, and saves each article from theonion.com.
The session headers are sent with every request the function makes, so the scraper looks like a normal HTTP client to the news site. The User-Agent key is important here: this HTTP header tells the server information about the client. We identify ourselves as Googlebot, so whenever our client requests anything, the server sees the request as coming from a Google bot. You can also configure it to look like a browser User-Agent.
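For example, a browser-style User-Agent could look like this (the exact string is only an illustration; any current browser string will do):
#python
session.headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}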
The News variable holds every <div> of a particular class, which we selected by inspecting the webpage; each of these <div> elements contains all three things we need (title, image, and URL). We read the link from main['href'] and the title from main['title'], and then store each record as a Headline object in the database.
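The image line deserves a closer look: srcset holds several image URLs paired with width descriptors, and split(" ")[-4] simply picks the entry sitting fourth from the end. A minimal sketch with a made-up srcset value shows the idea (the real attribute on the site may be laid out differently, in which case the index needs adjusting):
#python
# Hypothetical srcset value, for illustration only
srcset = "https://example.com/a-320.jpg 320w, https://example.com/a-640.jpg 640w"
parts = srcset.split(" ")
# parts == ['https://example.com/a-320.jpg', '320w,', 'https://example.com/a-640.jpg', '640w']
print(parts[-4])  # https://example.com/a-320.jpg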
Now we have to show this data to our client. Follow these steps to achieve this.
Show the stored database objects
- create an article() method in views.py to show the data
#news/views.py
def article(request):
    headlines = Headline.objects.all()
    context = {
        'headlines': headlines,
    }
    return render(request, "news/index.html", context)
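These views also need URL routes, which the snippets above leave out. A minimal sketch, assuming you mount the app at the site root (the route paths and names here are my assumptions, not part of the original code):
#news/urls.py
from django.urls import path
from news import views

urlpatterns = [
    path('', views.article, name='home'),
    path('scrape/', views.news_scrape, name='scrape'),
]

#NewsAggregator/urls.py
from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('news.urls')),
]
Note that with this layout, the redirect("./") at the end of news_scrape resolves back to /scrape/ itself; redirect("../") would send the user back to the article list instead.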
Now simply use this context variable to access the data in the HTML template. Note that image is a plain URLField, so the template outputs {{ headline.image }} directly.
#news/templates/news/index.html
.....
<div class='container'>
  {% for headline in headlines %}
    <p>{{ headline.title }}</p>
    <img src="{{ headline.image }}">
    <a href="{{ headline.url }}">Read Full Article</a>
  {% endfor %}
</div>
.....
Run the server and you are good to go. Style the webpage as you want.
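As a reminder, the development server starts with:
#shell
python manage.py runserver
Visit the scrape route once (e.g. /scrape/, if you used the URL layout sketched above) to fill the database; the homepage will then list the stored headlines.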
Cheers!!
Happy Coding!!
Stay Safe!!!