January 29, 2021

Coding14 - Scraping data from websites with Python

In this article, we will learn how to use Python libraries to build a scraping tool that gets the data we need from websites. Web scraping is an automated process for collecting data from the Internet: instead of spending a lot of time retrieving data manually, users extract the required data through code, then arrange and query it to find the information they expect. Messy information is processed into a structured form, or turned into a visualization.

Python has very simple web scraping libraries that make it easy to clean, store and analyze data.

Part1: Using ‘requests’ to load web pages

requests allows us to send HTTP requests using Python.

The HTTP request returns a response object with all the response data. The following is an example of obtaining the HTML of the page:

import requests

res = requests.get('https://www.bilibili.com')

print(res.text)
print(res.status_code)

In the code above, the content of the webpage is obtained through requests.

Next, store the response text in txt and the status code in status.

Then use print to print them both:

import requests

res = requests.get('https://www.bilibili.com')

txt = res.text
status = res.status_code

print(txt, status)
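In practice, it is worth checking that the request succeeded before using the response. A minimal sketch (not part of the original example) using requests' built-in raise_for_status, which raises an exception for 4xx/5xx responses:

import requests

res = requests.get('https://www.bilibili.com')

# Raise requests.HTTPError if the server returned a 4xx/5xx status.
res.raise_for_status()

# If we get here, the request succeeded.
print(len(res.text), 'characters received')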

Part2: Using BeautifulSoup to extract the title

We mainly use BeautifulSoup for web scraping; it has many features that make it a powerful solution:

  1. It provides many simple methods and Pythonic idioms for navigating, searching and modifying the DOM tree, so applications do not need much code.

  2. Beautiful Soup sits on top of popular Python parsers (such as lxml and html5lib), letting you try different parsing strategies or trade speed for flexibility, as sketched below.
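To illustrate point 2, here is a small sketch parsing the same markup with different parsers (assuming lxml and html5lib have been installed, e.g. via pip; 'html.parser' ships with Python):

from bs4 import BeautifulSoup

html = '<p>Hello<p>World'  # deliberately sloppy markup

# Each parser repairs broken HTML slightly differently.
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(html, parser)
    print(parser, '->', soup.find_all('p'))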

Let’s look at a simple example of BeautifulSoup:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.bilibili.com')
soup = BeautifulSoup(page.content, 'html.parser')
page_title = soup.title.text

We use requests to fetch the URL, and store the title of the page in a variable named page_title through BeautifulSoup.

You can see that once we parse page.content with BeautifulSoup, we can use the parsed DOM tree in a very Pythonic way:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

page_title = soup.title.text

print(page_title)

Part3: <head> and <body>

Similarly, we can easily extract other parts of the web page. Note that we must call .text to get the string. Try running the following code:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

page_title = soup.title.text

page_body = soup.body
page_head = soup.head

print(page_body, page_head)

Use requests to grab the web page information.

Store the title information in page_title, the body information in page_body, and the head information in page_head.

When we print page_body and page_head, they are printed as strings, but in fact, when we check the type with print(type(page_body)), we see that they are not strings.
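A quick check (using the variables from the example above) shows they are bs4 Tag objects, which is why we needed .text earlier to get plain strings:

print(type(page_body))   # <class 'bs4.element.Tag'>
print(type(page_head))   # <class 'bs4.element.Tag'>
print(type(page_title))  # <class 'str'> - .text already gave us a string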

Part4: Using BeautifulSoup to select

Now let’s take a look at how to use BeautifulSoup’s methods to select DOM elements.

With the soup variable, we can call .select on it. This method accepts CSS selectors, so we can access the DOM tree the same way we would select elements with CSS:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

first_h1 = soup.select('h1')[0].text

.select returns a Python list of all matching elements, so we index with [0] to take the first one.
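As a side note (not used in the original examples), BeautifulSoup also offers .select_one, which returns the first match directly, or None if nothing matches, so the [0] indexing can be avoided:

first_h1_tag = soup.select_one('h1')

# Guard against pages that have no <h1> at all.
if first_h1_tag is not None:
    print(first_h1_tag.text)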

Create a variable all_h1_tags and set it to an empty list.

Use .select to select all the <h1> tags and append their text to the list, then create another variable that stores the text of the seventh <p> element.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

all_h1_tags = []

for element in soup.select('h1'):
    all_h1_tags.append(element.text)

seventh_p_text = soup.select('p')[6].text

print(all_h1_tags, seventh_p_text)

Part5: Grabbing important information

Use .select to extract each element's title, use .select again to extract its review label, and create a new dictionary in the format:

info = {
    "title": 'Asus AsusPro Adv... '.strip(),
    "review": '2 reviews\n\n\n'.strip()
}

Here, we use the strip method to remove all extra line breaks and spaces that may be included in the output.
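A quick illustration of what .strip does here (plain standard-library behavior):

raw = '2 reviews\n\n\n'

# str.strip removes leading and trailing whitespace, including newlines.
print(repr(raw.strip()))  # '2 reviews'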

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

top_items = []

products = soup.select('div.info-box')
for elem in products:
    title = elem.select('div.info > p.title')[0].text
    review_label = elem.select('p.play')[0].text
    info = {
        "title": title.strip(),
        "review": review_label.strip()
    }
    top_items.append(info)

print(top_items)

First, we select all the div.info-box elements, which gives us a list of individual products, and then iterate through them. Because .select can also be called on each element, and because we are calling it on an element inside div.info-box, the selector div.info > p.title only returns results from within that element. We take the 0th element of the list, extract its text, strip the extra whitespace, and append the result to the list.
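A tiny sketch of this scoping behaviour (assuming the soup object from above): calling .select on a Tag searches only inside that Tag, not the whole document:

boxes = soup.select('div.info-box')

if boxes:
    first_box = boxes[0]
    # This only finds p.title elements nested inside first_box.
    titles_in_box = first_box.select('div.info > p.title')
    print(len(titles_in_box))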

Part6: Extracting attributes and links

Now that we know how to extract the text (innerText) of an element, let's try to extract attributes such as image sources, and then the links in the page. Here is an example that collects the src and alt of every image:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

image_data = []

images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    image_data.append({"src": src, "alt": alt})

print(image_data)

Create a list named all_links and store each link's information in it as a dictionary of the form:

info = {
    "href": "<link here>",
    "text": "<link text here>"
}

Then collect every link on the page:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

all_links = []

links = soup.select('a')
for ahref in links:
    text = ahref.text
    text = text.strip() if text is not None else ''

    href = ahref.get('href')
    href = href.strip() if href is not None else ''
    all_links.append({"href": href, "text": text})

print(all_links)
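One caveat (an optional refinement, not in the original): href values are often relative, e.g. /video/... or protocol-relative //... paths. The standard library's urllib.parse.urljoin can resolve them against the page URL:

from urllib.parse import urljoin

base_url = 'https://www.bilibili.com'

# Resolve each scraped href against the page it came from.
absolute_links = [urljoin(base_url, link['href']) for link in all_links]
print(absolute_links[:5])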

Part7: Generating CSV from data

We can generate a CSV from our data. Start with a template that loops over each element box and writes the collected rows to a file:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('div.info-box')
for product in products:
    # Extract the fields here and append a dict to all_products.
    print("Work on product here")

keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

Let’s take a look at how this works in an actual case:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get("https://www.bilibili.com")
soup = BeautifulSoup(page.content, 'html.parser')

all_videos = []

videos = soup.select('div.info-box')
for video in videos:
    title = video.select('div.info > p.title')[0].text.strip()
    up_name = video.select('p.up')[0].text.strip()
    play_num = video.select('p.play')[0].text.strip()
    cover_img = video.select('img')[0].get('src')

    all_videos.append({
        "title": title,
        "upName": up_name,
        "playNum": play_num,
        "coverImg": cover_img
    })

keys = all_videos[0].keys()

with open('videos.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_videos)

After execution, we will get a CSV file!
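To verify the output, a quick sketch that reads the file back with csv.DictReader (standard library):

import csv

with open('videos.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['title'], row['playNum'])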

Throughout this article, we have learned how to grab text, tags, and links with Python, collect the results into a list, and export them as a CSV file. Now you can use these methods to do more.

About this Post

This post is written by Siqi Shu, licensed under CC BY-NC 4.0.