In this article, we will learn how to use Python libraries to build a scraping tool that gets the data we need from websites. Web scraping is an automated process of collecting data from the Internet: instead of spending a lot of time retrieving data manually, we extract the required data with code, then arrange and query it to find the information we expect. Messy information is processed into a structured form, or turned into a visualization.
Python has very simple web scraping libraries, which make it easy to clean, store, and analyze data.
Part1: Using `requests` to load web pages
`requests` allows us to send HTTP requests from Python. An HTTP request returns a response object with all the response data. The following is an example of obtaining the HTML of a page:
```python
import requests
```
In the code above, we obtained the content of the web page through `requests`. Now store the response text in a variable named `txt`, store the `status_code` in `status`, and then use `print` to print them both.
```python
import requests
```
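The steps described above can be sketched in full as follows; the URL here is just a placeholder, not the site from the original example:

```python
import requests

# Fetch the page (placeholder URL; substitute the site you want to scrape)
res = requests.get("https://example.com/")

txt = res.text            # the HTML of the page, as a string
status = res.status_code  # e.g. 200 on success

print(status)
print(txt[:100])  # print only the first 100 characters of the HTML
```

If the request succeeds, `status` will be `200` and `txt` will contain the page's HTML.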
Part2: Using BeautifulSoup to extract the title
We mainly use `BeautifulSoup` for web scraping; it has many features that make it a powerful solution:

- It provides many simple methods and Pythonic idioms for navigating, searching, and modifying the DOM tree, so applications don't need much code.
- Beautiful Soup sits on top of popular Python parsers (such as lxml and html5lib), letting you try different parsing strategies or trade speed for flexibility.
Let’s look at a simple example of BeautifulSoup:
```python
from bs4 import BeautifulSoup
```
We use `requests` to fetch the URL and, through BeautifulSoup, store the title of the page in a variable named `page_title`. Once we pass `page.content` to BeautifulSoup, we can work with the parsed DOM tree in a very Pythonic way:
```python
import requests
```
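A self-contained sketch of the same idea; to keep it runnable without a live site, the HTML is inlined here as a stand-in for the `page.content` you would get from `requests`:

```python
from bs4 import BeautifulSoup

# Stand-in for page.content from a requests.get() call
html = "<html><head><title>My Shop</title></head><body><h1>Hello</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")

# soup.title is the <title> tag; .text gives the string inside it
page_title = soup.title.text
print(page_title)
```

With a real site, you would build `soup` from `BeautifulSoup(page.content, "html.parser")` instead of the inlined string.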
Part3: `<head>` and `<body>`
Similarly, we can easily extract other parts of the page. Notice that we must call `.text` to get the string. Try running the following code:
```python
import requests
```
We use `requests` to grab the page information, then store the title in `page_title`, the body in `page_body`, and the head in `page_head`. When we print `page_body` and `page_head`, they look like strings; in fact, if we check with `print(type(page_body))`, we can see they are not strings but BeautifulSoup `Tag` objects.
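To see this for yourself, here is a sketch with inlined HTML standing in for a fetched page: printing `page_body` shows markup, but its type is a BeautifulSoup `Tag`, not `str`:

```python
import bs4
from bs4 import BeautifulSoup

# Stand-in for page.content from a requests.get() call
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.text  # a plain string, because of .text
page_body = soup.body         # a Tag object
page_head = soup.head         # a Tag object

print(page_body)        # looks like a string when printed
print(type(page_body))  # <class 'bs4.element.Tag'>
```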
Part4: Using BeautifulSoup to select
Now let's look at how to select DOM elements with BeautifulSoup's methods. With the `soup` variable, we can use `.select`, which works like a CSS selector: we can access the DOM tree the same way we select elements with CSS:
```python
import requests
```
`.select` returns a Python list of all matching elements, so we can index `[0]` to pick the first element of the selection.
Create a variable `all_h1_tags` and set it to an empty list. Use `.select` to select the list of all `<h1>` tags, then create a variable to store the seventh element.
```python
import requests
```
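The steps above can be sketched like this; the HTML is generated inline so the example runs without a live site:

```python
from bs4 import BeautifulSoup

# Inlined HTML with eight <h1> tags, standing in for a fetched page
html = "".join(f"<h1>Heading {i}</h1>" for i in range(1, 9))
soup = BeautifulSoup(html, "html.parser")

all_h1_tags = []
for h1 in soup.select("h1"):
    all_h1_tags.append(h1.text)

# .select returns a list, so index [6] is the seventh element
seventh_h1 = soup.select("h1")[6].text

print(all_h1_tags)
print(seventh_h1)
```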
Part5: Grabbing important information
Use `.select` to extract the title and the product title of each element, and create a new dictionary in this format:
```python
info = {
    # ...
}
```
Here, we use the `strip` method to remove the extra line breaks and spaces that may appear in the output.
```python
import requests
```
First, we select all the `div.info-box` elements, which gives us a list of individual products, and then iterate through them. Because `select` can be called again on each element, and we have already narrowed the scope to `div.info-box`, the selector `div.info-box > p.title` only returns results within that element. We take the 0th element of the list, extract its text, strip the extra spaces, and append it to the list.
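The loop described above can be sketched as follows; the product markup is inlined here as an assumed stand-in for the real page, reusing the `div.info-box` and `p.title` selectors from the text:

```python
from bs4 import BeautifulSoup

# Inlined HTML mimicking the product markup described above
html = """
<div class="info-box"><p class="title">  First product
</p></div>
<div class="info-box"><p class="title">Second product  </p></div>
"""
soup = BeautifulSoup(html, "html.parser")

all_titles = []
for box in soup.select("div.info-box"):
    # Calling .select on `box` scopes the query to this product only
    title = box.select("p.title")[0].text
    all_titles.append(title.strip())  # strip() drops extra spaces and newlines

print(all_titles)
```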
Part6: Extracting links
Now that we know how to extract the text of an element, let's try extracting the links on the page. Here is an example:
```python
import requests
```
Create a variable named `all_links` and store the information for every link in it:
```python
info = {
    # ...
}
```
```python
import requests
```
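A runnable sketch of link extraction, again with the HTML inlined as a stand-in for a fetched page; tag attributes such as `href` are read like dictionary keys:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page containing two links
html = '<a href="https://example.com/a">Link A</a><a href="https://example.com/b">Link B</a>'
soup = BeautifulSoup(html, "html.parser")

all_links = []
for link in soup.select("a"):
    info = {
        "text": link.text.strip(),
        "href": link.get("href"),  # read the href attribute of the tag
    }
    all_links.append(info)

print(all_links)
```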
Part7: Generating CSV from data
We can generate a CSV template from a single element box:
```python
import requests
```
Let's take a look at a complete working example:
```python
import requests
```
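Putting the pieces together, here is a sketch that scrapes the product data and writes it to a CSV file with the standard library's `csv.DictWriter`; the markup is inlined as an assumed stand-in for the real page, and `products.csv` is a placeholder filename:

```python
import csv
from bs4 import BeautifulSoup

# Inlined HTML standing in for a fetched product page
html = """
<div class="info-box"><p class="title">First product</p><a href="/a">Buy</a></div>
<div class="info-box"><p class="title">Second product</p><a href="/b">Buy</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for box in soup.select("div.info-box"):
    rows.append({
        "title": box.select("p.title")[0].text.strip(),
        "link": box.select("a")[0].get("href"),
    })

# Write the collected rows to a CSV file, with a header row
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)
```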
After execution, we get a CSV file!
Throughout this article, we have learned how to grab text, tags, and links with Python, collect the results into lists, and export them as a CSV file. Now you can use these methods to do more.
About this Post
This post is written by Siqi Shu, licensed under CC BY-NC 4.0.