Python: how to scrape data from any URL

Author
April 06, 2022

In this article I will explain how you can scrape HTML data from any URL, such as an Amazon or Flipkart page, using BeautifulSoup.

Step 1: Install the Python libraries needed for scraping data

  • BeautifulSoup: Our primary module. It parses the HTML of a web page and provides methods to search it and extract elements (it does not fetch the page itself)
pip install bs4
  • Requests: It sends the HTTP request to the URL and returns the response
pip install requests
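
To confirm that both installs worked, a quick check like the one below can help. It is only a sketch: it imports the two packages and prints their version strings. Note that on PyPI the bs4 package is a thin wrapper that pulls in beautifulsoup4, so the module you import is still called bs4.

```python
# Quick sanity check that both libraries are importable
import bs4
import requests

# Each package exposes its version as a string
print("beautifulsoup4", bs4.__version__)
print("requests", requests.__version__)
```

If either import fails with ModuleNotFoundError, the pip command was most likely run for a different Python interpreter than the one running the script.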

Step 2: Add this code wherever you want to scrape data using BeautifulSoup

  • We make an HTTP request using the requests library

    r = requests.get(url, timeout=10)
  • Then we parse the content as HTML using BeautifulSoup. You can change the parser to lxml as well (this requires installing the lxml package)

    soup = BeautifulSoup(r.content, 'html.parser')
  • Then we extract data by selecting tags, for example every img tag

    images = soup.select('img')
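
The parse and extract steps can be tried without any network access. The sketch below is only an illustration: sample_html is made-up markup standing in for r.content, and extract_images is a hypothetical helper name, not something provided by BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for r.content from a real page
sample_html = """
<html><body>
  <img src="/img/phone.jpg" alt="Phone">
  <img src="https://example.com/img/laptop.png">
</body></html>
"""

def extract_images(html):
    """Return the src and alt attributes of every img tag in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    # .get() returns None when the attribute is missing, instead of raising
    return [{"src": img.get('src'), "alt": img.get('alt')}
            for img in soup.select('img')]

images = extract_images(sample_html)
for image in images:
    print(image)
```

The second image has no alt attribute, so its "alt" value comes back as None. Using image.get('alt') rather than image['alt'] is what avoids a KeyError in that case.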

The complete script:

import requests
from bs4 import BeautifulSoup

# Set this to the page you want to scrape
url = ''

try:
    r = requests.get(url, timeout=10)
    # Raise HTTPError for 4xx/5xx responses before parsing anything
    r.raise_for_status()
    # Print the raw response body (bytes)
    print(r.content)

    # You can change the parser from 'html.parser' to 'lxml'
    soup = BeautifulSoup(r.content, 'html.parser')

    # Extract the src and alt attributes of every image on the page
    images_list = []
    for image in soup.select('img'):
        src = image.get('src')
        alt = image.get('alt')
        images_list.append({"src": src, "alt": alt})

    for image in images_list:
        print(image)

    # Extract the html tag with all of its child elements
    # and save it to a file in the local directory
    html_parts = [str(tag) for tag in soup.select('html')]
    with open('scrap.html', 'w', encoding='UTF-8') as f:
        f.write('\n'.join(html_parts))
except requests.exceptions.HTTPError as errh:
    print("Http Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Oops, something else:", err)
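
The order of the except clauses matters because every requests exception derives from RequestException, so the general clause has to come last. The sketch below mirrors that order with isinstance checks; describe_error is a hypothetical helper written for this illustration, and the exceptions are created by hand so that nothing is sent over the network.

```python
import requests

def describe_error(exc):
    """Label an exception using the same order as the except clauses above."""
    if isinstance(exc, requests.exceptions.HTTPError):
        return "Http Error"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "Error Connecting"
    if isinstance(exc, requests.exceptions.Timeout):
        return "Timeout Error"
    if isinstance(exc, requests.exceptions.RequestException):
        return "Something Else"
    return "Not a requests error"

# Exceptions built by hand, purely to show which branch each one reaches
for exc in (
    requests.exceptions.HTTPError(),
    requests.exceptions.ConnectTimeout(),
    requests.exceptions.ReadTimeout(),
    requests.exceptions.MissingSchema(),
):
    print(type(exc).__name__, "->", describe_error(exc))
```

ConnectTimeout inherits from both ConnectionError and Timeout, so with this ordering a timeout while connecting is reported as "Error Connecting", while a ReadTimeout reaches the Timeout branch. MissingSchema, which requests raises for a URL without http:// or https://, falls through to the general branch.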