In this article I will explain how to scrape HTML data from any URL (for example, Amazon or Flipkart) using BeautifulSoup.
Step 1: Install the Python libraries we will use for scraping:
- BeautifulSoup: our primary module; it parses HTML and gives us methods to search and extract elements from the page
pip install bs4
- Requests: sends the HTTP request to the URL and returns the response
pip install requests
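Once both packages are installed, a quick sanity check in a Python shell confirms they import correctly (the exact version numbers printed will depend on your environment):

```python
# Verify that both libraries are installed and importable
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)
```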
Step 2: Add this code wherever you want to scrape data using BeautifulSoup.
-
First we make an HTTP request using the requests library:
r = requests.get(url, timeout=10)
-
Then we parse the response content as HTML with BeautifulSoup (you can swap the parser for 'lxml' if it is installed):
soup = BeautifulSoup(r.content, 'html.parser')
-
Then we extract data by selecting tags:
images = soup.select('img')
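To see these three steps in isolation before the full script, here is a minimal sketch that parses a small hard-coded HTML string instead of a live URL (the snippet and its tag values are made up purely for illustration):

```python
from bs4 import BeautifulSoup

# A tiny HTML document used only to demonstrate parsing and selection
html = '<html><body><img src="a.png" alt="logo"><a href="/home">Home</a></body></html>'

# 'html.parser' is built in; pass 'lxml' instead if you have it installed
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns all matching tags
for image in soup.select('img'):
    print(image.get('src'), image.get('alt'))  # a.png logo

for link in soup.select('a'):
    print(link.get('href'), link.text)  # /home Home
```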
import requests
from bs4 import BeautifulSoup

url = ''

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raise early if the server returned an error status

    # You can change the parser from 'html.parser' to 'lxml'
    soup = BeautifulSoup(r.content, 'html.parser')

    # Extracting all the images from the page
    images_list = []
    for image in soup.select('img'):
        images_list.append({"src": image.get('src'), "alt": image.get('alt')})

    for image in images_list:
        print(image)

    # Extracting the html tag with all its children elements and
    # saving that markup to a file in our local directory
    tags = [str(tag) for tag in soup.select('html')]
    with open('scrap.html', 'w', encoding='UTF-8') as f:
        f.write(', '.join(tags))

except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Oops, something else went wrong:", err)
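One practical note: sites like Amazon and Flipkart often reject requests that lack a browser-like User-Agent header. A minimal sketch of adding one is shown below; the header value and the example.com URL are placeholders, not specific recommendations:

```python
import requests

# Illustrative browser-like header; many sites block the default
# python-requests User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

url = 'https://example.com'  # replace with the page you want to scrape
try:
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    print(r.status_code)
except requests.exceptions.RequestException as err:
    print("Request failed:", err)
```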