How to Use Python to Scrape Data From Websites & Save It to Excel (2025 Guide)

This guide is for mid-to-large companies that regularly need to collect data from websites. It shows you how to do that with Python: we’ll scrape data from a page and save it to an Excel file. It’s easy to follow, even without coding experience.

What is Web Scraping?

Web scraping is automated data extraction. It pulls information from websites. This information is then saved in a structured format. Think of it like copying and pasting, but done by a computer program. It’s much faster and more efficient.

Why Use Python for Web Scraping?

Python is a popular programming language. It’s great for web scraping because:

  • It’s Easy to Learn: Python has a simple syntax. It’s relatively easy to read and write.
  • Powerful Libraries: Python has excellent libraries specifically for web scraping.
  • Large Community: Lots of resources and help are available online.

The Tools You’ll Need (Python Libraries)

We’ll use these key Python libraries:

  • requests: Gets the web page’s content. Think of it as downloading the page.
  • BeautifulSoup (from bs4): Parses the HTML. It helps you find the specific data you need.
  • openpyxl: Writes the data to an Excel file.
  • Selenium: Automates a web browser. Use it for websites with dynamic content (JavaScript).
  • Pyppeteer: Another browser automation tool. It’s similar to Selenium but uses a different approach. Good for complex interactions.

Installation:

Open your command prompt or terminal and type:

Bash

pip install requests beautifulsoup4 openpyxl selenium pyppeteer

You’ll also need a web driver for the browser automation tools:

  • Selenium: uses ChromeDriver (or the driver for your browser of choice). With Selenium 4.6 or newer, Selenium Manager can usually download a matching driver for you automatically; otherwise, download the ChromeDriver version that matches your Chrome.
  • Pyppeteer: Installs Chromium automatically on first run.
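
To confirm the setup works, you can run a quick smoke test. This is a minimal sketch, assuming Chrome is installed locally; it opens example.com in headless mode and prints the page title.

Python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # Run Chrome without a visible window

driver = webdriver.Chrome(options=options)  # Selenium Manager locates the driver on 4.6+
try:
    driver.get("https://www.example.com")
    print(driver.title)  # Prints the page title if everything is installed correctly
finally:
    driver.quit()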

Method 1: Scraping Static Websites (using requests and BeautifulSoup)

Static websites display the same content to all users. The content doesn’t change dynamically.

Step 1: Get the Web Page Content

Python
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook

url = "https://www.example.com"  # Replace with the URL you want to scrape
headers = {'User-Agent': 'Mozilla/5.0'} # Mimic a browser
response = requests.get(url, headers=headers)
response.raise_for_status()  # Check for errors
html_content = response.text
  • url: The website address you want to scrape.
  • headers: This makes your request look like it’s coming from a web browser. Many websites block requests without a User-Agent.
  • requests.get(): Downloads the web page.
  • response.raise_for_status(): Checks if the download was successful. If there’s an error (like a 404 Not Found), it will stop the program.
  • response.text: Gets the HTML content of the page.

Step 2: Parse the HTML with BeautifulSoup

Python

soup = BeautifulSoup(html_content, 'html.parser')

  • BeautifulSoup(html_content, 'html.parser'): Creates a BeautifulSoup object. This object lets you easily navigate and search the HTML. 'html.parser' is Python’s built-in HTML parser.

Step 3: Find and Extract the Data

This is where you use BeautifulSoup’s methods to locate the specific data you need.

  • find(): Finds the first matching element.
  • find_all(): Finds all matching elements.
  • .text or .get_text(): Gets the text content of an element.
  • .get('attribute_name'): Gets the value of an HTML attribute (e.g., href for links, src for images).

Examples:

Python
# Find the first paragraph (<p> tag) and get its text:
paragraph_text = soup.find('p').text

# Find all links (<a> tags) and get their URLs:
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print(href)

# Find an element with a specific class:
element = soup.find('div', class_='my-class')

# Find an element with a specific ID:
element = soup.find(id='my-id')

# Find all images and get their source URLs
images = soup.find_all('img')
for image in images:
    src = image.get('src')
    print(src)

# Navigate to sibling tags
next_sibling = soup.find('h2').find_next_sibling()
previous_sibling = soup.find('h2').find_previous_sibling()

# Get all of an element's attributes as a dictionary
attributes = soup.find('a').attrs
  • Inspect Element: Use your browser’s “Inspect” or “Inspect Element” tool. This helps you find the HTML tags and attributes you need to target. Right-click on the data you want on the webpage and select “Inspect” or “Inspect Element”.
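
If you copy a CSS selector from the Inspect tool, you can pass it straight to BeautifulSoup’s select() and select_one() methods. A small sketch; the selector div.product > span.price is a placeholder for whatever your target page actually uses:

Python

# Find all elements matching a CSS selector:
prices = soup.select('div.product > span.price')
for price in prices:
    print(price.get_text(strip=True))

# Find only the first match (returns None if nothing matches):
first_price = soup.select_one('div.product > span.price')
if first_price is not None:
    print(first_price.text)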

Step 4: Store the Data in Excel

Python
wb = Workbook()  # Create a new Excel workbook
ws = wb.active   # Get the active worksheet
ws.title = "Scraped Data"  # Set the sheet title

# Add headers (column names)
ws.append(["Product Name", "Price", "Description"])

# Example data (replace with your actual scraped data)
products = [
    {"name": "Product 1", "price": "$10", "description": "This is product 1."},
    {"name": "Product 2", "price": "$20", "description": "This is product 2."},
]

for product in products:
    ws.append([product['name'], product['price'], product['description']])

wb.save("scraped_data.xlsx")  # Save the Excel file
  • Workbook(): Creates a new Excel workbook.
  • wb.active: Gets the active worksheet (the first sheet).
  • ws.title: Sets the title of the worksheet.
  • ws.append(): Adds a row of data to the worksheet.
  • wb.save(): Saves the workbook to a file.
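
Putting Steps 1–4 together, here is a minimal end-to-end sketch. The URL and the page structure (each product in a <div class="product"> with an <h2> name and a <span class="price">) are assumptions; adjust them to match the site you are actually scraping.

Python

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

url = "https://www.example.com/products"  # Hypothetical URL - replace with your target
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

wb = Workbook()
ws = wb.active
ws.title = "Scraped Data"
ws.append(["Product Name", "Price"])

# Assumed structure: <div class="product"><h2>Name</h2><span class="price">$10</span></div>
for product in soup.find_all('div', class_='product'):
    name_tag = product.find('h2')
    price_tag = product.find('span', class_='price')
    name = name_tag.get_text(strip=True) if name_tag else ""
    price = price_tag.get_text(strip=True) if price_tag else ""
    ws.append([name, price])

wb.save("scraped_data.xlsx")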

Method 2: Scraping Dynamic Websites (using Selenium)

Dynamic websites load content using JavaScript. requests can’t handle this. Selenium can. It controls a real web browser.

Step 1: Set Up Selenium

Python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# --- For Headless Mode (Optional) ---
options = Options()
options.add_argument("--headless") # Run Chrome in headless mode

# service = Service('/path/to/chromedriver')  # Only needed if chromedriver isn't on your PATH
driver = webdriver.Chrome(options=options)  # Pass options to run headless
  • You need a chromedriver that matches your Chrome version (see Installation above); recent versions of Selenium can usually fetch it automatically.
  • Headless mode (optional): the --headless option runs Chrome without a visible window. This is faster and uses fewer resources. Good for servers.

Step 2: Navigate to the Page

Python
url = "https://www.example.com/dynamic-page"  # Replace
driver.get(url)
  • driver.get(url): Opens the webpage in the automated browser.

Step 3: Interact with the Page

Selenium lets you click buttons, fill forms, and scroll.

Python
# Example: Find an element by its ID and click it:
button = driver.find_element(By.ID, 'my-button')
button.click()

# Example: Find an input field by its name and type text:
input_field = driver.find_element(By.NAME, 'my-input')
input_field.send_keys("Hello, world!")

# Example: Wait for an element to appear (important for dynamic content!)
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Raises a TimeoutException if the element doesn't appear within 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-element"))
)
  • find_element(By.ID, '...'): Finds an element by its ID.
  • find_element(By.NAME, '...'): Finds an element by its name.
  • find_element(By.CLASS_NAME, '...'): Finds an element by its class name.
  • find_element(By.CSS_SELECTOR, '...'): Finds an element using a CSS selector (powerful!).
  • find_element(By.XPATH, '...'): Finds an element using an XPath expression (very flexible).
  • click(): Clicks on an element.
  • send_keys(): Types text into an input field.
  • WebDriverWait: Waits for a specific condition to be true (e.g., an element to be visible). Crucial for dynamic websites.

Step 4: Get the Page Source (after JavaScript has loaded)

Python
html_content = driver.page_source
  • driver.page_source: Gets the current HTML source code of the page, after any JavaScript has run. This is the key difference from requests.

Step 5: Parse with BeautifulSoup (same as Method 1)

Now you have the updated HTML. Use BeautifulSoup to extract the data, just like in Method 1.

Python
soup = BeautifulSoup(html_content, 'html.parser')
# ... (use find(), find_all(), etc. to extract data) ...

Step 6: Take a Screenshot with save_screenshot() (Optional)

Python
driver.save_screenshot('screenshot.png')

Step 7: Close the Browser

Python
driver.quit()  # Close the browser and free up resources

  • driver.quit(): Closes the browser window and ends the WebDriver session. Always do this, even when an error occurs (see the sketch below).
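
A simple way to guarantee the browser closes, even if the scrape fails halfway through, is to wrap the work in try...finally. A minimal sketch:

Python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.example.com/dynamic-page")  # Replace with your target URL
    html_content = driver.page_source
    # ... parse html_content with BeautifulSoup ...
finally:
    driver.quit()  # Runs even if an exception was raised above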

Method 3: Scraping with Pyppeteer

Pyppeteer is another browser automation library. It controls Chromium/Chrome.

Step 1: Set Up Pyppeteer

Python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True)  # headless=False to show the browser
    page = await browser.newPage()
    await page.goto('https://www.example.com')  # Replace

    # ... (Interact with the page, extract data) ...
    html_content = await page.content() # Get Page content

    await browser.close()

asyncio.run(main())
  • asyncio: Pyppeteer uses asynchronous programming. You need async and await.
  • launch(): Starts the browser (Chromium). headless=True (default) runs it in the background.
  • newPage(): Opens a new tab.
  • goto(): Navigates to a URL.
  • page.content(): Gets the HTML content of the page.

Step 2: Interact with the Page

Python
    # These snippets go inside the async main() function from Step 1.

    # Find an element by CSS selector and click it:
    button = await page.querySelector('#my-button')
    await button.click()

    # Type text into an input field:
    await page.type('#my-input', 'Hello, world!')

    # Wait for an element to appear:
    await page.waitForSelector('#dynamic-element')

    # Take a screenshot:
    await page.screenshot({'path': 'screenshot.png'})
  • querySelector(): Finds an element using a CSS selector.
  • type(): Types text into an input field.
  • click(): Clicks an element.
  • waitForSelector(): Waits for an element to appear.

Step 3: Parse with BeautifulSoup

Python
    soup = BeautifulSoup(html_content, 'html.parser')
    # ... (Extract data using BeautifulSoup) ...

Step 4: Close the Browser

Python
await browser.close()
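
The fragments in Steps 2–4 all live inside the async main() function from Step 1. Bringing them together, here is a complete sketch, using a placeholder URL and the same placeholder selectors (#my-input, #my-button, #dynamic-element) as above:

Python

import asyncio
from bs4 import BeautifulSoup
from pyppeteer import launch

async def main():
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto('https://www.example.com/dynamic-page')  # Replace

        await page.waitForSelector('#dynamic-element')  # Wait for JavaScript content
        await page.type('#my-input', 'Hello, world!')
        button = await page.querySelector('#my-button')
        if button is not None:
            await button.click()

        html_content = await page.content()
    finally:
        await browser.close()

    soup = BeautifulSoup(html_content, 'html.parser')
    print(soup.title)  # ... extract data with find(), find_all(), etc. ...

asyncio.run(main())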

Important Considerations

  • robots.txt: Check the website’s robots.txt file (e.g., https://www.example.com/robots.txt). It tells you which parts of the site you’re allowed to scrape. Respect it!
  • Terms of Service: Read the website’s terms of service. Web scraping might be prohibited.
  • Rate Limiting: Don’t make requests too quickly. Add delays (using time.sleep()). Websites can block you if you overload their servers.
  • User-Agent: Always set a realistic User-Agent.
  • IP Rotation: For large-scale scraping, use proxies to rotate your IP address. This helps avoid getting blocked. Consider services like Bright Data or Scrape.do.
  • Error Handling: Use try...except blocks to catch errors. The sketch after this list combines robots.txt checks, delays, and error handling.
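
The sketch below combines several of these points: it checks robots.txt with Python’s built-in urllib.robotparser, pauses between requests, and wraps each download in try...except. The URLs and the one-second delay are placeholders; tune them for the site you’re scraping.

Python

import time
import requests
from urllib.robotparser import RobotFileParser

USER_AGENT = 'Mozilla/5.0'

# Check robots.txt before scraping
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

urls = ["https://www.example.com/page1", "https://www.example.com/page2"]  # Placeholders

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        print(f"Skipping {url} (disallowed by robots.txt)")
        continue
    try:
        response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
        response.raise_for_status()
        print(f"Fetched {url} ({len(response.text)} characters)")
    except requests.RequestException as error:
        print(f"Failed to fetch {url}: {error}")
    time.sleep(1)  # Be polite: pause between requests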

FAQ

  1. Is web scraping legal?
    • It depends. Scraping publicly available, non-copyrighted data is generally okay. Always check the website’s terms of service. Avoid scraping personal data without permission.
  2. How can I avoid getting blocked?
    • Use a realistic User-Agent. Add delays. Rotate IP addresses (proxies). Respect robots.txt.
  3. What’s the difference between requests and Selenium/Pyppeteer?
    • requests is for static websites. Selenium and Pyppeteer are for dynamic websites (that use JavaScript).
  4. What’s the difference between find() and find_all() in BeautifulSoup?
    • find() returns the first matching element. find_all() returns a list of all matching elements.
  5. How do I find the right CSS selectors or XPaths?
    • Use your browser’s “Inspect Element” tool. Right-click on the data you want and select “Inspect”.
  6. What is Headless mode in web scraping?
    • Headless mode means running a browser without a visible graphical interface. It’s faster and uses fewer resources.
  7. What is an API?
    • An API is an official way for programs to interact with a website or service. If a website offers an API, use it instead of scraping. It’s more reliable and usually permitted. A minimal example follows this list.
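
For comparison, calling an API is usually just a request that returns structured JSON. This is a minimal sketch against a hypothetical endpoint; real APIs document their own URLs, parameters, and authentication.

Python

import requests

# Hypothetical endpoint - check the site's API documentation for the real URL and auth
response = requests.get("https://api.example.com/products", params={"page": 1}, timeout=10)
response.raise_for_status()
data = response.json()  # Already structured data - no HTML parsing needed
print(data)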

Conclusion

Python is a powerful tool for web scraping. With libraries like requests, BeautifulSoup, Selenium, and Pyppeteer, you can extract data from almost any website. Remember to scrape responsibly and ethically.

Call to Action

Need help with web scraping or data extraction projects? Contact Hir Infotech for expert data solutions. We can handle the technical complexities, so you can focus on using your data.

 #WebScraping #Python #DataExtraction #BeautifulSoup #Selenium #Pyppeteer #Excel #DataScience #DataMining #Automation #WebAutomation #2025
