Empowering Nonprofits with Python: A Pro Bono Initiative
Chapter 1: Introduction to the Initiative
Not long ago, I received a message on LinkedIn from a representative of a nonprofit organization.
The individual reached out after discovering my blog on LinkedIn. She mentioned that she commits to one pro bono project each month, but web scraping was outside her skill set. Recognizing this as a valuable opportunity to contribute, and knowing I had relevant code ready, I readily agreed to assist.
The nonprofit had compiled a list of about 6,000 organizations, and the initial requirement was to find the URLs for these entities. The subsequent step involved extracting text from the home page of each website, while the nonprofit would handle the text analysis themselves.
Following the download of an Excel file containing the organization names, I initiated my project with the necessary imports:
from googlesearch import search
import glob
import pandas as pd
import time
pd.set_option('display.max_columns', None)
Next, I loaded the spreadsheet using pandas:
df = pd.read_excel('c:/users/denni/downloads/March 2022 CJNP.xlsx')
df.tail()
The next phase involved creating a function to retrieve the nonprofit's URL based on its name and state:
def getURL(name, location):
    try:
        term = name + ' ' + location
        for url in search(term, num_results=1):
            return url
    except:
        return ''
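As a quick sanity check before running it over the full list, the function can be called directly. The organization name below is just an illustrative placeholder, not an entry from the actual spreadsheet:

# Hypothetical example — prints the first Google result for the search term
print(getURL('Feeding America', 'IL'))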
I then iterated through the dataframe, providing the name and location to the function, and updating the dataframe with the retrieved URLs:
for index, row in df.iterrows():
    if index > 4 and index < 5638:
        URL = getURL(row['Name'], row['State'])
        df.at[index, 'URL'] = URL
        if index % 10 == 0 and index > 1:
            print(index, row['Name'], URL)
            time.sleep(15)
        print(index)
        time.sleep(1)
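The original write-up jumps ahead here, but the gathered URLs need to be written out before the next phase, since the Selenium portion reads them back from a file named nonprofit.xlsx. A minimal sketch of that intermediate export (the path simply mirrors the one used elsewhere in this post and is an assumption):

# Assumed intermediate step: persist the gathered URLs for the scraping phase
df.to_excel('c:/users/denni/downloads/nonprofit.xlsx', index=False)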
Once this process was finished, I employed Selenium to extract the text from the pages. Below are my necessary imports (some of which were copied from another Jupyter notebook):
import pandas as pd
import shutil
import os, re, requests, urllib
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from datetime import datetime, date, timedelta
import datetime
from dateparser.search import search_dates
from bs4 import BeautifulSoup
import glob as glob
I loaded the file that contained the URLs I had previously gathered:
df = pd.read_excel('c:/users/denni/downloads/nonprofit.xlsx')
df.tail()
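The loop below assumes a driver object already exists; the original notebook doesn't show that setup, so here is a minimal sketch of one way to create it with Chrome (the chromedriver path is a placeholder):

# Assumed driver setup — adjust the chromedriver path for your machine
service = Service('c:/path/to/chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.set_page_load_timeout(30)  # avoid hanging indefinitely on slow sites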
Finally, I looped through all the URLs to scrape the text from each page:
for index, row in df.iterrows():
    if index > -1:
        try:
            driver.get(row['URL'])
            text = driver.find_element(By.XPATH, "/html/body").text
            df.at[index, 'Content'] = text
            if index % 10 == 0:
                print(index, row['URL'], len(text))
        except:
            pass
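The post doesn't show the final export, but the scraped text has to be handed back to the nonprofit for their analysis; a reasonable last step (the filename here is my assumption) would be:

# Assumed final step: write the scraped text out and close the browser
df.to_excel('c:/users/denni/downloads/nonprofit_content.xlsx', index=False)
driver.quit()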
And just like that, the project was complete!