
Scraping Data with Python

12 Aug 2019
Category Python

I reached a point where I needed to pull data from a third-party portal and pass it on to customers.

I have access to the backend, but there is no API, so making the data useful in a customer context takes some hacking.

I ended up going with the Python Requests library, using XPath (via lxml) to pull values out of the pages.

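import requests
from lxml import html

# Placeholder form fields; the real names depend on the portal's login form
payload = {"username": "<USERNAME>", "password": "<PASSWORD>"}

# Create a session so cookies persist across requests
session_requests = requests.session()

# Fetch the login page and extract the CSRF token using XPath and lxml
login_url = "https://example.com/login.php"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(
    set(tree.xpath("//input[@name='loginSubmit']/@value")))[0]
# Assumption: the form expects the token back under the same field name
payload["loginSubmit"] = authenticity_token

# Log in by posting the payload
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)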
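
Once the login has gone through, the same session can be reused to fetch the pages that hold the data. Here is a rough sketch of that step, assuming the data sits in a plain HTML table. The function names, the "//table//tr" expression and the report URL in the note below are my own placeholders, not anything taken from the portal.

```python
from lxml import html


def extract_rows(page_html, row_xpath="//table//tr"):
    """Return each table row as a list of stripped cell strings."""
    tree = html.fromstring(page_html)
    rows = []
    for tr in tree.xpath(row_xpath):
        # Collect header and data cells in document order
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:
            rows.append(cells)
    return rows


def scrape_report(session, report_url):
    """Fetch a page with an already logged-in session and parse its table."""
    response = session.get(report_url, headers={"referer": report_url})
    # Raise on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return extract_rows(response.text)
```

With the session from the login step, something like scrape_report(session_requests, "https://example.com/report.php") would return a list of rows, where the URL is a made-up example. Keeping the parsing in its own function means it can be tried against a saved copy of the page without logging in each time.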

Lessons learned

Load all the assets

- Chasing a moving target: sites that change structure frequently are harder to scrape consistently. There will always be times when scraping breaks and needs readjustment.
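
One way to make those breakages easier to spot is to fail loudly when an XPath expression stops matching, rather than hitting a bare IndexError on an empty list. A minimal sketch, assuming lxml as above; LayoutChangedError and require_one are names I made up for illustration, and the sample HTML is a stand-in for a real login page.

```python
from lxml import html


class LayoutChangedError(Exception):
    """Raised when an expected element is missing from a page."""


def require_one(tree, xpath, label):
    """Return the first XPath match, or raise an error naming what went missing."""
    matches = tree.xpath(xpath)
    if not matches:
        raise LayoutChangedError(f"{label}: nothing matched {xpath!r}")
    return matches[0]


# Stand-in for a fetched login page
page = html.fromstring(
    "<html><body><form>"
    "<input name='loginSubmit' value='abc123'>"
    "</form></body></html>"
)
token = require_one(page, "//input[@name='loginSubmit']/@value", "login token")
```

When the site changes its markup, the error message then points at the expression that needs adjusting.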

Next Challenge

Scraping JS table data.