In the beginning stages of prototyping our team’s Source Directory bot, I taught myself how to scrape articles from the web. I started by practicing with Books To Scrape, which is a fantastic resource for those with no previous knowledge of the topic.
This is the very first edition of a series on web scraping for journalists. In this article, we’ll discuss why this is a valuable skill and install the modules we’ll need to get started.
Before we get into it, I wanted to share some things I’ve learned through this experience:
Web scraping is a lot easier than you might think if you have a basic knowledge of Python.
You can’t just scrape any site, because many have anti-bot protections specifically designed to block automated requests. Here’s one way to get around that issue, if you’re interested:
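A minimal sketch, assuming the site accepts requests that carry a browser-like User-Agent header. The URL below is just a placeholder, and the requests library it uses is one we’ll install later in this article:

```python
import requests

# A placeholder URL -- swap in the site you actually want to scrape.
url = "https://example.com"

# Some sites reject requests that don't look like they come from a
# browser, so we attach a browser-like User-Agent header.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the request went through
```

This won’t work on every site, but it’s a common first step when a scraper gets blocked.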
If you’re a journalist and don’t know how to code at all, you’re missing out. Coding opens up possibilities for finding stories, telling those stories in new ways, making sense of large amounts of data, automating routine processes and saving massive amounts of time — while spending the time you do have wisely.
I’d recommend coding in Visual Studio Code and keeping your projects organized in folders within the “Explorer” panel.
There are different levels to web scraping. Since this is a beginner’s guide, I’m not going to go into all the numerous ways you can organize your data or all the different elements you can pull from a webpage’s code. That’s because every project is different and there isn’t just one way to code. Understanding the basics behind the processes allows you to tailor your code to a specific need. For those of you who are interested in going beyond the basics, there are so many tutorials and resources out there to help you expand your knowledge and capabilities.
ChatGPT has been an incredibly helpful resource to me, as it does an amazing job of breaking down methods and the reasoning behind a piece of code. You can ask ChatGPT why you keep getting a certain error or how to go about coding up a specific task you need help with. Because these models were trained on enormous amounts of code, they tend to be especially strong on programming questions.
Lastly, one of the greatest parts about coding is that there are no limits to what you can do, and there’s always more to learn. It’s so fulfilling when your program actually does what you want it to do, and the only way to get there is by wrestling with it and growing from each error. If you stick with it, I promise you’ll create something awesome and be proud of what you’ve done.
To me, coding is a form of art. And just as an artist has to gather any materials they’ll need for their project, a coder has to install any modules (libraries) they’ll need to use in their program.
You can install libraries in VS Code via the terminal using pip, Python’s standard package installer. If you don’t have pip installed, follow this guide. Once you have pip, it’s easy to install other libraries via the terminal.
I’d recommend installing Pylint, a code analysis tool for Python that identifies errors in your code and enforces coding standards. Pylint is invaluable because it flags problems automatically, clearing up the confusion you may have about why your code won’t work. To install Pylint, enter this into the terminal: pip install pylint.
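To see what Pylint catches, here’s a deliberately sloppy little snippet. The file name example.py is just for illustration:

```python
# example.py -- a deliberately sloppy snippet for Pylint to critique
import os  # never used below, so Pylint flags it as an unused import

def greet(name):
    print("hello, " + name)

greet("reader")
```

Running pylint example.py in the terminal reports the unused import, along with style notes like missing docstrings, so you can clean things up before they cause real confusion.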
Next, you’re going to need BeautifulSoup, a library used to pull data out of HTML and XML files. To install BeautifulSoup, enter this into the terminal: pip install beautifulsoup4.
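As a quick taste of what BeautifulSoup does, here’s a minimal sketch that parses a small, made-up HTML string using html.parser, which ships with Python:

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p class='intro'>A sample paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in parser

print(soup.h1.text)                         # Hello
print(soup.find("p", class_="intro").text)  # A sample paragraph.
```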
Now, let’s install requests, a library that makes HTTP requests simple. Basically, requests lets you fetch the contents of a URL directly from your code, exactly what a web scraper needs. To install requests, enter this into the terminal: pip install requests.
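For example, here’s a minimal sketch that fetches the homepage of Books to Scrape, the practice site mentioned earlier:

```python
import requests

# Fetch the Books to Scrape practice site
response = requests.get("https://books.toscrape.com/")

print(response.status_code)  # 200 means the request succeeded
print(response.text[:200])   # the first 200 characters of the page's HTML
```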
Let’s also go ahead and install the lxml parser, which makes parsing way faster than Python’s built-in XML parsing modules. To install lxml, enter this into the terminal: pip install lxml.
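Once lxml is installed, you can tell BeautifulSoup to use it instead of the default parser. A minimal sketch:

```python
from bs4 import BeautifulSoup

# Passing "lxml" tells BeautifulSoup to parse with the faster lxml library
soup = BeautifulSoup("<p>Parsed with lxml.</p>", "lxml")
print(soup.p.text)  # Parsed with lxml.
```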
Lastly, we’re going to install pandas, a very powerful open-source Python data analysis and manipulation library. I was blown away when I learned that this library existed, as it allows you to export the output of your code into an Excel spreadsheet in a structured manner. For instance, if you have a dictionary of books that includes the following keys: title, author, published_date, link and summary, the pandas library would structure the data into Excel, using each of the keys as column headers. To install pandas, enter this into the terminal: pip install pandas.
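Here’s a minimal sketch of that idea with made-up data. One caveat: writing .xlsx files also requires the openpyxl package (pip install openpyxl):

```python
import pandas as pd

# A hypothetical dictionary of books, using the keys described above
books = {
    "title": ["Book One", "Book Two"],
    "author": ["Jane Doe", "John Smith"],
    "published_date": ["2020-01-01", "2021-06-15"],
    "link": ["https://example.com/book-one", "https://example.com/book-two"],
    "summary": ["A short summary.", "Another short summary."],
}

df = pd.DataFrame(books)                # each key becomes a column header
df.to_excel("books.xlsx", index=False)  # writes a spreadsheet (requires openpyxl)
```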
That’s all for this edition. In the next issue, we’ll learn how to inspect an article, discover the patterns in a webpage’s code and get into the nitty-gritty of writing our own scraper. By the end, you’ll have scraped the text of a Daily Texan article into an Excel file. Cool, right? You won’t want to miss it.
Until next time, hook ’em!