Artificial intelligence (AI) and machine learning (ML) seem to have piqued the interest of automated data collection providers. While web scraping has been around for some time, AI/ML implementations have only recently appeared on providers’ radar.
Aleksandras Šulženko, Product Owner at Oxylabs.io, who has been working with these solutions for several years, shares his insights on the importance of artificial intelligence, machine learning, and web scraping.
BN: How has the implementation of AI/ML solutions changed the way you approach development?
AS: AI/ML has an interesting work-payoff ratio. Good models can take months to write and develop, and until then you don’t really have anything to show. A dedicated scraper or parser, on the other hand, can be written in a day or two. Once an ML model is in place, however, maintaining it takes far less time for the amount of work it covers.
So, there’s always a choice. You can build dedicated scrapers and parsers, which will take significant amounts of time and effort to maintain once they start stacking up. The other choice is to have “nothing” for a significant amount of time, but a brilliant solution later on, which will save you tons of time and effort.
There’s some theoretical point beyond which developing custom solutions is no longer worth it. Unfortunately, there’s no mathematical formula for finding it. You have to make the call when the repetitive tasks become too much of a drain on resources.
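To illustrate why dedicated parsers stack up as a maintenance burden, here is a minimal, hypothetical sketch (the markup and field names are invented for illustration, not Oxylabs’ code): a parser hard-wired to one page layout stops returning data the moment the site’s markup changes.

```python
import re

# A dedicated parser: hard-coded to one page layout (hypothetical markup).
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d.]+)</span>')

def parse_price(html):
    """Return the price as a float, or None if the layout doesn't match."""
    m = PRICE_PATTERN.search(html)
    return float(m.group(1)) if m else None

page_v1 = '<div><span class="price">$19.99</span></div>'
page_v2 = '<div><span class="product-cost">$19.99</span></div>'  # site redesign

print(parse_price(page_v1))  # 19.99
print(parse_price(page_v2))  # None -- the parser silently breaks
```

Multiply this by hundreds of target sites, each redesigning on its own schedule, and the maintenance cost of the dedicated approach becomes clear.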
BN: Have these solutions had a visible impact on the deliverability and overall viability of the project?
AS: Getting started with machine learning is tough. It’s still, comparatively speaking, a niche specialization. In other words, you won’t find many developers who dabble in ML, and knowing how hard it can be to find a developer in any discipline, it’s a difficult hurdle to clear.
Yet, if the business approach to scraping is based on a long-term vision, ML will definitely come in handy down the road. Every good vision involves scaling, and with scaling come repetitive tasks. These are best handled with machine learning.
Our Adaptive Parser is a great example. It was once almost unthinkable that a machine learning model could deliver such value. Now the solution returns parsed results from a multitude of e-commerce product pages, irrespective of the differences between them or any changes that happen over time. A solution like that is irreplaceable.
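As an illustration only (not the Adaptive Parser itself), the idea of layout-agnostic parsing can be sketched with a simple heuristic: instead of per-site selectors, look for price-shaped text anywhere on the page. A real ML parser learns such cues from labeled examples; the hand-written regex below merely stands in for them.

```python
import re

# Layout-agnostic sketch: find currency-shaped text anywhere in the page,
# regardless of which tags or class names surround it.
CURRENCY = re.compile(r'[$€£]\s?(\d{1,6}(?:[.,]\d{2})?)')

def extract_price(html):
    """Return the first price-shaped string found, or None."""
    text = re.sub(r'<[^>]+>', ' ', html)  # crude tag stripping
    m = CURRENCY.search(text)
    return m.group(0).replace(' ', '') if m else None

shop_a = '<span class="price">$19.99</span>'
shop_b = '<p id="cost">Now only €24,50!</p>'  # different markup, same data

print(extract_price(shop_a))  # $19.99
print(extract_price(shop_b))  # €24,50
```

Both pages yield a price without any site-specific code, which is the property that makes a learned, generic parser so valuable at scale.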
BN: In a previous interview, you’ve mentioned the importance of making things more user-friendly for web scraping solutions. Is there any particular reason you would recommend moving development towards no-code implementations?
AS: Even companies with large IT departments may have issues with integration. Developers are almost always busy, and taking time out of their schedules for integration purposes is tough. Most end-users of the data that Scraper APIs deliver, after all, aren’t tech-savvy.
Additionally, the departments that need scraping the most, such as marketing or data analytics, might not have enough sway over developers’ roadmaps. As a result, even relatively small hurdles can become blockers. Scrapers should now be developed with a non-technical user in mind.
There should be plenty of visuals that allow for a simplified construction of workflows with a dashboard that’s used to deliver information clearly. Scraping is becoming something done by everyone.
BN: What do you think lies in the future of scraping? Will websites become increasingly protective of their data, or will they eventually forego most anti-scraping sentiment?
AS: There are two answers I can give. One is “more of the same”. It’s a boring answer, surely, but an inevitable one. Delving deeper into the scaling and proliferation of web scraping isn’t as interesting as the other answer: the legal context.
Currently, it seems as if our position in the industry isn’t perfectly decided. Case law forms the basis of how we think and approach web scraping. Yet, it all might change on a whim. We’re closely monitoring the developments due to the inherent fragility of the situation.
There’s a possibility that companies will realize the value of their data and start selling it on third-party marketplaces. It would reduce the value of web scraping as a whole as you could simply acquire what you need for a small price. Most businesses, after all, need the data and the insights, not web scraping. It’s a means to an end.
There’s a lot of potential in the grand vision of Web 3.0 — the initiative to make the whole Web interconnected and machine-readable. If this vision came to life, the whole data gathering landscape would be vastly transformed: the Web would become much easier to explore and organize, parsing would become a thing of the past, and webmasters would get used to the idea of their data being consumed by non-human actors.
Finally, I think user-friendliness will be the focus in the future. I don’t mean just the no-code part of scraping. A large part of getting data is exploration — finding where and how it’s stored and getting to it. Customers will often formulate an abstract request and developers will follow up with methods to acquire what is needed.
In the future, I expect, the exploration phase will be much simpler. Maybe we’ll be able to take the abstract requests and turn them into something actionable through an interface. In the end, web scraping is breaking away from its shell of being something code-ridden or hard to understand and evolving into a daily activity for everyone.
Photo Credit: Photon photo/Shutterstock