Columbus Data Scraping: February 2015

Saturday, 28 February 2015

Data Mining Explained

Overview

Data mining is the crucial process of extracting implicit and possibly useful information from data. It uses analytical and visualization techniques to explore and present information in a format which is easily understandable by humans.

Data mining is widely used in a variety of profiling practices, such as fraud detection, marketing research, surveys and scientific discovery.

In this article I will briefly explain some of the fundamentals of data mining and its applications in the real world.

I will not discuss related processes here, such as Data Extraction and Data Structuring.

The Effort

Data Mining has found applications in various fields, such as financial institutions, health care and bioinformatics, business intelligence, social network data research and many more.

Businesses use it to understand consumer behavior, analyze clients' buying patterns and expand their marketing efforts. Banks and financial institutions use it to detect credit card fraud by recognizing the patterns involved in fake transactions.

The Knack

There is definitely a knack to Data Mining, as there is with any other field of web research. That is why it is referred to as a craft rather than a science. A craft is the skilled practice of an occupation.

One point I would like to make here is that data mining solutions offer an analytical perspective on the performance of a company based on historical data, but one also needs to consider unknown external events and deceitful activities. On the flip side, it is even more critical, especially for regulatory bodies, to forecast such activities in advance and take the necessary measures to prevent such events in the future.

In Closing

There are many important niches of Web Data Research that this article has not covered. But I hope that this article will give you a starting point to drill down further into this subject, if you want to do so!

Should you have any queries, please feel free to mail me. I would be pleased to answer each of your queries in detail.

Source: http://ezinearticles.com/?Data-Mining-Explained&id=4341782

Thursday, 26 February 2015

What Is ISL Uranium Mining

In situ leach mining (ISL), also known as in-situ mining or solution mining, was first used as a means to extract low grades of uranium from ore in underground mines. First used in Wyoming in the 1950s, originally as a low-production experiment at the Lucky June mine, it became a high-production, low-cost method of fulfilling Atomic Energy Commission uranium requirements at Utah Construction Company's Shirley Basin mining operations in the 1960s. The method was pioneered through the efforts of Charles Don Snow, a uranium mining and exploration geologist employed by Utah Construction, and many of his developments are still used today in ISL mining.

What is ISL mining? According to the Wyoming Mining Association website, ISL mining is explained in the following manner. (We choose Wyoming because it is the birthplace of "solution mining" as it was originally called.)

"In-situ mining is a noninvasive, environmentally friendly mining process involving minimal surface disturbance which extracts uranium from porous sandstone aquifers by reversing the natural processes which deposited the uranium.

To be mined in situ, the uranium deposit must occur in permeable sandstone aquifers. These sandstone aquifers provide the "plumbing system" for both the original emplacement and the recovery of the uranium. The uranium was emplaced by weakly oxidizing ground water which moved through the plumbing systems of the geologic formation. To effectively extract uranium deposited from ground water, a company must first thoroughly define this plumbing system and then design well fields that best fit the natural hydro-geological conditions.

Detailed mapping techniques, using geophysical data from standard logging tools, have been developed by uranium companies. These innovative mapping methods define the geologic controls of the original solutions, so that these same routes can be retraced for effective in situ leaching of the ore. Once the geometry of the ore bodies is known, the locations of injection and recovery wells are planned to effectively contact the uranium. This technique has been used in several thousand wells covering hundreds of acres.

Following the installation of the well field, a leaching solution (or lixiviant), consisting of native ground water containing dissolved oxygen and carbon dioxide, is delivered to the uranium-bearing strata through the injection wells. Once in contact with the mineralization, the lixiviant oxidizes the uranium minerals, which allows the uranium to dissolve in the ground water. Production wells, located between the injection wells, intercept the pregnant lixiviant and pump it to the surface. A centralized ion-exchange facility extracts the uranium from the pregnant lixiviant. The barren lixiviant, stripped of uranium, is regenerated with oxygen and carbon dioxide and recirculated for continued leaching. The ion exchange resin, which becomes 'loaded' with uranium, is stripped or eluted. Once eluted, the ion exchange resin is returned to the well field facility.

During the mining process, slightly more water is produced from the ore-bearing formation than is reinjected. This net withdrawal, or 'bleed,' produces a cone of depression in the mining area, controlling fluid flow and confining it to the mining zone. The mined aquifer is surrounded, both laterally and above and below, by monitor wells which are frequently sampled to ensure that all mining fluids are retained within the mining zone. The 'bleed' also provides a chemical bleed on the aquifer to limit the buildup of species like sulfate and chloride which are affected by the leaching process. The 'bleed' water is treated for removal of uranium and radium. This treated water is then disposed of through waste water land application, or irrigation. A very small volume of radioactive sludge results; this sludge is disposed of at an NRC licensed uranium tailings facility.

The ion exchange resin is stripped of its uranium, and the resulting rich eluate is precipitated to produce a yellow cake slurry. This slurry is dewatered and dried to a final drummed uranium concentrate.

At the conclusion of the leaching process in a well field area, the same injection and production wells and surface facilities are used for restoration of the affected ground water. Ground water restoration is accomplished in three ways. First, the water in the leach zone is removed by "ground water sweep", and native ground water flows in to replace the removed contaminated water. The water which is removed is again treated to remove radionuclides and disposed of in irrigation. Second, the water which is removed is processed to purify it, typically with reverse osmosis, and the pure water is injected into the affected aquifer. This reinjection of very pure water results in a large increment of water quality improvement in a short time period. Third, the soluble metal ions which resulted from the oxidation of the ore zone are chemically immobilized by injecting a reducing chemical into the ore zone, immobilizing these constituents in situ. Ground water restoration is continued until the affected water is suitable for its pre-mining use.

Throughout the leaching and restoration processes, a company ensures the isolation of the leach zone by careful well placement and construction. The well fields are extensively monitored to prevent the contamination of other aquifers.

Once mining is complete, the aquifer is restored by pumping fresh water through the aquifer until the ground water meets the pre-mining use.

In situ mining has several advantages over conventional mining. First, the environmental impact is minimal, as the affected water is restored at the conclusion of mining. Second, it is lower cost, allowing Wyoming's low grade deposits to compete globally with the very high grade deposits of Canada. Finally the method is safe and proven, resulting in minimal employee exposure to health risks."

ISL mining may be the wave of the future of U.S. uranium mining, or it may become an interim mining measure in areas where the geology is appropriate for ISL. Until sufficient quantities of uranium are required by U.S. utilities to fuel the country's demand for nuclear energy, ISL mining may remain the leading uranium mining method in the United States. At some point, an overwhelming need for uranium for the nuclear fuel cycle may again put ISL mining in the backseat, and uranium miners may return to conventional mining methods, such as open pit mining.

Source: http://ezinearticles.com/?What-Is-ISL-Uranium-Mining&id=183880

Sunday, 22 February 2015

Ancient Basic Tools to Green Light Laser: The Evolution of Mining

Mining is the process of extracting minerals and geological materials from the earth. Miners help recover many elements. These materials are rare because they cannot be grown, agriculturally processed or artificially created. Precious metals, coal, diamonds, and gold are just some of these materials. Mining also helps man to unearth non-renewable energy sources like natural gas and petroleum, and even water. The job of miners can be difficult and risky. Thanks to efficient mining equipment, the task is a lot easier now.

People in ancient times made use of the earth for many purposes. One way to make a living at the time was mining. Equipment was not fully developed, but people managed to unearth many precious stones and different kinds of metals. They used these minerals and elements to make basic tools for hunting and warfare. High-quality flints found in masses of sedimentary rock were in demand in many parts of Europe. People used these flints as weapons during the Stone Age.

Ancient Egyptians were among the first to successfully get minerals from the earth. Their advanced level of civilization made it possible for them to produce quality mining tools. They mined malachite and gold. Malachites are green stones used for pottery and as ornaments. The Egyptians started to quarry for other minerals not found in their soils. They headed to Nubia, a part of Africa. There they used iron tools as mining equipment. That was the time when fire-setting was used to extract gold from ores. This method involves setting the rock containing the mineral against another rock, heating it and dousing it with water. This was the most effective mining method at that time.

The Romans also played an important part in the history of mining. They were the first to use large scale quarrying methods. An example of this is the application of volumes of water to operate simple machinery and remove debris. This is the birth of hydraulic mining.

The demand for metal increased dramatically in the 1300s. This was the time when swords, armor, and other weapons were in demand. For this reason, miners looked for more sources of iron and silver. There was also an increase in the demand for coins, which caused a shortage of silver. Iron, on the other hand, was used in building construction. With the high value of these materials, machinery and other mining equipment came into demand in the market.

These machines and equipment were the forerunners of the mining tools that we have today. Miners today use bulldozers, explosives and trucks. More advanced mining tools include green light lasers, which serve as saw guides and help with machine alignment. With all this modern equipment, miners now have a safer and faster process to break down rocks and even carve out mountains. All this equipment is produced and applied using the principles of engineering.

As of today, there are five major mining categories. They are coal, metal ore, non-metallic mineral mining, oil and gas extraction. Oil and gas extraction is among the biggest industries in the world today.

Source: http://ezinearticles.com/?Ancient-Basic-Tools-to-Green-Light-Laser:-The-Evolution-of-Mining&id=6768619

Thursday, 19 February 2015

Online Retail - Mining for Gold

Online retailers live in an ever-changing environment, and the ability to stay competitive is the difference between doing well and doing nothing. In today's fast-paced internet marketplace, if you aren't using web scraping, you are missing a key component to growing your business.

Data Mining

Data mining your competition's prices and services and making sure your prices and services are similar, or even lower, is what makes the difference. Why should your customer choose you if they can get the same product somewhere else for less? What data you collect and how often you update it is also another key ingredient to success.

Extract Website Data

Web scraping allows you to gather information from your competition and use it to improve your position in the market. When you extract website data from your competitor's website, it allows you to conduct business from a position that doesn't involve guesswork. The internet is an environment that is constantly being updated and changed. It is vital that you have up-to-date information on what others in your market are doing. If you can't do this, you really can't compete.

Application of Information

When you know what your competitors are doing all the time, you can keep your business a little more competitive than they are. When you have information such as monthly and even weekly price variations in the market and what products and services are being offered, you can apply that information to your own pricing matrix and ensure a competitive edge in your market.

An Army of One

Web scraping gives you the ability to see what is going on in the market at all times. You can monitor just about anything you choose with a web scraping service. Many online retailers are very small operations and they don't have the resources to constantly monitor each competitor's website - so engaging a web scraping service is like having your own marketing and research team working for you night and day to keep tabs on them. You choose what it is you want to know, and your research team goes to work. Simple.

Staying Ahead of Trends

Having the ability to recognize trends is the key to any business, especially on the internet where information is so fluid. The business that can identify a trend quickly and take advantage of it will always stay one step ahead. That's why big corporations have teams dedicated to researching market trends and predictions. If you can see where something is going, you can always get ahead of it. That's what web scraping can help you do - identify those trends in your market so you can get in ahead of the pack.

A Helping Hand

Sometimes running your own online retail business can be a daunting and lonely ordeal. Even those that have a great deal of experience with the internet can feel lost at times. A web scraping service is a tool you can use to help yourself in such times. Web scraping is automated and precise, and it gives you the ability to have vital information delivered to you in a manner you can understand and use. It's one less thing to worry about - and the information you get from data mining is what every business owner actually should worry about: what the competition is doing. With a web scraping service, you can concern yourself with other things - like making more profits.

Source: http://ezinearticles.com/?Online-Retail---Mining-for-Gold&id=6531024

Wednesday, 18 February 2015

Commercial Kitchen Ventilation and Extraction - What You Need to Know

There are a number of things to consider when installing commercial kitchen ventilation and there are several different types of systems available - but all must comply with the "Standard for kitchen ventilation systems DW172". A commercial kitchen cannot operate effectively without a properly designed and functioning ventilation system. Getting the design right for YOUR premises can be complex. All systems are operation and site specific: how you move the air, where you move it to and what you have to do with it must comply not only with the relevant legislation, but also with any local building and environmental constraints.

The factors that may need to be addressed include not only physically moving the air, but heat, humidity, smoke, fire, grease and odour. There are various filter and safety systems available that deal with any or all of these issues and the best system for you will depend on your site, its surroundings and your budget. You may also have to deal with noise from the fan(s) and any planning issues relating to external ducting.

In basic terms a ventilation system comprises a canopy over the production area with a fan linked by ducting to a filter bank within the kitchen extraction canopy which draws the air out to the external exhaust point. The fan is sized in direct relation to the amount of air that has to be moved, where it has to be moved to (the exhaust point) and how quickly (depending on the type of food being cooked).

In addition, mechanical provision must be made to replace 85% of the air that is being extracted. This is called "Make up Air". The other 15% is made up by natural means, such as general kitchen areas and windows.

Within the design, careful consideration must also be given to ensure adequate access for cleaning of the duct and servicing of the fans.

If the production equipment is gas, in accordance with British Standard (BS6173) you will have to fit a Gas Interlock system. This system automatically shuts off the gas supply to the cooking equipment in the event of a failure in the ventilation system.

You may also want to consider the installation of a Heat Recovery unit which reclaims the heat (and some of the fuel cost) from your kitchen that would normally be blasted straight out through your extraction canopy.

Source: http://ezinearticles.com/?Commercial-Kitchen-Ventilation-and-Extraction---What-You-Need-to-Know&id=6438003

Sunday, 15 February 2015

The Trouble With Bots, Spiders and Scrapers

With the Q4 State of the Internet - Security Report due out later this month, we continue to preview sections of it.

Earlier this week we told you about a DDoS attack from a group claiming to be Lizard Squad. Today we look at how third-party content bots and scrapers are becoming more prevalent as developers seek to gather, store, sort and present a wealth of information available from other websites.

These meta searches typically use APIs to access data, but many now use screen-scraping to collect information.

As the use of bots and scrapers continues to surge, there's an increased burden on webservers. While bot behavior is mainly harmless, poorly-coded bots can hurt site performance and resemble DDoS attacks. Or, they may be part of a rival's competitive intelligence program.

Understanding the different categories of third-party content bots, how they affect a website, and how to mitigate their impact is an important part of building a secure web presence.

Specifically, Akamai has seen bots and scrapers used for such purposes as:

•    Setting up fraudulent sites
•    Reuse of consumer price indices
•    Analysis of corporate financial statements
•    Metasearch engines
•    Search engines
•    Data mashups
•    Analysis of stock portfolios
•    Competitive intelligence
•    Location tracking

During 2014 Akamai observed a substantial increase in the number of bots and scrapers hitting the travel, hotel and hospitality sectors. The growth in scrapers targeting these sectors is likely driven by the rise of rapidly developed mobile apps that use scrapers as the fastest and easiest way to collect information from disparate websites.

Scrapers target room rate pages for hotels and pricing and schedule pages for airlines. In many cases that Akamai investigated, scrapers and bots made several thousand requests per second, far in excess of what can be expected from a human using a web browser.

An interesting development in the use of headless browsers is the advent of companies that offer scraping as a service, such as PhantomJs Cloud. These sites make it easy for users to scrape content and have it delivered, lowering the bar to entry and making it easier for unskilled individuals to scrape content while hiding behind a service.

For each type of bot, there is a corresponding mitigation strategy.

The key to mitigating aggressive, undesirable bots is to reduce their efficiency. In most cases, highly aggressive bots are only helpful to their controllers if they can scrape a lot of content very quickly. By reducing the efficiency of the bot through rate controls, tar pits or spider traps, bot-herders can be driven elsewhere for the data they need.

Aggressive but desirable bots are a slightly different problem. These bots adversely impact operations, but they bring a benefit to the organization. Therefore, it is impractical to block them fully. Rate controls with a high threshold, or a user-prioritization application (UPA) product, are a good way to minimize the impact of a bot. This permits the bot access to the site until the number of requests reaches a set threshold, at which point the bot is blocked or sent to a waiting room. In the meantime, legitimate users are able to access the site normally.
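As a minimal sketch of the rate-control idea described above (not Akamai's implementation), the following Python class allows each client a fixed number of requests per sliding time window and rejects anything beyond that. The class name `RateControl`, the threshold of 3 requests and the 1-second window are all assumptions chosen for illustration; a real deployment would use a much higher threshold for a desirable bot.

```python
from collections import defaultdict, deque


class RateControl:
    """Sliding-window rate control: at most max_requests per window per client."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        # client id -> timestamps of that client's recently allowed requests
        self.hits = defaultdict(deque)

    def allow(self, client, now):
        """Return True if the request at time `now` is under the threshold."""
        recent = self.hits[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the threshold: block or send to a waiting room
        recent.append(now)
        return True


# Hypothetical traffic: five requests from one bot at the given timestamps.
rc = RateControl(max_requests=3, window_seconds=1.0)
decisions = [rc.allow("bot", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(decisions)
```

The fourth request is rejected because three requests already fall inside the window, while the fifth is allowed again once the oldest request has aged out. Other clients are tracked separately, so legitimate users are unaffected.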

Source: https://blogs.akamai.com/2015/01/performance-mitigation-bots-spiders-and-scrapers.html

Wednesday, 11 February 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?

That’s right, we’re talking about pulling our data straight out of HTML, otherwise known as web scraping. Here’s why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first we’ll start with some great reasons why you should consider web scraping first, before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that’s more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDoS attack. You’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want to gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
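As a small sketch of that parameter-trimming step, the helper below uses Python's standard `urllib.parse` module to keep only the GET parameters you name and rebuild a well-formed query string. The function name `strip_params`, the example URL and its parameter names are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url, keep):
    """Return `url` with only the GET parameters whose names are in `keep`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    # urlencode() joins the pairs with "&"; urlunsplit() adds the leading "?".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical search URL with tracking and session parameters we don't need.
url = "http://example.com/search?q=blue+shoes&utm_source=mail&sessionid=abc123&page=2"
print(strip_params(url, {"q", "page"}))
```

Letting the library assemble the query string avoids hand-typing mistakes with the `?` and `&` separators.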

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.

The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.
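The offset loop described above could be sketched roughly like this. The real HTTP call is replaced by a stand-in function (`fake_fetch`) so the example runs offline; `scrape_all`, the page size of 10 and the "empty page means end of data" rule are assumptions, since every site signals the end of its data differently.

```python
def scrape_all(fetch_page, page_size, max_pages=1000):
    """Collect items from every page until an empty page signals the end."""
    items = []
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving site can't loop forever
        page = fetch_page(offset)
        if not page:  # our assumed "fell off the end" condition
            break
        items.extend(page)
        offset += page_size
    return items


# Stand-in for a real request: 23 fake results served 10 at a time.
DATA = list(range(23))


def fake_fetch(offset, size=10):
    return DATA[offset:offset + size]


print(len(scrape_all(fake_fetch, page_size=10)))
```

In a real scraper, `fetch_page` would request the URL with the offset parameter set and return the parsed items for that page.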

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with Javascript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
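To illustrate why a JSON endpoint is so much easier to work with, here is a hedged sketch that parses a response body with Python's built-in `json` module. The payload shape (`results`, `name`, `price`) is invented for the example; a real endpoint will have its own structure that you discover in the network tab.

```python
import json

# Hypothetical body copied from an XHR response in the browser's network tab.
raw = (
    '{"results": [{"name": "Widget", "price": "9.99"}, '
    '{"name": "Gadget", "price": "19.50"}], "total": 2}'
)


def parse_items(body):
    """Turn the JSON body into a list of (name, price) tuples."""
    payload = json.loads(body)
    return [
        (item["name"], float(item["price"]))
        for item in payload.get("results", [])
    ]


print(parse_items(raw))
```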

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.

A note here though: the DOM tree that is presented by the inspector isn’t always the same as the DOM tree represented by the HTML sent back by the website. It’s possible that the DOM you see in the inspector has been modified by Javascript, or sometimes even by the browser if it’s in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.
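Here is a rough sketch of the "find the wrapper class, then iterate over the items" idea. To keep it runnable without installing anything, it uses Python's standard-library `html.parser` instead of BeautifulSoup, which would make this considerably shorter. The class name `ItemCollector`, the `product` wrapper class and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collect the text inside every <div> whose class list contains wrapper_class."""

    def __init__(self, wrapper_class):
        super().__init__()
        self.wrapper_class = wrapper_class
        self.depth = 0   # > 0 while we are inside a matching wrapper <div>
        self.items = []  # one list of text fragments per wrapper

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested <div> inside a wrapper
        elif self.wrapper_class in classes:
            self.depth = 1
            self.items.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.items[-1].append(text)


# Hypothetical markup standing in for a downloaded page.
html = """
<div id="results">
  <div class="product featured"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><div class="meta"><span class="price">$19.50</span></div></div>
  <div class="ad">Buy now!</div>
</div>
"""

collector = ItemCollector("product")
collector.feed(html)
print(collector.items)
```

Each entry in `collector.items` corresponds to one wrapper element, so you can loop over it much as you would over the items in an API response.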

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the websites I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.
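A sketch of the headers-in-a-dictionary approach, using the standard-library `urllib.request` so that it runs without extra installs (with the Requests library you would pass the same dictionary as the `headers` argument). The header values and URL below are placeholders, not copied from any real session, and the request is only built here, not sent.

```python
from urllib.request import Request

# Placeholder values; in practice, copy these from your own browser's request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/40.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
    "Cookie": "session=PLACEHOLDER",
}

# Build (but do not send) a request that carries the browser-like headers.
req = Request("http://example.com/search?q=shoes", headers=headers)

# urllib normalizes header names to "Xxx-yyy" capitalization internally.
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen()` would actually send it.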

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.

Content Behind A Login

Sometimes you might need to create an account and log in to access the information you need. If you have a good HTTP library that handles logins and automatically sends session cookies (did I mention how awesome Requests is?), then you just need your scraper to log in before it gets to work.

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and doing all of your parsing with string splitting or, God forbid, a giant regex.
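As a last-resort sketch of that string-based fallback, the snippet below pulls prices out of deliberately broken markup with a small regular expression. The markup and the price pattern are invented for the example, and a pattern like this only works when the data you want has a very predictable shape.

```python
import re

# Hypothetical mangled markup: unquoted attributes and unclosed tags.
broken = "<td class=price>$9.99<td class=price>$19.50<b>sale</td>"

# Match a dollar sign followed by digits and an optional two-digit cents part.
prices = [float(p) for p in re.findall(r"\$(\d+(?:\.\d{2})?)", broken)]
print(prices)
```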

—

Well there’s 2000 words to get you started on web scraping. Hopefully I’ve convinced you that it’s actually a legitimate way of collecting data.

It’s a real hacker challenge to read through some HTML soup and look for patterns and structure in the markup in order to pull out the data you need. It usually doesn’t take much longer than reading some API docs and getting up to speed with a client. Plus it’s way more fun!

Source: https://blog.hartleybrody.com/web-scraping/