I am trying out Scalpel to scrape a website but got into an out of scope error using their own example code. That example is found on their github page, section My Scraping Target Doesn't Return The Markup I Expected.
I am using the ghc-8.6.4 Haskell compiler.
My packages.yaml dependencies are:
dependencies:
- base >= 4.7 && < 5
- http-conduit
- http-client
- http-client-tls
- http-types
- scalpel
The code:
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}
module Example where
import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP
-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
HTTP.managerModifyRequest = \req -> do
req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
return $ req' {
HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
: HTTP.requestHeaders req'
}
}
main = do
manager <- Just <$> HTTP.newManager managerSettings
html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
maybe printError printHtml html
where
url = "https://www.google.com"
printError = putStrLn "Failed"
printHtml = mapM_ putStrLn
As you can see from the code sample, the manager constant is sitting next to the def function. But it seems like it is hiding manager somehow... I can't put my finger on what's wrong.
The entire console output from the stack build command, which contains the reported error:
jroyer$ stack build
my-okr-haskeller-0.1.0.0: build (lib + exe)
Preprocessing library for my-okr-haskeller-0.1.0.0..
Building library for my-okr-haskeller-0.1.0.0..
[2 of 3] Compiling Example ( src/Example.hs, .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/Example.o )
/Users/jroyer/Projects/bizgithub/my-okr-haskeller/src/Example.hs:26:40: error: Not in scope: ‘manager’
|
26 | html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
| ^^^^^^^
-- While building package my-okr-haskeller-0.1.0.0 using:
/Users/jroyer/.stack/setup-exe-cache/x86_64-osx/Cabal-simple_mPHDZzAJ_2.4.0.1_ghc-8.6.4 --builddir=.stack-work/dist/x86_64-osx/Cabal-2.4.0.1 build lib:my-okr-haskeller exe:my-okr-haskeller-exe --ghc-options " -ddump-hi -ddump-to-file -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
EDIT: I can reproduce the asker's problem with an old version of scalpel, which the asker mentioned they were using:
[1 of 1] Compiling Example ( Main.hs, /var/folders/m7/_2kqsz4n4c3ck8050glq4ggr0000gn/T/cabal-repl.-26184/dist-newstyle/build/x86_64-osx/ghc-8.6.4/fake-package-0/x/script/build/script/script-tmp/Example.o )
Main.hs:34:40: error: Not in scope: ‘manager’
|
34 | html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
| ^^^^^^^
./so.hs 16.94s user 3.89s system 114% cpu 18.155 total
This is a suboptimal error message that seems to result from using named field puns and a variable that is not a field name. That is, Config in that version of scalpel does not have a manager field. We can reproduce this issue in a smaller example:
% cat test.hs
{-# LANGUAGE NamedFieldPuns #-}
data Foo = Foo { bar :: Int } deriving (Show)
main :: IO ()
main = print (Foo { zar})
where zar = 23 :: Int
% ghc test.hs
...snipt...
test.hs:4:21: error:
Not in scope: ‘zar’
Perhaps you meant ‘bar’ (line 3)
|
4 | main = print (Foo { zar})
The solution is thus to update to a newer version of scalpel.
html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
I have no idea what this is supposed to be. Specifically (def { manager }). That isn't any syntax I'm familiar with.
Where you have manager, there should be a field. For example:
def { someField = someValue }
not what you have of def { someValue } which makes no sense.
Ah, NamedFieldPuns. I've honestly never used them and looking at them I find myself perfering RecordWildCards. Moving on.
Looking at the haddocks, the field name is manager so you have a manager field and a manager value for the named field pun. I needed to add an import for def. At the same time I took the liberty of using cabal and a shebang to be explicit about all the packages:
#! /usr/bin/env cabal
{- cabal:
build-depends:
base >= 4
, scalpel == 0.6.0
, http-types == 0.12.3
, http-client-tls == 0.3.5.3
, http-client == 0.6.4
, data-default == 0.7.1.1
-}
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Default
import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP
-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
HTTP.managerModifyRequest = \req -> do
req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
return $ req' {
HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
: HTTP.requestHeaders req'
}
}
main = do
manager <- Just <$> HTTP.newManager managerSettings
html <- scrapeURLWithConfig (def { manager = manager }) url $ htmls anySelector
maybe printError printHtml html
where
url = "https://www.google.com"
printError = putStrLn "Failed"
printHtml = mapM_ putStrLn
Which seems to run well. Notice the module containing main should itself be Main.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With