I have a couple of ideas/projects that require getting details from a URL and displaying them with a nice UI "component". One such idea is having better links in the References section of each post. A native list is currently being used, but it would be nice to display the title of each link without compromising my writing experience.
Requirements
- Build a simple service to power the References on articles
- The frontend sends links
- The service makes a request to each link and returns the following:
  - Title
  - SEO image
  - Short description
  - Favicon URL
The title will be the only parameter used in the first iteration of this feature.
Thought process
I'll be trying a different approach this time around. Rather than spending time doing a lot of research, I'll come up with a quick solution first, then research areas of improvement. Here's a breakdown of a quick solution:
- Make a request to the specified endpoint
- Check for a `2xx` status code
- Parse the HTML document
- Return the parsed content to the client
Implementation
I need an endpoint to make a request to the specified URL and return the parsed content. For a start, I need these functions:
- `FetchPageDetails()`: An HTTP handler that initiates a request to the specified URL
- `parseHTML()`: An internal function that processes the result of the HTTP request
- `parseFaviconURL()`: Builds the full URL for the favicon if only the path is provided
- `isFullURL()`: Checks if a URL contains the host/domain name
```sh
mkdir bookmark-manager && cd bookmark-manager
go mod init github.com/odujokod/bookmark-manager
```
With the project in place, I created `main.go`, `bookmark_test.go` and `bookmark.go` in the root directory.
Fetching the HTML
To validate my thought process, I wrote the test to fetch a page given a URL, checking to see if a `200` response is returned. Then I'm able to implement the feature to make the test pass:
```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestFetchPage(t *testing.T) {
	externalURL := "https://google.com"
	path := "/fetch"
	url := fmt.Sprintf("%s?url=%s", path, externalURL)
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	res := httptest.NewRecorder()

	FetchPageDetails(res, req)

	expected := 200
	got := res.Result().StatusCode

	if got != expected {
		t.Errorf("Expected: %d, got %d\n", expected, got)
	}
}
```
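The post moves on before showing this first version of the handler, so here's a minimal sketch of a `FetchPageDetails` that would make the test pass. The query-parameter validation and the error responses are my assumptions, not the author's exact code:

```go
// A minimal first pass: validate the query parameter, fetch the page,
// and check for a 2xx status, per the quick-solution breakdown above.
func FetchPageDetails(w http.ResponseWriter, r *http.Request) {
	pageURL := r.URL.Query().Get("url")
	if pageURL == "" {
		http.Error(w, "missing url query parameter", http.StatusBadRequest)
		return
	}

	res, err := http.Get(pageURL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer res.Body.Close()

	// Check for a 2xx status code before doing any further work.
	if res.StatusCode < 200 || res.StatusCode >= 300 {
		http.Error(w, "upstream returned a non-2xx status", http.StatusBadGateway)
		return
	}

	w.WriteHeader(http.StatusOK)
}
```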
Parsing the HTML
With the page now being fetched, I need to get the necessary details for the frontend. From the requirements, the necessary details can be found in the `<head>` tag. This makes parsing slightly easier. Over to the test:
```go
func TestParseHTML(t *testing.T) {
	// I can actually read this from the sample.html file
	sampleHTML := `<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<meta name="description" content="Description goes here">
	<meta name="og:title" content="Go test">
	<meta name="og:description" content="Description goes here">
	<meta name="og:image" content="https://cdn1.iconfinder.com/data/icons/google-s-logo/150/Google_Icons-09-1024.png">
	<title>Go test</title>
</head>
<body>
	<div>
		Hello world
	</div>
</body>
</html>`

	htmlBytes := []byte(sampleHTML)

	got, err := ParseHTML(htmlBytes)
	if err != nil {
		t.Errorf("Unable to parse HTML: %v", err)
	}

	expectedTitle := "Go test"

	if got.Title != expectedTitle {
		t.Errorf("Expected: %s, got: %s", expectedTitle, got.Title)
	}
}
```
The test gives an insight into the implementation of the feature. I'll need an HTML parser that lets me walk through the HTML tree with ease. I found GoQuery, a library built on top of the `net/html` package, to handle the HTML parsing:
```sh
go get github.com/PuerkitoBio/goquery
```
With GoQuery installed, I can now implement the parsing logic:
```go
import (
	// other imports
	"bytes"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

type Bookmark struct {
	Title       string `json:"title"`
	Description string `json:"description"`
	FaviconURL  string `json:"faviconURL"`
	ImageURL    string `json:"imageURL"`
}

func ParseHTML(html []byte) (Bookmark, error) {
	doc, err := goquery.NewDocumentFromReader(bytes.NewBuffer(html))
	if err != nil {
		return Bookmark{}, err
	}

	bookmark := Bookmark{}

	title := strings.Trim(doc.Find("title").Text(), "\n ")
	bookmark.Title = title

	doc.Find("meta").Each(func(i int, s *goquery.Selection) {
		c, _ := s.Attr("name")
		value, _ := s.Attr("content")

		switch c {
		case "description", "og:description":
			bookmark.Description = value
		case "og:image":
			bookmark.ImageURL = value
		default:
		}
	})

	return bookmark, nil
}
```
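One caveat worth flagging (my observation, not from the original post): the sample document declares its Open Graph tags with the `name` attribute, but the Open Graph protocol specifies `property` (e.g. `<meta property="og:title" ...>`), so many real pages won't match this switch. Inside the `Each` callback, a more forgiving version could fall back between the two:

```go
// Hypothetical tweak for the first line of the Each callback: prefer
// `property` (what the Open Graph protocol specifies) and fall back to
// `name`, which the sample HTML above uses.
c, _ := s.Attr("name")
if prop, exists := s.Attr("property"); exists {
	c = prop
}
```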
Handling favicons
I considered using the favicon for the frontend component, so I decided to extend the response. Favicons can be specified with a fully qualified URL or a resource path. It would be easier to have a single representation for it. To do this, I need to check if the URL is a resource path or not. For a resource path, I simply append it to the main URL:
```go
func TestIsFullURL(t *testing.T) {
	cases := []struct {
		input    string
		expected bool
	}{
		{
			input:    "https://static.ietf.org/dt/12.31.0/ietf/images/ietf-logo-nor-16.png",
			expected: true,
		},
		{
			input:    "/favicon.svg",
			expected: false,
		},
		{
			input:    "/favicon.ico",
			expected: false,
		},
	}

	for _, c := range cases {
		got, err := IsFullURL(c.input)
		if err != nil {
			t.Errorf("Error checking path: %v", err)
		}

		if got != c.expected {
			t.Errorf("Expected: %v, got: %v", c.expected, got)
		}
	}
}
```
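The post doesn't show the implementation of these helpers, so here's a sketch built on the standard `net/url` package. `IsFullURL`'s signature comes from the test above; `parseFaviconURL` and its signature are my guesses:

```go
import "net/url"

// IsFullURL reports whether the string is an absolute URL, i.e. it
// carries both a scheme and a host rather than being a bare path.
func IsFullURL(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return u.Scheme != "" && u.Host != "", nil
}

// parseFaviconURL resolves a favicon reference against the page URL when
// only a resource path (e.g. "/favicon.ico") is provided.
func parseFaviconURL(pageURL, favicon string) (string, error) {
	full, err := IsFullURL(favicon)
	if err != nil {
		return "", err
	}
	if full {
		return favicon, nil
	}

	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(favicon)
	if err != nil {
		return "", err
	}
	// Resolve the path against the page's origin, e.g.
	// "https://example.com/post" + "/favicon.ico".
	return base.ResolveReference(ref).String(), nil
}
```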
Refactoring
With the parsing logic in place, I can now refactor the fetch test and finalise the function implementation:
```go
package main

// import block...

func TestFetchPage(t *testing.T) {
	externalURL := "https://google.com"
	path := "/fetch"
	url := fmt.Sprintf("%s?url=%s", path, externalURL)
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	res := httptest.NewRecorder()

	FetchPageDetails(res, req)

	t.Run("return 200", func(t *testing.T) {
		expected := 200
		got := res.Result().StatusCode

		if got != expected {
			t.Errorf("Expected: %d, got %d\n", expected, got)
		}
	})

	t.Run("confirm Title meta", func(t *testing.T) {
		var bookmark Bookmark
		err := json.NewDecoder(res.Body).Decode(&bookmark)
		if err != nil {
			t.Error("Unable to parse response")
		}

		expected := "Google"
		got := bookmark.Title

		if got != expected {
			t.Errorf("Expected: %s, got: %s", expected, got)
		}
	})
}
```
Router
A router can now be created to provide access to the client. In the `main()` function of the `main.go` file, I created and configured the server multiplexer:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

const PORT string = ":8081"

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /fetch", FetchPageDetails)

	fmt.Printf("Server running on port: %s\n", PORT)
	log.Fatal(http.ListenAndServe(PORT, mux))
}
```
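With the server running, the endpoint can be exercised from the command line; the field values will of course depend on the page being fetched:

```sh
curl "http://localhost:8081/fetch?url=https://google.com"
# => {"title":"Google","description":"...","faviconURL":"...","imageURL":"..."}
```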
Usage
This site is built with Astro, and Markdoc is used to manage content. Without going out of the scope of this article, using the API is a three-step process:
- I built a `Bookmark` component in Astro
- I added the `.astro` component to the Markdoc configuration
- In the References section, I wrapped the native list with the Markdoc/Astro component:

```
{% bookmark type="default" %}
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies
- https://datatracker.ietf.org/doc/html/rfc6265
{% /bookmark %}
```
The References section below is the outcome of the first phase of this feature.
Going forward
- How do I handle pages that have anti-bot protection?
- How should I handle a missing `og:image`?
- Where should I deploy? Coolify? Or a general cloud provider?
- How should storage be handled? DB, cache, or both?
- I should use goroutines to manage simultaneous requests from the client (a rough sketch follows this list)
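On that last point, here's a rough sketch of what concurrent fetching could look like, assuming a hypothetical batch endpoint that receives several URLs at once. `fetchAll` and its signature are mine, not part of the service yet:

```go
import (
	"io"
	"net/http"
	"sync"
)

// fetchAll is a hypothetical batch helper: each URL is fetched in its
// own goroutine and the parsed results are collected under a mutex.
func fetchAll(urls []string) []Bookmark {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		bookmarks []Bookmark
	)

	for _, u := range urls {
		wg.Add(1)
		go func(pageURL string) {
			defer wg.Done()

			res, err := http.Get(pageURL)
			if err != nil {
				return // a real version would collect errors too
			}
			defer res.Body.Close()

			body, err := io.ReadAll(res.Body)
			if err != nil {
				return
			}

			if bookmark, err := ParseHTML(body); err == nil {
				mu.Lock()
				bookmarks = append(bookmarks, bookmark)
				mu.Unlock()
			}
		}(u)
	}

	wg.Wait()
	return bookmarks
}
```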