Guide to Using Blossom Search Services



You control which Web sites and files are included in your index through the search configuration page. Each index has its own configuration page, accessible by going to https://BlossomSoftware.net and logging in using the "Client Login" form. You can also turn indexing on and off within individual pages by inserting special comments in the page. For example, you might want to exclude the navigation words that appear in your page header and footer.

Controlling the types of documents indexed

You can choose to include documents on your site based on their file type. The types are HTML, PDF, word processing, and queries. Word processing files are .doc files, such as those from Microsoft Word. A query is a URL that contains a question mark; a query runs a command on your Web server. When queries are enabled, the Blossom spider will follow links that contain queries. Otherwise it will ignore those links.

Controlling the directories and files indexed

There are two lists associated with each index, the include list and the exclude list. The entries in both lists are case insensitive.

*Include List.* The include list contains the starting URLs for the Blossom spider. Usually this is just the home page for a site. The spider will automatically include any subdirectories; you don't need to specify them explicitly. Multi-site indexes have an entry in the include list for each site. The entry doesn't have to be the home page. By placing the path to a subdirectory in the include list, you can put just part of a site in an index.
For example, here is an include list for an index that includes "mysite.com" and the "public" portion of "yoursite.com":

    https://mysite.com
    https://yoursite.com/public/index.html

Normally, when a URL is added to the include list, the URL is also added to the queue of URLs for the spider to visit. Prefixing a URL with > adds the URL to the include list without adding it to the queue. This is useful if you'd like an index to include only those documents from a site that are linked from another site. For example, an index with the include list

    https://mysite.com
    >https://yoursite.com

will include just the documents from "yoursite.com" that have been linked from "mysite.com". (Actually, the index will include the linked documents plus any documents on "yoursite.com" that can be reached from the linked documents. If you want just the linked documents, use $ instead of > in the include list.)

You can also create an index with only the linked documents, but not the site containing the links. This include list

    !https://mysite.com/links.html
    https://yoursite.com/public/

reads the links in "mysite.com/links.html" but doesn't add "mysite.com" to the list of included directories. As a result, only documents in "yoursite.com/public" will be added to the index.

*Exclude List.* The exclude list, as you probably suspect, contains paths to exclude from the index. When you place the path of a subdirectory in the exclude list, all files in that directory and all subdirectories of that directory will be excluded. The path is treated as a prefix string. All URLs that match the path, once their "http://" or "https://" has been removed, will be excluded. The path can include the wildcard * to match any characters. It can also include the character $ to match the end of a URL.
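As a rough illustration of the exclude-list matching rules described above, here is a minimal Python sketch. It is not Blossom's implementation. The function names (strip_protocol, compile_pattern, is_excluded) are hypothetical, and the sketch assumes that a path ending in $ is matched against the end of the URL while any other path is matched as a prefix; the guide does not spell out how the two rules interact.

```python
import re

def strip_protocol(url: str) -> str:
    """Lowercase the URL (entries are case insensitive) and drop http:// or https://."""
    return re.sub(r"^https?://", "", url.strip().lower())

def compile_pattern(path: str) -> re.Pattern:
    """Turn one exclude-list path into a regular expression.

    Assumption: '*' matches any run of characters, a trailing '$' anchors the
    path at the end of the URL, and any other path is treated as a prefix.
    """
    path = path.strip().lower()
    anchored_end = path.endswith("$")
    if anchored_end:
        path = path[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in path.split("*"))
    if anchored_end:
        return re.compile(body + r"\Z")
    return re.compile(r"\A" + body)

def is_excluded(url: str, exclude_list: list[str]) -> bool:
    """Return True if any exclude-list path matches the protocol-stripped URL."""
    target = strip_protocol(url)
    return any(compile_pattern(path).search(target) for path in exclude_list)

if __name__ == "__main__":
    # Illustrative entries only; the wildcard entry is an assumed form.
    excludes = ["mysite.com/private", "yoursite.com/*/private", ".js$"]
    for candidate in (
        "https://mysite.com/private/report.html",
        "http://yoursite.com/docs/private/a.html",
        "https://mysite.com/scripts/app.js",
        "https://mysite.com/index.html",
    ):
        print(candidate, is_excluded(candidate, excludes))
```

Running a few candidate URLs through a sketch like this before editing the real exclude list can help catch a path that is broader than intended.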
For example, the exclude list

    mysite.com/private
    yoursite.com/*/private
    .js$

will exclude all URLs that begin with "mysite.com/private" after the protocol ("http://" or "https://") is removed, all URLs that include "yoursite.com/" and "/private", and all URLs that end with ".js".

*Inline Exclusion.* You can also exclude specific files by telling the indexer not to follow links to the files. Do this by surrounding the links in the HTML with the special comments:

    <!--Blossom:nofollow-->
    ... HTML code ...
    <!--Blossom:follow-->

Any links in the HTML code between the comments are ignored.

Using sitemaps to specify dynamic links

On dynamic websites, some links may be generated programmatically rather than specified directly in HTML. While the Blossom spider does search JavaScript for URLs embedded in strings, it does not execute JavaScript. As a result, URLs generated dynamically by string operations can be overlooked.

You can use a sitemap to guide the spider's traversal of a site. A sitemap is an XML file that lists the URLs on a website. (See https://www.sitemaps.org/ for sitemap syntax.) By default, the spider will attempt to read the file sitemap.xml in the root directory of a website. (You can turn off automatic reading of sitemaps from the Search Console.) The sitemap doesn't have to contain all the URLs on a site, only those URLs that are generated dynamically. Other URLs can be picked up in the standard way by including the site's home page. Here's an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.mysite.com/dynamic_url_1.html</loc>
      </url>
      <url>
        <loc>http://www.mysite.com/dynamic_url_2.html</loc>
      </url>
    </urlset>

Excluding Words in Headers and Footers

Sometimes, finer control over indexing is needed. Suppose your pages all use a page footer that contains navigation links to various subsections of your site.
Anyone searching for one of those words would find a match on every page, which is not very useful. Using special comments, you can exclude portions of a page from being indexed. Two commands are understood, one to turn indexing off and one to turn it back on:

    <!--Blossom:noindex-->
    ... Text ...
    <!--Blossom:index-->

The text between the comments is not indexed.

Controlling the Order of Pages in the Search Results

In search results, pages are listed according to their computed score. The score for a page is determined by:

  - How many times the search term appears on the page;
  - Where on the page the term occurs;
  - How closely the terms appear to one another;
  - How old the page is;
  - How long the page is;
  - The file type of the page; and
  - The weight assigned to the page.

Here are some ways you can influence page score. (Please note that changes made in the text of a Web page won't be reflected in the search results until the next time the index is updated.)

Use descriptive titles. A search-term match in a title counts more than any other kind of match. For example:

    <TITLE>This is a page title</TITLE>

Use the Description and Keywords meta tags. Matches against meta tags also count heavily. The description will appear in the search results output, so it must be written clearly. However, the keyword list is never output, so you can list all the terms that apply to a page. Here is an example of a Keywords meta tag that would raise the score of a page when either "price" or "cost" is the search term:

    <META NAME="Keywords" CONTENT="price,cost">

Set Key Phrases. The search engine interprets the "Keyphrases" meta tag as synonyms for the page title. That is, matches on key phrases count as heavily as matches against the title. The difference is that key phrases do not appear in either the search output or a browser title bar. Use key phrases to direct popular searches to particular pages. Here is an example:

    <META NAME="Keyphrases" CONTENT="price list, price schedule">

Set page weight.
You can influence the score of a page by explicitly assigning a weight to the page. A positive weight moves a page towards the top of the results list; a negative weight moves a page down the list. The assigned weight is added to the other scoring factors to determine a page's score. Weights are specified by inserting the Blossom "Weight" command somewhere on the page. For example, the following command would give a page a weight of 30 (by default, pages have a weight of zero):

    <!--Blossom:Weight=30-->

When setting weights for a page, it can be useful to see page scores in the search results list. Appending /score to the end of the query URL will display page scores along with the search results, for example:

    https://searchBlossom.com/query/ID/score

Lower the weight of PDF files. There is no way to set the weight explicitly for PDF files, but you can lower the weight of all PDF files and thus force them to appear lower in the search results. Append the /pdf0 option to the query URL to push PDF files lower in the results.
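The query URL options mentioned above (/score and /pdf0) are plain path suffixes, so they can be appended with a few lines of code. The following Python sketch shows one way to do that. The helper name with_options is hypothetical, ID is the guide's own placeholder for an index identifier, and chaining more than one option in a single URL is an assumption that the guide does not confirm.

```python
def with_options(query_url: str, *options: str) -> str:
    """Append each option to the query URL as a trailing path segment."""
    url = query_url.rstrip("/")
    for option in options:
        # Accept "score" as well as "/score" so callers need not worry about slashes.
        url += "/" + option.strip("/")
    return url

if __name__ == "__main__":
    base = "https://searchBlossom.com/query/ID"  # ID is a placeholder, as in the guide
    print(with_options(base, "score"))
    # https://searchBlossom.com/query/ID/score
    print(with_options(base, "/pdf0"))
    # https://searchBlossom.com/query/ID/pdf0
```

Keeping the suffix handling in one helper makes it easy to switch the score display on while tuning page weights and off again afterwards.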